
Rhoda AI is hiring: Machine Learning Engineer - Training Platform in Palo Alto
Rhoda AI, Palo Alto, CA, United States
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in R&D, hardware development, and manufacturing scale-up to make that a reality.
We're looking for a Staff / Principal ML Engineer to build and own our training platform — the system that makes large-scale training reliable, reproducible, and easy to run. You will define how training jobs are launched, tracked, recovered, and debugged across the cluster. Your work ensures that researchers can move fast without fighting infrastructure.
This role sits at the core of research velocity: when training fails, you make it recover automatically; when experiments are hard to reproduce, you fix the system; when GPU-hours are wasted, you make that waste visible and preventable.
What You'll Do
Own the training job lifecycle
Design and build systems for job launch and configuration, monitoring and state tracking, automatic retry and resume, and failure handling and recovery
Define clean, scalable interfaces for running distributed training: CLI / SDK / config systems and standardized launch templates across model families
Build robust checkpointing and recovery systems
Develop checkpointing systems that are reliable (no silent corruption or mismatch), efficient (fast save/load at scale), and flexible (support sharded and distributed models)
Enable seamless resume from failures, partial recovery (e.g., node/rank failures), and consistent state across distributed jobs
Make training reproducible and debuggable
Build systems for experiment configuration and versioning; tracking of training state, metrics, and lineage; and reproducible "golden runs" and configs
Ensure runs can be reliably reproduced and differences between runs are explainable
Make performance and failures observable
Create unified visibility into per-job behavior (failures, slowdowns, anomalies) and fleet-wide trends (GPU utilization, failure modes, wasted compute)
Partner with training systems engineers to surface step-time breakdowns, resource inefficiencies, and failure patterns across jobs
Reduce operational burden on researchers
Eliminate manual debugging and babysitting of training jobs
Provide clean abstractions so researchers don't need to think about cluster quirks, retry logic, or distributed setup details
Goal: make large-scale training feel simple and reliable
Collaborate with infra / SRE on cluster reliability
Work with infrastructure teams to reduce GPU waste from node failures, network instability, checkpointing/storage bottlenecks, and scheduler placement issues
What We're Looking For
Strong experience building distributed systems or ML infrastructure
Experience with large-scale training environments (preferred but not required)
Hands-on experience with modern ML stacks (e.g., PyTorch; JAX a plus)
Solid understanding of distributed systems fundamentals (fault tolerance, state management, retries), training workflows and failure modes, and checkpointing and data consistency challenges
Strong product / systems instincts — you build tools people actually want to use and simplify complex workflows into clean abstractions
High ownership mindset and comfort in a fast-moving environment
Nice to Have (But Not Required)
Experience with checkpointing for large distributed models (FSDP / ZeRO / sharded states)
Experience with cluster schedulers (Slurm, Kubernetes, Ray, etc.)
Experience building experiment tracking or ML observability systems
Familiarity with large-scale storage systems and I/O bottlenecks
Why This Role
Own the reliability layer that every training run in the company depends on — your systems are the foundation that research velocity is built on
Direct impact on developer experience and research throughput at a company building real-world embodied intelligence, not toy ML pipelines
High ownership in a small, elite team where your infrastructure decisions compound across every model the research team trains