
Rhoda AI is hiring: Machine Learning Engineer - Training Platform in Palo Alto
Rhoda AI, Palo Alto, CA, United States
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in R&D, hardware development, and manufacturing scale-up to make that a reality.
We're looking for a Staff / Principal ML Engineer to build and own our training platform — the system that makes large-scale training reliable, reproducible, and easy to run. You will define how training jobs are launched, tracked, recovered, and debugged across the cluster. Your work ensures that researchers can move fast without fighting infrastructure.
This role sits at the core of research velocity: when training fails, you make it recover automatically; when experiments are hard to reproduce, you fix the system; when GPU-hours are wasted, you make that waste visible and preventable.
What You'll Do
Own the training job lifecycle
Design and build systems for job launch and configuration, monitoring and state tracking, automatic retry and resume, and failure handling and recovery
Define clean, scalable interfaces for running distributed training: CLI / SDK / config systems and standardized launch templates across model families
Build robust checkpointing and recovery systems
Develop checkpointing systems that are reliable (no silent corruption or mismatch), efficient (fast save/load at scale), and flexible (support sharded and distributed models)
Enable seamless resume from failures, partial recovery (e.g., node/rank failures), and consistent state across distributed jobs
Make training reproducible and debuggable
Build systems for experiment configuration and versioning; tracking of training state, metrics, and lineage; and reproducible "golden runs" and configs
Ensure runs can be reliably reproduced and differences between runs are explainable
Make performance and failures observable
Create unified visibility into per-job behavior (failures, slowdowns, anomalies) and fleet-wide trends (GPU utilization, failure modes, wasted compute)
Partner with training systems engineers to surface step-time breakdowns, resource inefficiencies, and failure patterns across jobs
Reduce operational burden on researchers
Eliminate manual debugging and babysitting of training jobs
Provide clean abstractions so researchers don't need to think about cluster quirks, retry logic, or distributed setup details
Goal: make large-scale training feel simple and reliable
Collaborate with infra / SRE on cluster reliability
Work with infrastructure teams to reduce GPU waste from node failures, network instability, checkpointing/storage bottlenecks, and scheduler placement issues
What We're Looking For
Strong experience building distributed systems or ML infrastructure
Experience with large-scale training environments (preferred but not required)
Hands-on experience with modern ML stacks (e.g., PyTorch; JAX a plus)
Solid understanding of distributed systems fundamentals (fault tolerance, state management, retries), training workflows and failure modes, and checkpointing and data consistency challenges
Strong product / systems instincts — you build tools people actually want to use and simplify complex workflows into clean abstractions
High ownership mindset and comfort in a fast-moving environment
Nice to Have (But Not Required)
Experience with checkpointing for large distributed models (FSDP / ZeRO / sharded states)
Experience with cluster schedulers (Slurm, Kubernetes, Ray, etc.)
Experience building experiment tracking or ML observability systems
Familiarity with large-scale storage systems and I/O bottlenecks
Why This Role
Own the reliability layer that every training run in the company depends on — your systems are the foundation that research velocity is built on
Direct impact on developer experience and research throughput at a company building real-world embodied intelligence, not toy ML pipelines
High ownership in a small, elite team where your infrastructure decisions compound across every model the research team trains