
Member of Technical Staff - Post Training (Sunnyvale)

Cerebro, Sunnyvale, CA, United States


We’re building the post-training and evaluation layer for AI agents.

As models gain the ability to use tools, browse, and operate software, a new problem is emerging: they don’t reliably work. They fail mid-task, behave inconsistently, and are extremely hard to evaluate or improve in real-world environments.

We’re solving this by building:
- High-fidelity, resettable environments for training agents
- Evaluation systems that can automatically grade behaviour
- Training loops that continuously improve real-world performance

Our systems are already used by leading AI teams to train and evaluate agentic models.
The Role

We’re hiring a Founding MTS (Post-Training / Applied ML) to build and scale systems that make models actually useful in production.

This is not a pure research role.

You’ll focus on:
- Taking models that “kind of work” and making them reliable, measurable, and deployable
- Designing and running training and evaluation loops at scale
- Rapid experimentation on real-world tasks and environments

You’ll operate in the space between:
- ML engineering
- Post-training / RL
- Agent systems

What You’ll Work On
- Building post-training pipelines for agent behaviour:
  - Fine-tuning, RL, dataset iteration
  - Improving multi-step task completion
- Designing evaluation systems that reflect real-world success: LLM-as-judge, programmatic evals, hybrid approaches
- Running tight experiment loops:
  - Identify failure modes
  - Generate data
  - Retrain
  - Measure improvements
- Improving:
  - Reliability across long-horizon tasks
  - Tool use and environment interaction
  - Consistency and robustness
- Shipping systems that are used daily to train real models

Who This Is For
We’re looking for applied builders, not pure researchers.

Strong candidates often come from:
- Applied ML / post-training teams at frontier labs
- Early-stage AI startups working on agents or LLM products
- Infra teams working on evaluation, fine-tuning, or deployment

You might have:
- Experience with:
  - LLM fine-tuning / RLHF / RLAIF
  - Agent systems or tool use
  - Evaluation frameworks or benchmarking
- A track record of:
  - Shipping ML systems into production
  - Running fast, iterative experiments
- Comfort working in messy, real-world problem spaces
- Strong engineering instincts alongside ML knowledge

What Makes This Role Different
- Applied, not academic: you’re judged on whether the system works, not on papers
- Tight feedback loops: you’ll see the impact of your work immediately
- Real-world complexity: not toy benchmarks, but messy, dynamic environments
- High ownership: you’ll define core systems and how they evolve
- Upstream of the ecosystem: your work improves how entire teams train agents

Why This Matters
The biggest gap in AI right now isn’t pretraining; it’s post-training and reliability.
- Models can generate actions
- But they can’t consistently complete tasks
- And we don’t have good ways to measure or fix that

We’re building the systems that close that gap.

If we succeed, this unlocks:
- Reliable AI agents
- End-to-end automation
- Production-grade AI systems