
Member of Technical Staff - Post Training (Sonoma)
Cerebro, Sonoma, CA, United States
We’re building the post-training and evaluation layer for AI agents.
As models gain the ability to use tools, browse, and operate software, a new problem is emerging: they don’t reliably work. They fail mid-task, behave inconsistently, and are extremely hard to evaluate or improve in real-world environments.
We’re solving this by building:
- High-fidelity, resettable environments for training agents (a minimal interface sketch follows this list)
- Evaluation systems that can automatically grade behaviour
- Training loops that continuously improve real-world performance
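The posting doesn’t describe Cerebro’s actual interfaces, but a minimal sketch helps make “high-fidelity, resettable environment” concrete: every rollout starts from the same snapshot, so graded runs are comparable and retraining loops stay reproducible. The Python below is illustrative only; the class and field names (ResettableEnv, Task, StepResult) are assumptions, not a real API.

```python
# Illustrative sketch of a resettable agent environment (assumed names, not Cerebro's API).
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    prompt: str                    # natural-language instruction shown to the agent
    initial_state: dict[str, Any]  # snapshot restored on every reset

@dataclass
class StepResult:
    observation: str               # what the agent sees after acting
    done: bool                     # True once the episode ends
    info: dict[str, Any] = field(default_factory=dict)

class ResettableEnv:
    """Snapshot-based environment: reset() always restores the same initial state,
    which is what makes automated grading and repeated training rollouts possible."""

    def __init__(self, task: Task):
        self.task = task
        self.state: dict[str, Any] = {}

    def reset(self) -> str:
        # Restore the exact initial snapshot so repeated rollouts are comparable.
        self.state = dict(self.task.initial_state)
        return self.task.prompt

    def step(self, action: str) -> StepResult:
        # Apply the agent's action; a real environment would drive a browser,
        # shell, or application here instead of mutating a dict.
        self.state.setdefault("actions", []).append(action)
        done = action.strip().lower() == "done"
        return StepResult(observation=f"executed: {action}", done=done)
```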
Our systems are already used by leading AI teams to train and evaluate agentic models.
The Role
We’re hiring a Founding MTS (Post-Training / Applied ML) to build and scale systems that make models actually useful in production.
This is not a pure research role.
You’ll focus on:
- Taking models that “kind of work” and making them reliable, measurable, and deployable
- Designing and running training and evaluation loops at scale
- Rapid experimentation on real-world tasks and environments
You’ll operate in the space between:
- ML engineering
- Post-training / RL
- Agent systems
What You’ll Work On
- Building post-training pipelines for agent behaviour
  - Fine-tuning, RL, dataset iteration
  - Improving multi-step task completion
- Designing evaluation systems that reflect real-world success: LLM-as-judge, programmatic evals, and hybrid approaches (see the sketch after this list)
- Running tight experiment loops: identify failure modes → generate data → retrain → measure improvements
- Improving:
  - Reliability across long-horizon tasks
  - Tool use and environment interaction
  - Consistency and robustness
- Shipping systems that are used daily to train real models
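To make the evaluation and experiment-loop items above concrete, here is a hedged sketch of how a hybrid grader (programmatic checks plus an LLM-as-judge) can feed an identify-failures → generate-data → retrain → measure loop. Every name in it (llm_judge, run_agent, generate_training_examples, fine_tune) is a placeholder for whatever stack is actually used, not a description of Cerebro’s internals.

```python
# Illustrative hybrid-eval and experiment-loop sketch (placeholder callables throughout).
from typing import Callable

def programmatic_check(transcript: str, expected_artifact: str) -> bool:
    # Cheap, deterministic check: did the run produce the artifact we expected?
    return expected_artifact in transcript

def hybrid_grade(transcript: str, expected_artifact: str,
                 llm_judge: Callable[[str], float]) -> float:
    # Programmatic checks gate the score; the LLM judge scores the fuzzier parts
    # (instruction-following, intermediate steps) on a 0-1 scale.
    if not programmatic_check(transcript, expected_artifact):
        return 0.0
    return llm_judge(transcript)

def experiment_loop(tasks: list[dict], run_agent: Callable[[dict], str],
                    llm_judge: Callable[[str], float],
                    generate_training_examples: Callable[[list[dict]], list[dict]],
                    fine_tune: Callable[[list[dict]], None],
                    pass_threshold: float = 0.7, rounds: int = 3) -> list[float]:
    """One tight loop: evaluate -> collect failures -> generate data -> retrain -> re-measure."""
    pass_rates = []
    for _ in range(rounds):
        scores = {t["id"]: hybrid_grade(run_agent(t), t["expected_artifact"], llm_judge)
                  for t in tasks}
        failures = [t for t in tasks if scores[t["id"]] < pass_threshold]
        pass_rates.append(1 - len(failures) / len(tasks))
        if not failures:
            break
        # Turn observed failure modes into new training data, then retrain on it.
        fine_tune(generate_training_examples(failures))
    return pass_rates
```

In practice the hard engineering lives inside the placeholders: programmatic checks that don’t mislabel, a judge prompt that actually correlates with real-world success, and data generation that targets the observed failure modes rather than generic examples.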
Who This Is For
We’re looking for applied builders, not pure researchers.
Strong candidates often come from:
- Applied ML / post-training teams at frontier labs
- Early-stage AI startups working on agents or LLM products
- Infra teams working on evaluation, fine-tuning, or deployment
You might have:
- Experience with:
  - LLM fine-tuning / RLHF / RLAIF
  - Agent systems or tool use
  - Evaluation frameworks or benchmarking
- A track record of:
  - Shipping ML systems into production
  - Running fast, iterative experiments
- Comfort working in messy, real-world problem spaces
- Strong engineering instincts alongside ML knowledge
What Makes This Role Different
- Applied, not academic: you’re judged on whether the system works, not on papers
- Tight feedback loops: you’ll see the impact of your work immediately
- Real-world complexity: not toy benchmarks, but messy, dynamic environments
- High ownership: you’ll define core systems and how they evolve
- Upstream of the ecosystem: your work improves how entire teams train agents
Why This Matters
The biggest gap in AI right now isn’t pretraining; it’s post-training and reliability. Models can generate actions, but they can’t consistently complete tasks, and we don’t have good ways to measure or fix that.
We’re building the systems that close that gap.
If successful, this unlocks:
- Reliable AI agents
- End-to-end automation
- Production-grade AI systems