Senior Production Engineer

Slope · Washington, DC, USA · 1 month ago

Pay: $191,000-$287,000/yr
Job type: Contract

Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting‑edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.

ABOUT THE TEAM
The SRE team owns reliability and infrastructure for Anduril's cloud deployments. We operate Kubernetes clusters, Terraform infrastructure, and observability platforms across 10+ production environments supporting active defense contracts. When platform services break under real operational load, we're the team that fixes them, often at the code level, not just the config level.

ABOUT THE JOB
We are looking for a Senior Production Engineer to join our team in Costa Mesa, CA (or DC). In this role, you will be responsible for diagnosing and fixing stability vulnerabilities in core platform services that cause cascading failures in multi‑tenant cloud deployments. You will write production Go to implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code. This will require deep experience with distributed systems, debugging complex failure modes across service boundaries, and writing production-quality Go. If you are someone who thrives on fixing hard reliability problems in live systems rather than building greenfield, this role is for you.

WHAT YOU'LL DO

Diagnose and fix stability vulnerabilities in core platform services that cause cascading failures under multi‑replica, multi‑tenant operation

Implement resilience patterns (leader election, circuit breakers, failure domain isolation) directly in service code

Design multi‑replica support for services that currently assume single‑instance operation

Collaborate with service owners on contract testing and upgrade validation

Trace cascading failures across service boundaries and drive them to root‑cause fixes

Contribute to observability platform improvements to support service stability

Light infrastructure work: Terraform/Kubernetes changes to support service fixes (~20% of time)

REQUIRED QUALIFICATIONS

Production‑quality Go — you'll be modifying core platform services, not writing scripts

Practical experience with distributed systems: leader election, consensus, replication, failure modes

Kubernetes — enough to understand how services run (not necessarily cluster administration)

Debugging complex systems — tracing cascading failures across service boundaries

4+ years in SRE, platform engineering, or backend development roles

Must be a U.S. Person due to required access to U.S. export controlled information or facilities

NICE-TO-HAVE QUALIFICATIONS

Rust (some platform services use it)

Experience fixing reliability problems in production services (not just building greenfield)

Familiarity with gRPC service architectures

HashiCorp Consul or similar service discovery/mesh

FedRAMP/IL5 compliance environment experience

ArgoCD / GitOps workflows

US Salary Range: $191,000 — $287,000 USD.

Benefits
At Anduril, we invest in our people. Our comprehensive, competitive benefits package (available at little to no cost to employees) ensures you’re supported in health, recovery, and whatever comes next.
