Staff+ Production Engineer

Sanas · Palo Alto, CA, USA · 1 month ago

Pay: 80.000 - 100.000
Job type: Full Time

About the Role
We are seeking a Production Engineer to take complete, end-to-end ownership of the infrastructure that builds, deploys, and operates our high-scale, real-time speech AI platform globally. This role involves architecting and delivering deployment infrastructure across public and corporate private clouds (AWS, Azure, and GCP). You will champion the entire production lifecycle, ensuring operational excellence across internal development workflows, deployment pipelines, world-class observability (telemetry, monitoring, alerting), and robust on-call systems for highly available, globally distributed production environments.

Job Description

Design and implement mission-critical, multi-region deployment infrastructure on AWS to ensure exceptional scalability and fault tolerance.

Lead the architecture for deploying the platform into corporate private cloud environments across diverse architectures (AWS, Azure, GCP).

Drive developer velocity by engineering and maintaining rapid, high-confidence development, testing, and staging deployment workflows.

Establish and maintain a cohesive, state-of-the-art telemetry, monitoring, and alerting ecosystem to achieve deep operational observability and industry-leading system uptime.

Standardize and orchestrate world-class on-call documentation, incident response processes, and post-mortem procedures.

Proactively build robust self-service tools, systems, and comprehensive documentation that empowers engineering teams to manage and scale their own services.

Qualifications

5+ years of deep, hands-on expertise as a Platform/Production engineer building, scaling, and operating critical, high-availability production environments on AWS.

Proven mastery of infrastructure-as-code and containerization technologies, specifically Terraform and Docker.

Expert-level experience designing and implementing advanced cloud monitoring and observability systems (e.g., Datadog, New Relic, Prometheus/Grafana).

Advanced capability in building and maintaining internal tooling and automation using Python or Rust.

Strong "engineering taste" and the ability to define, champion, and enforce a high-quality bar for operational excellence and site reliability across the entire organization.

Excellent cross-functional communication and consensus-building skills, with a focus on clearly documenting and describing complex distributed systems to technical and non-technical audiences.

Demonstrated track record in cost optimization and resource management in a high-scale cloud environment, balancing efficiency with performance needs.

Bonus

Significant deployment and operational experience in major cloud environments beyond AWS (Azure or GCP).

Deep experience with GPU performance tuning, resource orchestration, and scaling for AI/ML workloads.

Familiarity with the unique challenges of real-time, low-latency systems, particularly conversational speech or streaming services, aligning with Sanas's core voice pipelines.

Experience designing and operating robust self-serve developer platforms, including features like usage-based billing, quota management, or tiered access controls.
