Sr Production Engineer

IMCS Group Inc, La Honda, CA, USA

Job type: Full Time



Primary responsibilities (daily/weekly)?
- Operate, monitor, and improve GKE apps, Analytics, and ML production workloads
- Manage Terraform/Ansible/Helm IaC for GCP resource provisioning and policy enforcement
- Participate in on-call rotation for production incidents
- Review and improve CI/CD pipelines for services written in Python, Node.js, and Java
- Collaborate with architects and developers on infrastructure architecture and design
- Automate cloud operations through programmable and secure solutions
- Leverage AI-driven tools for development agents, troubleshooting, and automation

Key projects or initiatives for the role?
- On-prem to GCP migration of large-scale *** Mail workloads
- Analytics platform work (Vertex AI, Generative AI, BigQuery, Looker, Dataproc), including pipeline and reliability improvements

Success metrics or KPIs for this role?
- On-call incident resolution time and escalation rate (MTTD, MTTR, MTTE)
- Terraform/IaC coverage of managed resources
- CI/CD pipeline reliability and deployment velocity
- Progress on on-prem to GCP migration milestones
- Sprint goal achievement (SMART goals per sprint)

How is success measured?
Quarterly OKR reviews, sprint velocity, incident retrospective outcomes, and peer/manager feedback. Candidates are expected to actively participate in sprint planning, backlog refinement, and prioritization sessions.

How does this role fit within the team/department?
This role sits within *** Mail's Production Engineering. Engineers in this role directly support cloud infrastructure reliability, cost efficiency, and automation for one of the world's largest consumer email platforms, serving hundreds of millions of users globally.

Overview of the team
*** Mail Production Engineering manages GCP-based infrastructure including GKE clusters, Compute Engine, Dataproc, Vertex AI, and other GCP services. The team is responsible for production reliability, capacity planning, cost optimization, CI/CD pipelines, MLOps, and infrastructure-as-code across 40+ GCP projects at petabyte scale. We work in close collaboration with software architects, developers, and product managers to deliver end-to-end results.

Key team goals?
- Maintain high availability for *** Mail
- Facilitate migration of large-scale on-premises applications to GCP
- Automate operations with IaC (Terraform, Ansible, Helm), CI/CD, and AI-assisted tooling
- Improve observability and incident response velocity (MTTD, MTTR)
- Enforce security standards and compliance (e.g. SOC 2)
- FinOps: optimize cost efficiency, forecast accuracy, and spend monitoring

Must-have skills/qualifications (technical, soft skills, certifications, tools)?
Technical (Required):
- 5+ years in SRE, DevOps, Infrastructure, or Cloud Operations with on-call duties
- GCP services proficiency: GKE, GCE, Networking, Security, CI/CD, and common cloud technologies
- IaC proficiency: Terraform, Ansible, and Helm Charts
- Programming in Python, Node.js, and Java; ability to build CI/CD pipelines for services in these languages
- Linux, TCP/IP, HTTP, mail protocols, DNS, CDN, load balancers, and troubleshooting
- Experience with large-scale production applications, systems, and networks

Technical (Advantageous):
- Cloud databases and storage: GCS, Cloud SQL, Spanner, Memorystore
- ML/AI platforms: Vertex AI, Generative AI, BigQuery, Looker, Dataproc
- Cloud Observability and OpenTelemetry
- Proven track record migrating on-prem infrastructure to GCP
- Operational experience in both on-prem and cloud environments

Soft Skills:
- Collaborative team player with a continuous improvement mindset
- Strong communication and presentation skills to explain cloud architecture to senior engineers
- Ability to set SMART sprint goals; experience with agile/scrum
- Proficient in integrating Generative AI and LLM tools into daily engineering workflows

Ideal experience level (years, leadership, industries)?
5+ years of total cloud/SRE experience, with a preference for GCP. Experience at large-scale internet companies running petabyte-scale production data systems is strongly preferred.

Any preferred industries or companies for background?
Large-scale SaaS, cloud-native companies, or major internet platforms (e.g. Google, Meta, LinkedIn, Salesforce, Amazon). Experience with email, messaging, or real-time communication infrastructure is a strong plus.

Desired personality or work style?
Self-directed, proactive, collaborative, and low-ego. Embraces AI tools as a force multiplier rather than a crutch. Thrives in fast-paced, high-ownership environments with evolving priorities. Continuous learner.

Key attributes or values sought in the candidate?
- Proven track record in GCP migrations from on-prem environments
- Ability to build cloud services from scratch using IaC
- Collaborative with strong agile/sprint discipline
- Good presentation and communication skills for technical architecture discussions
- Proficient in integrating GenAI/LLM tools into the engineering lifecycle

Interview: 2-3 rounds

Location and remote work options?
Remote, US-based. Occasional travel to *** offices (Sunnyvale, CA or New York, NY) for US-based candidates only.

Challenges in hiring for this role in the past (if applicable)?
- Difficult to find candidates with both GCP expertise and large-scale experience, especially in analytics infrastructure
- Screening for on-call readiness and real ownership mentality vs. order-takers

What has worked well in hiring for similar roles?
- Candidates from high-scale internet companies with demonstrated production ownership
- Live troubleshooting exercises that reveal real problem-solving approach under pressure

Any additional details or red flags to note about the role or candidate?
Red flags:
- Lack of hands-on GCP production experience
- No real IaC/Terraform ownership
- Not comfortable with on-call commitment
- Call during interview, or trouble handling the interview's technical requirements (e.g. mic issues, unable to share screen, camera will not turn on)

Candidates must be US-based; we do not permit work outside the US, even remotely.