Senior Production Engineer

Deepstreamtech · San Francisco, CA, USA · 1 month ago

Pay: 80.000 - 100.000
Job type: Full Time

Requirements

Bachelor's degree in Computer Science, a related technical field involving software engineering, or equivalent practical experience

Proficient in at least one modern programming language

Systematic problem-solving methods and effective communication skills

The position may require access to U.S. export-controlled technologies, technical data, or sensitive government data

Employment with Snowflake is contingent on Snowflake verifying that you: (i) may legally access U.S. export-controlled technologies, technical data, or sensitive government data; or (ii) are eligible to obtain, in a timely manner, any necessary license or other authorization(s) from the U.S. Government to allow access to U.S. export-controlled technology, technical data, or sensitive government data

Experience with capacity and load testing of distributed applications (Desirable)

Experience with containers and container orchestration systems such as Kubernetes (Desirable)

Experience in deploying, managing, and operating scalable and fault-tolerant Linux infrastructure (Desirable)

Experience with SLO-driven reliability management processes (Desirable)

Hands-on experience with one or more public cloud providers (AWS, Azure, or GCP) (Desirable)

Ability to prioritize tasks and work independently (Desirable)

What the job involves

The Production Engineering Team at Snowflake is responsible for driving the reliability tools and processes that ensure Snowflake consistently delivers a top-tier experience for its customers. This includes championing Service Level Objectives (SLOs) across all of Engineering, building the infrastructure necessary for rapid detection of reliability issues, and deeply engaging in system health verification after releases

We think about production reliability end-to-end: how we proactively prevent issues, quickly detect and diagnose problems when they arise, and efficiently resolve them to minimize impact. We drive a culture of learning from every incident

Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement

Scale systems sustainably through automation; drive changes that improve reliability and velocity

Establish and practice low-noise incident response rotations and blameless postmortems to prevent problem recurrence

Write and review code. Develop documentation and capacity plans, and debug the hardest problems on large distributed systems

Collaborate with software engineers to establish, maintain, and optimize functional and performance SLOs

Participate in a 12x7 on-call rotation
