
Director, Site Reliability Engineering
Waystar, Atlanta, GA, United States
ABOUT THIS POSITION
We are seeking an experienced and strategic Director of Site Reliability Engineering (SRE) to lead and scale our SRE organization, overseeing four SRE teams responsible for the reliability, scalability, performance, and operational excellence of our most critical platforms and services.
This role is both highly technical and deeply people‑focused, requiring strong cloud expertise (GCP preferred or equivalent), hands‑on SRE experience, and a proven ability to set vision, standards, and direction across Site Reliability Engineering, Platform Engineering, and Infrastructure‑as‑Code (IaC) automation.
As a senior leader, the Director of SRE will partner closely with Engineering, Product, Architecture, Infrastructure, and Security leadership to embed reliability, automation, and resilience into every layer of the technology stack while enabling teams to move faster and more safely.
WHAT YOU'LL DO
Leadership & Strategy
Provide strategic leadership and oversight for four SRE teams, setting clear direction, priorities, and expectations aligned to business and engineering objectives.
Lead, mentor, and develop SRE managers and senior engineers, fostering a culture of accountability, operational ownership, innovation, and psychological safety.
Define and own the SRE and Platform Engineering strategy and roadmap, ensuring alignment with cloud transformation initiatives and long‑term organizational goals.
Serve as a key voice in architectural and platform decisions, influencing designs with a focus on scalability, reliability, automation, and operational efficiency.
Partner with executive leadership to communicate reliability posture, risks, and investment needs in clear business terms.
Reliability & Platform Engineering
Establish and continuously evolve SRE principles and best practices, including SLIs, SLOs, error budgets, toil management, and reliability‑driven prioritization.
Provide technical direction and governance across GCP (preferred) and AWS environments, ensuring consistent reliability and operational patterns.
Drive the evolution of Platform Engineering, enabling self‑service infrastructure and guard‑railed service delivery for application teams.
Own strategy and standards for Infrastructure‑as‑Code (IaC) and automation, leveraging tools such as Terraform or equivalent frameworks across cloud environments.
Ensure observability excellence through metrics, logging, tracing, alerting, and proactive capacity and performance management.
Incident Management & Operational Resilience
Provide executive leadership during large‑scale or high‑impact incidents, ensuring effective coordination, escalation, and stakeholder communication.
Define, refine, and scale incident management and on‑call practices, emphasizing resilience, sustainability, and rapid recovery.
Champion blameless postmortems, ensuring root causes are addressed and learnings are translated into systemic improvements.
Partner with Security and Compliance teams to ensure systems meet security, privacy, and regulatory requirements without compromising reliability.
Operational Excellence & Measurement
Own and report on reliability metrics, operational KPIs, and service health for leadership and executive stakeholders.
Drive continuous improvement through reliability reviews, retrospectives, and data‑driven decision‑making.
Balance reliability, velocity, and cost across platforms, applying error budgets and capacity planning to guide trade‑offs.
WHAT YOU'LL NEED
10+ years of experience in SRE, infrastructure, platform, or systems engineering roles, with 5+ years leading managers and senior technical teams.
Direct, hands‑on experience in Site Reliability Engineering, including operating production systems at scale.
Strong experience with Google Cloud Platform (GCP) or equivalent public cloud (AWS or Azure), including distributed, cloud‑native architectures.
Proven expertise in Infrastructure‑as‑Code (IaC) and automation frameworks (e.g., Terraform or similar).
Deep understanding of observability ecosystems (metrics, logging, tracing), CI/CD pipelines, and DevOps/SRE tooling.
Ability to communicate complex technical concepts clearly to both technical and non‑technical stakeholders, influencing at all levels of the organization.
AI & Innovation Mindset
Leverage AI‑assisted tools and platforms to improve operational efficiency, incident response, reliability analysis, and engineering workflows.
Champion experimentation and continuous learning, applying emerging technologies to modernize reliability and platform practices.
Enable teams to responsibly adopt AI capabilities while maintaining reliability, security, and governance standards.
Preferred Qualifications
Experience with Kubernetes, microservices architectures, and service meshes.
Familiarity with chaos engineering, resilience testing, and failure injection methodologies.
Background in performance engineering, capacity planning, or large‑scale platform migrations.
Experience leading reliability or platform initiatives during major cloud or organizational transformations.
WAYSTAR PERKS
Competitive total rewards (base salary + bonus, if applicable)
Customizable benefits package (3 medical plans with Health Savings Account company match)
We offer generous paid time off for our non‑exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays. We also offer flexible time off for our exempt team members + 13 paid holidays.
Paid parental leave (including maternity + paternity leave)
Education assistance opportunities and free LinkedIn Learning access
Free mental health and family planning programs, including adoption assistance and fertility support
401(K) program with company match
Pet insurance
Employee resource groups
Waystar is proud to be an equal opportunity workplace. We celebrate, value, and support diversity and inclusion. Qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, marital status, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
This applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
Job Category: Technology/Engineering
Job Type: Full time
Req ID: R3111