Java SRE Engineer (Phoenix)

Atos, Phoenix, AZ, United States

Must-have skills include application support and knowledge of observability tools, with a background in distributed systems and a Java/Python tech stack. Knowledge of or work experience in AI is preferred.

Supporting applications in production, including incident response, on-call rotations, and post-incident reviews
Applying observability engineering to our applications — defining SLOs/SLIs/error budgets, building dashboards, and implementing alerting strategies to proactively detect system degradation before customers are impacted
Investigating and resolving production issues, including performance tuning and capacity planning
Building automation to reduce toil and improve developer productivity
Driving independent initiatives to improve platform reliability, developer experience, and operational maturity

Site Reliability Engineer Responsibilities
Proactively identify reliability risks and independently drive initiatives to address them before they become incidents
Participate in and continuously improve our on-call rotation, including incident response, triage, and leading blameless post-incident reviews
Define and implement monitoring, logging, and distributed tracing strategies; build and maintain dashboards; set meaningful alerts; and drive SLO/SLI/SLA and error budget adoption across services
Scope technical projects and break them down into user stories and tasks, driving them to completion with minimal oversight
Make sound technical decisions, leveraging input from teammates and contributing to technical conversations across engineering teams
Automate the provisioning and management of infrastructure using Infrastructure as Code (IaC) tools such as Terraform

A good fit will have
At least 5 years of experience working in a professional environment as a Site Reliability Engineer (or a Software Engineer with some SRE responsibilities)
Strong hands-on experience with observability — you understand the difference between monitoring and observability, and can articulate how metrics, logs, and traces work together
Experience participating in on-call rotations and comfort leading incident response under pressure, communicating clearly with stakeholders throughout
The ability to take ownership of initiatives or projects independently, from scoping through to delivery, without needing constant direction
Experience contributing to the design, build, and operation of cloud-native applications
Experience with automating repeatable tasks and processes
The ability to build effective working relationships, give and receive constructive feedback openly, and earn the trust of colleagues at all levels

Technologies we use include
Python, Java, and Go are our primary server languages
Our browser applications are based on Angular and React
Code lives in GitHub and flows to production through a CI/CD pipeline built on GitHub Actions, with some workloads on Jenkins
Infrastructure runs on AWS (EC2), with workloads in Docker containers orchestrated by Kubernetes
Datadog is our primary observability platform — experience with Datadog APM, dashboards, monitors, and RUM is a plus
Infrastructure is managed as code using Terraform