U.S. citizens and Green Card holders preferred.
We are seeking a highly skilled Production Engineer to bridge the gap between application development and system operations. In this role, you will use your software engineering background to ensure our core platforms are highly available, scalable, and resilient.
You won't just monitor servers. You will dive directly into application code written in Java and Spring Boot to debug bottlenecks, automate infrastructure deployment on AWS, and optimize production performance. If you approach operational challenges as software problems, we want you on our team.
Key Responsibilities:
Production Operations & Reliability:
Own end-to-end production environments. Lead incident response, conduct Root Cause Analysis (RCA), and optimize systems to meet strict SLA/SLO and MTTR targets.
Infrastructure as Code (IaC):
Treat infrastructure as software by writing clean, reusable Terraform modules or CloudFormation templates to automate cloud provisioning and eliminate configuration drift.
Scalable Systems Architecture:
Partner with dev teams to architect fault-tolerant, cloud-native microservices utilizing automated failover, autoscaling, and traffic routing.
Continuous Delivery Automation:
Build, scale, and maintain robust CI/CD pipelines (Jenkins, GitLab CI, or AWS CodePipeline) to streamline automated testing and deployments.
Observability & Performance Tuning:
Design and manage centralized monitoring and distributed tracing stacks using Prometheus, Grafana, AWS CloudWatch, and Jaeger or AWS X-Ray to catch issues before they impact users.
Production Security:
Implement and enforce enterprise-grade security controls, including AWS IAM roles, OAuth2, JWT, and data encryption.
Required Skills & Qualifications:
Experience:
3–5 years of dedicated experience in Production Engineering, Site Reliability Engineering (SRE), or DevOps.
Backend Engineering:
Strong proficiency in Java and Spring Boot, with the ability to read, trace, and debug complex microservice applications.
AWS & Containerization:
Hands-on experience with core AWS infrastructure, specifically Docker, container orchestration on Amazon EKS (Kubernetes) or ECS, Lambda, SQS, SNS, and Application Load Balancers (ALB).
Automation:
Practical experience using Terraform for cloud infrastructure automation, along with general scripting skills.
Telemetry Stack:
Deep practical knowledge of Prometheus and Grafana, or AWS CloudWatch, for real-time visibility.
Environment:
Comfortable working in fast-paced Agile/Scrum environments and participating in production on-call rotations.
What Will Make You Stand Out:
Proven track record of migrating legacy monoliths into cloud-native microservices.
Experience running cost-optimization and cloud-resource rightsizing initiatives.
A metrics-driven mindset focused on improving system uptime and reducing operational overhead.

Production Engineer (Java & AWS Cloud Infrastructure)
Kaygen, Plano, TX, USA
Job type: Full Time