eGain Corporation
Director, Site Reliability Engineering
Sunnyvale, CA, USA
eGain Corporation, Sunnyvale, California, United States, 94087
eGain is the leader in AI knowledge management solutions for enterprises. As organizations recognize the critical value of trusted knowledge and content feeding AI systems, eGain provides the single source of truth—explainable, reliable, and maintainable—that serves as the repository for all enterprise know-how including SOPs, policy documents, troubleshooting guides, and product information. This foundation enables scalable and effective AI automation of business operations, with customer service as the primary point of ROI. Our solutions power leading companies including JP Morgan, Liberty Mutual, Florida Blue, and Bosch.
The Opportunity
Join us in reimagining knowledge management as mission-critical infrastructure for the AI-powered enterprise. We’re seeking talented, hungry, and bold leaders to shape the future of how enterprises leverage AI and knowledge at scale.
Position Overview
As Director of Site Reliability Engineering, you will ensure that eGain’s AI knowledge management platform operates with the reliability, performance, and resilience that enterprise customers demand. You’ll lead the strategy and execution for observability, incident management, capacity planning, and continuous improvement of our production systems. This role is critical as our platform becomes mission-critical infrastructure for the world’s leading enterprises.
Key Responsibilities
Build and lead a world-class SRE organization that ensures exceptional reliability and performance of eGain’s cloud services
Define and achieve ambitious SLOs/SLAs that meet the demands of enterprise customers operating 24/7 customer service operations
Establish comprehensive observability across the platform including monitoring, logging, tracing, and alerting
Drive incident response processes, post-mortems, and continuous improvement to prevent recurring issues
Lead capacity planning and performance optimization to ensure the platform scales efficiently with customer growth
Implement automation for deployment, operations, and remediation to reduce toil and improve reliability
Partner with platform and application engineering teams to build reliability into the system from the ground up
Champion a culture of reliability engineering across the organization, educating teams on best practices
Manage disaster recovery planning and business continuity to protect customer operations
Own the technical relationship with customers on reliability and performance topics
What We’re Looking For
10+ years of experience in software engineering, operations, or SRE roles with 5+ years in SRE leadership
Deep expertise in observability tools, monitoring systems, and incident management practices
Strong background in distributed systems, cloud infrastructure, and production operations at scale
Experience establishing and achieving SLOs/SLAs for enterprise SaaS or mission-critical systems
Proficiency with automation, infrastructure-as-code, and modern DevOps/SRE tooling
Track record of improving system reliability through data-driven approaches and systematic problem-solving
Excellent incident management and crisis leadership skills
Strong collaboration abilities and experience partnering with engineering teams to improve reliability
Passion for operational excellence and continuous improvement
Bold thinking about what’s possible in system reliability combined with pragmatic execution
Why eGain
Ensure the reliability of systems that power customer service for the world’s leading enterprises
Build SRE practices from the ground up with significant impact and visibility
Work with modern cloud technologies and solve complex reliability challenges
Lead a team focused on operational excellence and engineering rigor
Our Hiring Process is “Easy with eGain”
Step 1
Aptitude section – this is a GRE-style test (60 minutes or less)
Functional section – this is a take-home test
Step 2
Panel interview (in person at the eGain Sunnyvale office)
Next step
Email your resumé to hiring@egain.com with the position title “Director, Site Reliability Engineering” in the email subject.
Compensation
Base salary is $250,000 per year. Stock options. Please note that the compensation package can vary based on the candidate’s qualifications and experience level.