
Staff DevOps Engineer
webAI Inc, Austin, TX, United States
About Us:
webAI is pioneering the future of artificial intelligence by establishing the first distributed AI infrastructure dedicated to personalized AI. We recognize the evolving demands of a data-driven society for scalability and flexibility, and we firmly believe that the future of AI lies in distributed processing at the edge, bringing computation closer to the source of data generation.
Our mission is to build a future where a company's valuable data and intellectual property remain entirely private, enabling the deployment of large-scale AI models directly on standard consumer hardware without compromising the information embedded within those models. We are developing an end-to-end platform that is secure, scalable, and fully under the control of our users, empowering enterprises with AI that understands their unique business.
We are a team driven by truth, ownership, tenacity, and humility, and we seek individuals who resonate with these core values and are passionate about shaping the next generation of AI.
About the Role:
We are seeking a Staff DevOps Engineer to architect, build, and scale secure infrastructure for deploying AI workloads across cloud and edge environments. This is a high-impact, staff-level individual contributor role where you will drive infrastructure strategy, lead technical initiatives, and serve as the subject matter expert on cloud architecture, security best practices, and platform reliability.
You will design scalable, automated infrastructure solutions that enable our AI platform to operate efficiently across diverse deployment scenarios, from public cloud to on-premises and edge computing environments. This role requires deep technical expertise, architectural thinking, and the ability to translate complex requirements into production-ready infrastructure automation.
Responsibilities:
Design and architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
Lead incident response and reliability initiatives, participate in on-call rotation, conduct post-mortems, and drive continuous improvement for system reliability
Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
Drive technical documentation and knowledge sharing including runbooks, architecture decision records (ADRs), and infrastructure standards
Qualifications:
7+ years of hands-on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering with a proven track record of architecting production systems
Experience with MLOps workflows: model deployment automation, versioning, and lifecycle management
Expert-level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud-native technologies in production environments
5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large-scale (50+) cloud resources
Deep experience with cloud platforms (AWS, Azure, or GCP) including compute, networking, storage, and managed services
Proven experience building and scaling CI/CD pipelines with integrated security controls (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
Strong programming skills in Python (preferred for automation), Bash, or Go for infrastructure tooling and automation
Production experience with observability and monitoring tools: Prometheus, Grafana, ELK, CloudWatch, Datadog, or similar
Demonstrated experience with GitOps methodologies and declarative infrastructure management
Strong understanding of security best practices: encryption, secrets management, identity and access management (IAM), network security
Excellent written and verbal communication skills for technical documentation and cross-functional collaboration
Preferred Skills:
Experience architecting multi-cloud or hybrid cloud environments with portability and interoperability considerations
Hands-on experience deploying large language models (LLMs) or transformer models at scale with model serving infrastructure
Expertise in Zero Trust architecture and modern security patterns for cloud-native applications
Experience with service mesh technologies (Istio, Linkerd) for microservices communication and observability
Strong understanding of AI/ML infrastructure: feature stores, model registries, A/B testing infrastructure, and model monitoring
Experience with edge computing deployments and distributed system architectures
Cost optimization expertise: FinOps practices, resource rightsizing, and cloud cost management
Experience mentoring or leading technical initiatives across engineering teams
Certifications: CKA, CKAD, Terraform Associate, AWS Solutions Architect, Azure Administrator, or GCP Professional Cloud Architect
Core Values:
We at webAI are committed to living out the core values we have put in place as the foundation on which we operate as a team. We seek individuals who exemplify the following:
Truth - Emphasizing transparency and honesty in every interaction and decision.
Ownership - Taking full responsibility for one's actions and decisions, demonstrating commitment to the success of our clients.
Tenacity - Persisting in the face of challenges and setbacks, continually striving for excellence and improvement.
Humility - Maintaining a respectful and learning-oriented mindset, acknowledging the strengths and contributions of others.
Benefits:
Competitive salary
Comprehensive health, dental, and vision benefits package
401(k) match (U.S.-based employees only)
$200/month Health & Wellness stipend
Continuing Education support
$500/year Function Health subscription (U.S.-based employees only)
Free parking for in-office employees
Flexible Time Off (FTO)
Parental leave for eligible employees
Supplemental life insurance
webAI is an Equal Opportunity Employer and does not discriminate against any employee or applicant on the basis of age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We adhere to these principles in all aspects of employment, including recruitment, hiring, training, compensation, promotion, benefits, social and recreational programs, and discipline. In addition, it is the policy of webAI to provide reasonable accommodation to qualified employees who have protected disabilities to the extent required by applicable laws, regulations and ordinances where a particular employee works.