
Cloud MLOps Engineer
Insight Global, Austin, TX, United States
We are seeking a Cloud MLOps Engineer to build and operate the cloud infrastructure that powers machine learning for a humanoid robotics platform. This role sits at the intersection of ML research, production systems, and end-user applications, with a strong focus on robot telemetry data, model lifecycle management, and production deployment.
You will enable researchers and applied ML engineers to reliably train, evaluate, and deploy models at scale, while ensuring telemetry-driven insights flow from robots in the real world back into continuous learning systems.
What You'll Do
Design, deploy, and maintain cloud-native MLOps platforms supporting large-scale ML training, evaluation, and inference workloads
Operate Kubernetes-based infrastructure (self-managed or managed services such as GKE, EKS, or AKS) for ML workloads and data applications
Build and maintain end-to-end ML pipelines that bridge research workflows with production systems
Support robot telemetry ingestion, processing, and analytics, enabling model feedback loops from deployed humanoid robots
Integrate and operate ML tooling such as MLflow, Weights & Biases, Slurm, or similar systems for experiment tracking, scheduling, and reproducibility
Enable model deployment to production, including CI/CD for models, versioning, monitoring, and rollback strategies
Partner closely with ML research, perception, controls, and applications teams to productionize models safely and efficiently
Implement observability across ML systems, including model performance, data drift, and system health
Improve reliability, scalability, and security of cloud ML infrastructure supporting real-world robotic systems
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Required Skills & Experience
Strong experience with cloud platforms: AWS, GCP, and/or Azure
Hands-on experience operating Kubernetes or managed Kubernetes services in production
Experience building or maintaining MLOps platforms supporting training and inference
Familiarity with ML experiment tracking and orchestration tools (e.g., MLflow, Weights & Biases, Slurm, Ray, Kubeflow)
Experience deploying ML models into production-facing applications or services
Strong understanding of CI/CD, infrastructure-as-code, and automation
Proficiency in Python; experience with Bash or another scripting language
Ability to collaborate effectively across research and engineering teams
Nice to Have Skills & Experience
Experience working with robotics or real-time telemetry data
Familiarity with streaming data systems (e.g., Kafka, Pub/Sub, Kinesis)
Experience supporting GPU workloads in cloud or Kubernetes environments
Exposure to edge-cloud ML deployment or fleet-based systems
Prior work in robotics, autonomy, or embodied AI environments
Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.