
Principal MLOps Engineer | Remote, US; DMV; McLean, VA; Boston, MA; San Antonio, TX; Colorado Springs, CO; Tampa, FL; Honolulu, HI
Overview
This is a U.S.-based position. The programs we support require U.S. citizenship for employment eligibility, and all work must be conducted within the continental U.S.
Raft (https://TeamRaft.com) is a customer-obsessed defense tech company focused on empowering U.S. military and government agencies with AI/ML and data solutions. We are a leader in autonomous data fusion and Agentic AI, with a focus on Distributed Data Systems, Platforms at Scale, and Complex Application Development. We build digital solutions that impact the lives of millions of Americans.
We’re looking for an experienced Principal MLOps Engineer to support our customers and join our team of problem solvers.
What you’ll do
Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
Mature Raft’s internal ML platform and model lifecycle capabilities, including packaging, registry/catalog workflows, deployment, monitoring, and operational support
Deploy and manage ML workloads on Kubernetes, including GPU-enabled clusters
Support model serving and inference infrastructure for a range of ML use cases (traditional ML, computer vision, speech/audio, and LLM-based systems)
Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
Partner with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
Improve observability, reliability, security, and maintainability across ML infrastructure and services
Evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
Contribute to infrastructure decisions across edge, on-prem, and cloud deployments
Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
Collaborate with customers at advanced DoD locations as needed
What we are looking for
7+ years of hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related roles
5+ years of experience with Docker and Kubernetes in production
5+ years of experience with enterprise cloud infrastructure (AWS, Azure, or similar)
Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
Experience building and maintaining ML platforms, infrastructure, or pipelines used by engineering or data science teams
Practical experience deploying ML workloads on Kubernetes
Experience managing clusters or workloads that use GPUs
Strong understanding of Helm and Kubernetes deployment patterns
Strong scripting or programming skills, preferably Python
Experience with modern software engineering practices (Git, CI/CD, DevOps, Agile/Scrum)
Strong troubleshooting, systems thinking, and communication skills
Ability to work independently and collaboratively in a fast-moving environment
Ability to obtain and maintain a Top Secret clearance
Ability to obtain Security+ certification within the first 90 days of employment
Highly preferred
Experience with ML model serving and inference platforms (e.g., Triton Inference Server, KServe, Ray Serve, vLLM)
Experience with secure and compliant deployment practices in regulated or government environments
Experience with Kubeflow or similar Kubernetes-based ML platforms
Service mesh knowledge (e.g., Istio)
Experience provisioning and debugging complex CI/CD systems
Experience with Terraform or other IaC tools
Knowledge of software supply chain security, container hardening, vulnerability management, and runtime scanning
Experience supporting ML systems across cloud, on-prem, and edge
Experience collaborating with ML engineers on training, evaluation, packaging, and release workflows
Familiarity with storage and artifact systems used in ML platforms (S3-compatible stores, registries, metadata/catalog systems)
What success looks like
You help Raft stand up a mature, repeatable ML platform for deploying and managing models in production
ML engineers can deploy faster due to clearer, more reliable deployment, serving, and platform workflows
Model deployments are more secure, observable, and supportable across real-world mission environments
Stronger infrastructure for model lifecycle management, including deployment standards, runtime patterns, and platform ownership
Clearance Requirements
Ability to obtain and maintain a Top Secret clearance
Work Type
Remote from the following locations only: DMV; McLean, VA; Boston, MA; San Antonio, TX; Colorado Springs, CO; Tampa, FL; Honolulu, HI
May require up to 40% travel
Salary Range
$150,000 - $200,000
What we will offer you
Highly competitive salary
Fully covered healthcare, dental, and vision
401(k) with company match
PTO plus 11 paid holidays
Monthly box of snacks to enjoy while doing meaningful work
Remote, hybrid, and flexible work options
And more
Equal Employment Opportunity
We are an equal opportunity employer. All applicants will be considered without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.