Director, MLOps Engineering

NYC Health + Hospitals, New York, NY, United States

The Director of Machine Learning Operations (MLOps) Engineering provides strategic and operational leadership for the end‑to‑end Machine Learning (ML) and agentic Artificial Intelligence (AI) operations platform. This role oversees the full lifecycle required to take AI prototypes—including ML models, Large Language Model (LLM) based systems, and agentic/Retrieval-Augmented Generation (RAG) pipelines—from development to production, including environment setup, pipeline engineering, integration, Quality Assurance (QA), deployment, and ongoing maintenance.

Essential Duties and Responsibilities

Defines the multi-year technical roadmap for the ML platform, continuously evaluating emerging MLOps tools, LLM frameworks, and infrastructure innovations to maintain a cutting‑edge and efficient platform, guiding long-term strategy for reliability, lifecycle automation, cost optimization, and scaling across the System.
Leads the setup, governance, and maintenance of Quality Assurance (QA), staging, and production environments for ML applications, LLM pipelines, and agentic AI systems.
Owns the transition from AI prototypes to production, including model refactoring, packaging, optimization, dependency management, and deployment readiness validation.
Modifies and operationalizes ML and LLM applications to run as scalable mini‑batch or streaming pipelines to meet clinical workflow requirements.
Establishes automated model re‑training and re‑deployment pipelines triggered by performance degradation, data drift, or scheduled intervals, ensuring continuous model improvement.
Integrates AI applications with enterprise data platforms, interface engines, cloud services, container orchestration environments, model tracking tools, and clinical workflow systems in support of end‑to‑end AI operations.
Collaborates with Data Platform and AI Governance teams to ensure compliant data and features are usable by ML/LLM pipelines; manages the infrastructure for the low‑latency feature serving layer required for real‑time inference.
Designs, manages, and maintains infrastructure for Retrieval‑Augmented Generation (RAG) pipelines, vector databases, embedding generation, orchestration layers, and automated agentic tools.
Implements and enforces a Model Governance framework, including automated checks for model versioning, lineage tracking, reproducibility, model card generation, and secure model access controls across all environments.
Establishes and executes robust QA processes including unit, functional, and integration testing to ensure AI applications behave consistently with validated prototypes prior to deployment.
Develops and manages reliability and observability frameworks covering logging, monitoring, alerting, data quality drift detection, and runtime monitoring.
Manages and optimizes compute resource utilization (e.g., CPU/GPU/TPU) and cloud spending related to model training, experimentation, and high‑volume, real‑time model serving.
Collaborates with Product Development, Platform Engineering, Interoperability, Cybersecurity, and clinical partners to ensure safe, integrated, and workflow‑appropriate deployment of AI tools.
Leads incident response, troubleshooting, root‑cause analysis, and continuous reliability improvement for production AI systems.
Translates AI governance policies into automated, auditable, and repeatable technical controls embedded within the MLOps pipelines to ensure compliance.
Manages a team of ML Engineers, QA Engineers, and LLM Engineers.
Performs other duties as assigned.

Minimum Qualifications

1. Master’s Degree from an accredited college or university in Computer Science, Information Systems or Technology, Cybersecurity, Hospital Administration, Health Care Planning, Business Administration, Mathematics, Engineering or Public Administration; and three (3) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial or supervisory capacity; or,
2. Bachelor’s Degree from an accredited college or university in disciplines, as listed in “1” above; and five (5) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial or supervisory capacity.

Assignment Qualification Preferences

Master’s degree from an accredited college or university in Computer Science, Computer Engineering, or a related technical discipline; and,
Five (5) years of experience in Machine Learning Operations (MLOps), Machine Learning (ML) engineering, Artificial Intelligence (AI) platform engineering, or production Machine Learning (ML)/Large Language Model (LLM) system operations; or ten (10) years of experience in Software Engineering and Data Engineering.

Certifications Preferred

Professional certifications in cloud architecture, ML/AI engineering, or DevOps from leading cloud platforms.

Preferred Knowledge Areas, Skills, Abilities, and Other Qualifications

Deep experience with Databricks, Spark/Scala, MLflow, Azure cloud, Docker, Terraform, Continuous Integration and Continuous Deployment (CI/CD), and Kubernetes/Azure Kubernetes Service (AKS).
Experience deploying production ML systems, agentic AI systems, Retrieval-Augmented Generation (RAG) pipelines, vector databases (DBs), orchestration frameworks, or LLM applications.
Demonstrated ability to convert prototypes into production systems, including optimizing pipelines for streaming and mini‑batch.
Experience integrating AI applications with Electronic Health Record (EHR) and clinical data platforms (e.g., Epic, Health Level 7 (HL7)/Fast Healthcare Interoperability Resources (FHIR), Mirth Connect, Laboratory Information System (LIS)/Picture Archiving and Communication System (PACS)).
Strong background in observability, monitoring, drift detection, Quality Assurance (QA) automation, and ML system reliability.
Understanding of Health Insurance Portability and Accountability Act of 1996 (HIPAA), National Institute of Standards and Technology (NIST), responsible AI, and safety‑critical ML governance.
Proven track record leading engineering teams and collaborating across clinical, operational, and Information Technology (IT) domains.
Strong communication skills and ability to translate complex technical systems into clinically meaningful explanations.
Experience deploying AI systems in healthcare, public‑sector environments, or other highly regulated systems.
Experience building or operating large‑scale RAG or agentic AI systems in production.
Familiarity with Plotly visualization, clinical note processing, or multimodal clinical models.
Experience Using the Following Software and/or Platforms:
- Python, Java, Scala, PySpark, Structured Query Language (SQL).
- Containerization, event‑driven programming.
- Microsoft and Google operating systems.
