
Director, AI Platform Engineering
NYC Health + Hospitals, New York, NY, United States
The Director of AI Platform Engineering provides strategic leadership for the cloud, platform, and deployment infrastructure supporting Artificial Intelligence (AI) across the System. This role ensures that AI systems used in clinical workflows operate safely, reliably, and securely, and comply with applicable laws and NYC Health + Hospitals rules and regulations. The Director leads platform engineering, cloud architecture, Continuous Integration and Continuous Delivery/Deployment (CI/CD) modernization, and reliability functions, ensuring that AI tools enhance clinical excellence and protect patient safety.
Essential Duties and Responsibilities:
- Provides strategic leadership for cloud, platform, and infrastructure engineering, developing and leading multi-year roadmaps, standards, and strategies for the secure and scalable deployment of AI products.
- Oversees architecture and governance of CI/CD pipelines, infrastructure‑as‑code (Terraform), and Kubernetes/Azure Kubernetes Service (AKS) orchestration to support reliable AI deployment.
- Defines and oversees the infrastructure for the high-volume, low-latency data pipelines, feature stores, and data access layers required for training and real‑time serving of AI models.
- Establishes enterprise-wide reliability and monitoring frameworks to ensure stable and safe operation of AI systems used by clinicians and care teams.
- Implements platform controls and audit trails to monitor and ensure Responsible AI practices, model explainability (XAI), and checks for model drift and bias on an ongoing basis.
- Partners with product management, product development, cybersecurity, Machine Learning Operations (MLOps) engineering, and interoperability teams to ensure AI platform readiness and safe integrations.
- Leads incident management and root‑cause analysis to minimize disruptions to clinical workflows and drive reliability improvements.
- Ensures the AI platform and infrastructure provide the necessary controls, logging, and audit capabilities to meet compliance requirements and support AI safety frameworks.
- Develops long‑term platform resilience, disaster recovery, and cost optimization strategies to support System‑wide AI expansion.
- Defines and standardizes the platform's toolchain and Application Programming Interfaces (APIs) for the Machine Learning (ML) lifecycle, including model experimentation tracking (e.g., MLflow, ClearML), model registry, and automated testing/validation frameworks.
- Manages a team of platform and infrastructure engineers.
- Performs other duties as assigned.
Qualifications:
- Master’s Degree from an accredited college or university in Computer Science, Information Systems or Technology, Cybersecurity, Hospital Administration, Health Care Planning, Business Administration, Mathematics, Engineering, or Public Administration; and three (3) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial, or supervisory capacity; or,
- Bachelor’s Degree from an accredited college or university in one of the disciplines listed above; and five (5) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial, or supervisory capacity.
Minimum Qualifications:
- Master’s degree from an accredited college or university in Computer Science, Engineering, Information Systems, or a related discipline; and,
- Five (5) years of experience in Machine Learning Operations (MLOps), Machine Learning (ML) engineering, Artificial Intelligence (AI) platform engineering, or operating production Machine Learning (ML) / Large Language Model (LLM) systems; or ten (10) years of experience in software and data engineering.
Certifications Preferred:
- Professional certifications in cloud architecture, ML/AI engineering, or DevOps from leading cloud platforms.
Preferred Knowledge Areas, Skills, Abilities, and other Qualifications:
- Figma, Sketch, Adobe XD, or similar design and prototyping tools.
- Expertise in Azure architecture, Kubernetes/Azure Kubernetes Service (AKS), Terraform, Continuous Integration and Continuous Deployment (CI/CD), and automation frameworks.
- Experience supporting production AI/ML systems or mission‑critical workloads.
- Knowledge of observability tools, monitoring frameworks, and reliability engineering practices.
- Understanding of security and compliance standards, including the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and National Institute of Standards and Technology (NIST) standards.
- Demonstrated leadership, cross‑functional collaboration, and technical communication skills.
- Strong stakeholder engagement and change‑management skills.
- Experience in healthcare, public sector, or other regulated environments.
- Experience deploying or supporting AI/ML, LLM, or agentic AI systems in production.
- Familiarity with Site Reliability Engineering (SRE) or platform engineering frameworks.
Experience Using the Following Software and/or Platforms:
- Azure cloud services, Docker, Kubernetes/AKS, Terraform, CI/CD platforms (Azure DevOps, GitHub Actions, Jenkins), monitoring/observability tools (Grafana, Azure Monitor), and secrets/IAM security tooling.