MLOps Engineer - Machine Learning Platform

Goldman Sachs, Jersey City, NJ, United States

Job Description

What We Do

At Goldman Sachs, our Engineers don't just make things - we make things possible. Change the world by connecting people and capital with ideas. Solve the most challenging and pressing engineering problems for our clients. Join our engineering teams that build massively scalable software and systems, architect low latency infrastructure solutions, proactively guard against cyber threats, and leverage machine learning alongside financial engineering to continuously turn data into action. Create new businesses, transform finance, and explore a world of opportunity at the speed of markets.

Engineering, which comprises our Technology Division and global strategists' groups, is at the critical center of our business, and our dynamic environment requires innovative strategic thinking and immediate, real solutions. Want to push the limit of digital possibilities? Start here.

Who We Look For

We are seeking a skilled and motivated engineer to join our Artificial Intelligence Platforms organization as an MLOps Engineer on our Machine Learning Services team.

You will be part of an expert team building and operating production-grade platform and backend systems leveraged by ML engineers and application teams across the entire firm. A key focus of this role is enabling reliable, scalable, and observable deployment of Machine Learning and Large Language Models (LLMs).

This role is best suited for engineers who enjoy working on infrastructure, backend services, and distributed systems, rather than primarily on model experimentation and development.

Key Responsibilities:

Deliver scalable, efficient, secure and automated processes for building, deploying and monitoring Machine Learning models
Enable solutions that give business customers the ability to leverage the latest AI/ML infrastructure, frameworks, and tooling to deliver high-impact outcomes
Develop and demonstrate deep subject matter expertise on how to optimize machine learning model deployments to scale to the specific needs of each business customer
Deliver high-quality, production-ready code leveraging CI/CD best practices
Author and maintain high-quality documentation for both the engineering team and business customers
Participate in on-call and support rotations, helping diagnose and resolve production issues.
Continuously expand knowledge of platform architecture with a goal to take ownership of individual components.
Stay up to date with advancements in AI/ML frameworks, model serving technologies, and GenAI infrastructure.

Basic Qualifications:

2 years of experience in software engineering (backend, platform, or infrastructure).
2 years of experience in Python or a similar backend programming language.
1 year of experience supporting production ML systems (MLOps, platform or inference-related work)
Basic understanding of APIs (REST or similar) and service-to-service communication.
Experience working with containers (e.g., Docker).
Familiarity with Unix-based systems.
Exposure to public cloud environments (e.g., AWS or GCP), including core concepts such as compute, storage, and basic IAM.
Experience working with databases (SQL or NoSQL).
Solid grasp of software engineering fundamentals, including debugging, testing, and maintainable code design.
Strong problem-solving skills and the ability to work effectively in a fast-paced, collaborative environment.
Curiosity and a strong desire to keep learning, especially in the model inference and LLM platform space.

Preferred Qualifications:

4 years of experience in software engineering (backend, platform, or infrastructure)
4 years of experience supporting production ML systems (MLOps, platform or inference-related work)
4 years of experience in Python or a similar backend programming language.
Strong understanding of the end-to-end Model Development Lifecycle (MDLC)
Basic understanding of distributed systems concepts and exposure to observability concepts (logging, metrics, tracing).
Experience building containerized runtime environments for model serving (e.g., vLLM, SGLang, TensorRT, Triton, AWS Multi Model Server)
Experience with infrastructure-as-code tools, such as Terraform or CloudFormation
Experience with Kubernetes and other container orchestration platforms in the public cloud (e.g., AWS, GCP)
Experience building Machine Learning models with frameworks such as PyTorch and TensorFlow
Excellent communication skills and the ability to articulate complex technical concepts to both technical and non-technical stakeholders.

What Success Looks Like in This Role:

Can take a well-defined task and drive it to completion with minimal hand-holding.
Asks thoughtful questions instead of getting blocked.
Understands basic trade-offs (e.g., performance vs. simplicity, flexibility vs. reliability).
Writes code that is readable, testable, and easy for others to extend.
Shows curiosity about how the entire system works end-to-end, not just their assigned ticket.