
Principal Engineer, AI Platform & Infrastructure

SpreeAI, San Francisco, CA, United States

About the Role
SPREEAI is building the future of AI-powered commerce through photorealistic virtual try‑on and multimodal intelligence. We bring together cutting‑edge AI and real‑world retail to deliver production systems that redefine how people shop online.

We are looking for a Principal Engineer to build the infrastructure, deployment pipelines, and observability systems that enable multimodal AI models to move from research prototypes to reliable, production‑grade deployments powering real‑time virtual try‑on experiences for global retail partners.

This role spans ML platform engineering, deployment systems, GPU infrastructure, and observability. You will partner closely with Applied Science, AI Platform, Product, and Partner Engineering to enable rapid research iteration and reliable model delivery at scale.

What You'll Own
ML Platform & Training Enablement

Build and operate SPREEAI’s end‑to‑end ML platform spanning training, evaluation, deployment, and monitoring.

Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems.

Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation.

Deployment, Inference & Observability

Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks.

Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability.

Establish production SLOs for latency, availability, error rate, GPU saturation, cold‑start time, cost per inference, and model quality drift.

Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems.

Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads.

Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems.

Design and operate model evaluation and benchmarking systems, including automated regression detection and quality gates for production releases.

Partner with research teams to productionize new capabilities by providing robust infrastructure, tooling, and deployment pathways.

What We're Looking For

10+ years of software engineering / infrastructure experience, with 5+ years in ML infrastructure, MLOps, distributed systems, or AI platform engineering.

Deep experience with Python, PyTorch, Kubernetes, Docker, cloud infrastructure, and GPU-based workloads.

Strong understanding of distributed systems and large‑scale ML infrastructure design.

Experience with ML workflow orchestration systems such as Ray, Kubeflow, Argo, Airflow, Flyte, or Metaflow.

Experience deploying and managing production inference systems using platforms like Triton, vLLM, TensorRT-LLM, Ray Serve, KServe, Seldon, BentoML, TorchServe, or custom services.

Strong understanding of inference optimization techniques such as batching, quantization, CUDA graphs, and memory‑aware scheduling.

Experience with model registries, experiment tracking, CI/CD for ML, canary deployments, shadow traffic, rollback strategies, and production monitoring.

Strong cloud experience across AWS, GCP, Azure, or GPU‑focused providers like CoreWeave, Lambda Labs, or RunPod.

Ability to debug performance bottlenecks across distributed systems, containers, networking, GPU memory, and storage layers.

Strong ownership mindset with the ability to define architecture, set platform standards, and drive execution across teams.

Nice to Have

Experience with multimodal, vision, or generative AI systems.

Experience with large‑scale GPU clusters (e.g., A100/H100, NCCL) and high‑throughput data pipelines.

Experience designing evaluation and monitoring systems for generative AI workloads.

Familiarity with ML security, privacy, and data governance practices.

Experience building internal developer platforms for research teams.

Success Looks Like

Within 6 months, you will:

Create reliable research‑to‑production pathways for SPREEAI’s core AI models.

Reduce manual model deployment friction through standardized pipelines and tooling.

Improve GPU utilization and reduce training and inference costs.

Establish robust observability and evaluation gates for production model releases.

Accelerate the delivery of new AI capabilities into partner‑facing experiences.

Why This Role Matters
This is not a traditional DevOps role. This is the infrastructure backbone that enables SPREEAI to turn frontier AI research into reliable, scalable, production‑grade systems. You will define the systems powering real‑time AI experiences where latency, cost, and model quality directly impact end‑user experience.

Why Join SPREEAI?

Build the Core AI Infrastructure, Not Just Features:

You will define how multimodal AI systems are reliably deployed, monitored, and scaled—directly shaping the performance, cost efficiency, and reliability of real‑world AI products.

Own Systems End‑to‑End:

You will own critical infrastructure decisions across deployment, observability, and resource management, with direct impact on production systems serving real partner traffic.

Work on Hard, High‑Leverage Problems:

From GPU efficiency to large‑scale deployment systems, you will tackle challenges that sit at the frontier of real‑time AI infrastructure.

High Velocity, Low Bureaucracy & Direct Impact:

We operate with tight feedback loops between research, platform, and product, enabling rapid iteration and meaningful impact without organizational friction.

SPREEAI is a fast‑growing, innovative AI company at the forefront of fashion and e‑commerce, revolutionizing how consumers engage with fashion through photorealistic virtual try‑on technology and hyper‑personalized shopping experiences. Our mission is to redefine the retail landscape with cutting‑edge AI solutions that blend high fashion and technology. We thrive in a dynamic, fast‑paced environment where creativity meets technology to drive real impact. If you are passionate about innovation and shaping the future of fashion, SPREEAI offers a platform to make your mark.
