Pluralis Research is hiring: Machine Learning Engineer - ML Training Platform

Pluralis Research, San Francisco, CA, United States

Overview
Pluralis Research is pioneering Protocol Learning – a fully decentralised way to train and deploy AI models that opens this layer to individuals rather than only well-resourced corporations. By pooling compute from many participants, incentivising their efforts, and preventing any single party from controlling a model’s full weights, we’re creating a genuinely open, collaborative path to frontier-scale AI.

We’re looking for an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure powering our decentralised ML training platform. You will own core systems spanning infrastructure orchestration, distributed compute, and services integration, enabling continuous experimentation and large‑scale model training.

Responsibilities

Multi-Cloud Infrastructure: Design resource management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform). Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.

Distributed Training Systems: Architect fault-tolerant infrastructure for distributed ML, covering GPU clusters, the NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.

Real-World Networking: Build systems that simulate and handle real-world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity. Our training happens on consumer nodes and non-co-located infrastructure, not in a datacenter.

What You’ll Bring
Ideally, you’ll have 5+ years of professional experience, with depth in:

Infrastructure & Platform Engineering: Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments, lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale.

Distributed Systems & ML Infrastructure: Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, decentralized networking (P2P, NAT traversal, traffic shaping), and real-world bandwidth constraints.

Systems Programming & Reliability: Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling) with hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response.

What We’re Looking For

Experience in a startup environment with an emphasis on microservices orchestration, or a big tech background

Deep understanding of multi-cloud infra & distributed training systems

A team player with high attention to detail

A strong passion to join

Backed by Union Square Ventures and other tier-1 investors, we’re a world-class, deeply technical team of ML researchers. Pluralis is unapologetically ideological. We believe the world will be a better place if we succeed in what we are attempting, and we see Protocol Learning as the only plausible approach to preventing a handful of massive corporations from monopolising model development, access, and release, and from achieving massive economic capture. If this resonates, please apply.


