Principal Machine Learning Engineer - Reliability, San Mateo, CA, United States

Roblox Corporation, San Mateo, CA, United States

Principal Machine Learning Engineer - Reliability
Every day, tens of millions of people come to Roblox to explore, create, play, learn, and connect with friends in 3D immersive digital experiences, all created by our global community of developers and creators.

At Roblox, we’re building the tools and platform that empower our community to bring any experience that they can imagine to life. Our vision is to reimagine the way people come together, from anywhere in the world, and on any device. We’re on a mission to connect a billion people with optimism and civility, and we’re looking for amazing talent to help us get there.

A career at Roblox means you’ll be working to shape the future of human interaction, solving unique technical challenges at scale, and helping to create safer, more civil shared experiences for everyone.

Why Reliability?

Roblox serves over 100 million people every day across a platform that is constantly evolving, and behind every experience is infrastructure that has to work, every time, at massive scale. The Reliability team operates across the depth and breadth of the Roblox stack. Availability of the platform is a key company goal. We are hiring the first Principal Machine Learning Engineer on our team.

As a Principal Machine Learning Engineer within Reliability, you will set the 3-5 year technical strategy and architectural blueprint for how machine learning systems and practices can be leveraged to improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for leveraging massive data across logs, traces, metrics, and production changes to proactively detect issues before they become real problems (reducing mean time to detect, MTTD) and/or reduce the time to resolve incidents (mean time to resolve, MTTR). You will have the opportunity to collaborate cross-functionally with similar teams at Roblox to define best practices and software.

You will:

Define and Own the Technical Vision: Define and lead the multi-year technical vision, architectural strategy, and execution for machine learning solutions in Content Safety, ensuring these systems proactively and effectively detect and mitigate violative content at massive scale.

Strategic Stakeholder Partnership: Collaborate with executive-level Product, Data Science, Policy, and Operations leaders to define and prioritize the strategic machine learning roadmap, influencing product strategy and demonstrating the impact of ML on user trust and safety outcomes.

Lead Innovation: Oversee the adoption and safe deployment of innovative machine learning techniques (e.g., transfer learning, self-supervised learning, quantization, LoRA, distillation).

Drive End-to-End Product Development: You will not just model; you will build. You will work cross-functionally to construct datasets from scratch where none exist, build auto-labeling pipelines, and ship solutions to solve novel technical problems.

Ship Code, Not Just Models: Expect to spend roughly 30-40% of your time on backend and integration work. You will be responsible for integrating your work into the production stack, leveraging modern AI coding tools (e.g., Cursor) to accelerate velocity and handle infrastructure complexity.

You have:

8+ years of experience designing, developing, and operating large-scale, high-impact machine learning systems in a production environment.

A proven track record of successfully setting the long-term technical direction for an entire ML domain, demonstrating the ability to take ambiguous problems from concept to scaled production impact.

Deep expertise in advanced ML architectures and techniques, including Computer Vision (CV) and/or Vision-Language Models (VLMs).

Expertise in architecting scalable, real-time ML inference services and robust data pipelines.

Demonstrated success in leading and resolving high-stakes, cross-functional conflicts and technical disagreements, with an ability to build consensus among diverse stakeholders.

Exceptional product sense and strategic planning ability: able to translate platform safety requirements into an achievable, iterative technical roadmap.

You are:

A Visionary Architect: Capable of synthesizing complex business and safety goals into a clear, compelling, and actionable technical strategy.

A Pragmatic Builder: You are scrappy and impact-oriented. You view undefined data and messy systems as opportunities to build structure rather than blockers to progress.

Comfortable with Ambiguity: You thrive in undefined or open-ended problem spaces, providing structure, clarity, and decisive direction to your teams.

An Inspiring Leader: Passionate about developing the next generation of technical leaders, managers, and engineers.

An Executive Communicator: Highly effective at communicating complex technical concepts to both engineering teams and non-technical executive leadership.

Committed to Ethical AI: Dedicated to building ML systems that are fair, transparent, and operate with the utmost responsibility toward user safety and platform civility.

Annual Salary Range

$295,250 — $345,040 USD

Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).

Roblox provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. Roblox also provides reasonable accommodations to candidates with qualifying disabilities or religious beliefs during the recruiting process.


