Machine Learning Engineer, Training Infrastructure Job at Hedra, Inc in San Fran

Hedra, Inc, San Francisco, CA, United States

About Hedra
Hedra is a pioneering generative media company backed by top investors at Index, A16Z, and Abstract Ventures. We're building Hedra Studio, a multimodal creation platform capable of control, emotion, and creative intelligence.
At the core of Hedra Studio is our Character-3 foundation model, the first omnimodal model in production. Character-3 jointly reasons across image, text, and audio for more intelligent video generation — it’s the next evolution of AI-driven content creation.
At Hedra, we’re a team of hard-working, passionate individuals seeking to fundamentally change content creation and build a generational company together. We value startup energy, initiative, and the ability to turn bold ideas into real products. Our team is fully in-person in SF/NY with a shared love for whiteboard problem-solving.
Overview
We are looking for an ML Engineer with 3+ years of experience in high-performance computing systems to manage and optimize our computational infrastructure for training and deploying our machine learning models. The ideal candidate has diverse experience managing ML workloads at scale and will support our 3DVAE and video diffusion models. We encourage you to apply even if you don't meet every requirement. We value curiosity, creativity, and the drive to solve hard problems.

Responsibilities
Design, implement, and maintain scalable computing solutions for training and deploying ML models, ensuring infrastructure can handle large video datasets.

Manage and optimize the performance of our computing clusters or cloud instances, such as AWS or Google Cloud, to support distributed training.

Ensure that our infrastructure can handle the resource-intensive tasks associated with training large generative models.

Monitor system performance and implement improvements to maximize efficiency and utilization, using tools like Airflow for orchestration.

Collaborate across research teams to understand their computational needs and provide appropriate solutions, facilitating seamless model deployment.

Qualifications
Bachelor’s degree in Computer Science, Information Technology, or a related field, with a focus on system administration.

Experience with cloud computing platforms such as Amazon Web Services, Google Cloud, or Microsoft Azure, essential for managing large-scale ML workloads.

Values engineering processes and version control (CI/CD).

Knowledge of containerization technologies like Docker and Kubernetes, required for deployments at scale.

Understanding of distributed training techniques and how to scale models across multi-node clusters, in line with video generation needs.

Strong problem-solving and communication skills, given the need to collaborate with diverse teams.

This role is vital for ensuring the computational backbone supports the company’s ML efforts, focusing on deployment and scalability.
Benefits
Competitive compensation + equity

401k (no match)

Healthcare (Silver PPO Medical, Vision, Dental)

Lunch and snacks at the office


