Amazon is hiring: Sr. Machine Learning Engineer, Amazon General Intelligence (AG

Amazon, San Francisco, CA, United States


Sr. Machine Learning Engineer, Amazon General Intelligence (AGI)
Our Machine Learning training infrastructure (ML Infra) team is responsible for designing, implementing, and optimizing large-scale computing infrastructure that powers our cutting-edge AI and machine learning initiatives. We leverage advanced hardware, innovative software architectures, and distributed computing techniques to enable breakthrough research and product development across the company.
We are seeking a Senior Machine Learning Engineer to join our team and lead the development of our next-generation ML training infrastructure. This is a high-impact, high-visibility role that will shape the future of our machine learning capabilities and contribute to the advancement of AI technology across the industry.
Key Responsibilities
Lead the definition, design, architecture quality, implementation, and delivery of the most advanced, complex and cross-cutting challenges spanning our ML infrastructure.
Align teams in ML Infrastructure and related organizations to a coherent technical vision and deliver systems that fit well together.
Influence multiple teams, increasing their productivity and effectiveness. Hold peers and teams to a high bar for performance and efficiency.
Guide difficult trade-off decisions and drive awareness about the impact and consequences of technical decisions on AI research and product development.
Demonstrate significant innovation, creativity, and judgment when solving challenging AI/ML infrastructure problems.
Identify future skills needed across the organization and advocate for skill development or acquisition to senior leaders. Scout top talent and recruit them to the company.
Actively mentor senior and Principal engineers, and scale yourself by developing and institutionalizing best practices in AI/ML infrastructure and distributed computing across the organization.
A day in the life
8+ years of professional software development experience in distributed systems with emphasis on ML infrastructure.
8+ years of current programming experience building ML infrastructure using languages such as Python, C++, or Rust.
Hands‑on experience with parallel computing platforms such as CUDA, OpenMP, etc.
Deep understanding of AI frameworks such as PyTorch, TensorFlow, and JAX, and their demands on underlying compute infrastructure, memory bandwidth, network interconnect, and storage as scale goes up.
Knowledge of emerging AI hardware accelerators and architectures.
Experience with containerization and orchestration technologies (Docker, Kubernetes).
Experience with cloud computing platforms (AWS, Azure, GCP) and their offerings.
Basic Qualifications
5+ years of non‑internship professional software development experience.
5+ years of programming with at least one software programming language.
5+ years of leading design or architecture of new and existing systems.
Experience as a mentor, tech lead or leading an engineering team.
Preferred Qualifications
5+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience.
Bachelor's degree in computer science or equivalent.
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status. Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information.


