Guild is hiring: AI Engineer, Agents & Evaluation in San Francisco

Guild, San Francisco, CA, United States

AI Engineer, Agents & Evaluation

Guild.ai

San Francisco, CA

The Opportunity

We're looking for our first AI Engineer focused on agents and evaluation, a foundational hire who will shape how we build, measure, and scale intelligent systems.

Help developers understand, evolve, and operate complex systems using autonomous and event-driven AI. Build evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Create reusable benchmarks and artifacts that will inspire new approaches and push forward the broader foundation model ecosystem.

Enjoy designing experiments, building systems, and iterating tightly between theory and code in a 0-to-1, research-engineering-style role.

What You Will Do

Create Task Evaluations That Matter: Design and implement task-specific evaluations that measure and improve agent quality. Each evaluation should both drive concrete iteration on our agents and spark broader innovation around the task itself.
Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups, etc.) that allow agents to tackle increasingly complex, multi-step, and long-horizon tasks.
Apply Post-Training Techniques: Experiment with post-training approaches (e.g., fine-tuning, preference optimization, reward shaping, distillation) to produce high-performance models tailored to specific tasks and workflows.
Run Experiments End-to-End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.

What You Will Bring

MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience.
Strong background in machine learning and large language models, ideally including both research and hands-on implementation.
2-5 years working with LLM technology, with familiarity across:
- Prompting and interaction patterns
- Agent and tool orchestration strategies
- Evaluation strategies for complex, open-ended tasks
Proficiency writing production-quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks.
Experience designing and running experiments, and interpreting results in messy, real-world settings.
Self-motivated and comfortable operating in an unstructured, high-ambiguity environment.
Strong communication skills and the ability to translate vague goals into concrete, testable setups.

Bonus Points

Experience building agentic systems (tool-using agents, workflows, or multi-agent systems) in real products.
Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing.
Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines).
Contributions to open-source LLM, tooling, or evaluation projects.
Experience at an early-stage startup or research lab where you owned projects end-to-end.

Benefits & Perks

Significant equity in an early-stage, venture-backed startup.
Comprehensive Health Benefits (Medical, Dental, Vision).
Flexible PTO to ensure you have the time you need to recharge.


Seniority level: Mid-Senior level

Employment type: Full-time

Job function: Engineering and Information Technology

Industries: Software Development


In Summary: The AI Engineer will shape how we build, measure, and scale intelligent systems. Build evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Enjoy designing experiments, building systems, and iterating tightly between theory and code in a research-engineering-style role.
