
Applied AI / Evaluation Engineer
NAVEX Global, Inc., Charlotte, NC, United States
As an Applied AI / Evaluation Engineer, you will own the quality, measurement, and behavioral assurance of the NAVEX AI Product System. You will build and operate evaluation harnesses, quality gating mechanisms, and human-in-the-loop tooling that ensure AI behavior is safe, consistent, and improving over time. In an agentic context, you will create the evaluation and regression testing systems that reduce drift and make agent behavior predictable—integrating continuous evaluation into CI/CD and production monitoring. You will be the guardian of AI quality, ensuring that no AI capability reaches production without passing rigorous evaluation. If you want to ensure enterprise agentic AI systems are trustworthy and measurably excellent, this role is for you.
You’ll thrive in this hybrid role surrounded by an engaged, collaborative team deeply committed to your success. Join us and help shape what’s next!
What you’ll get:
Meaningful Purpose.
Your work helps organizations operate with integrity and protect their people—at a scale few companies can match.
High-Performance Environment.
We move with urgency, set ambitious goals, and expect excellence. You’ll be trusted with real ownership and supported to do the best work of your career.
Candid, Supportive Culture.
We communicate openly, challenge ideas—not people—and value teammates who embrace bold thinking and continuous improvement.
Growth That Matters.
You can count on authentic feedback, strong accountability, and leaders invested in your success so you can achieve real growth.
Rewards for Results.
We provide clear, competitive compensation designed to recognize measurable outcomes and real impact.
What you’ll do:
Design, build, and operate the AI evaluation and regression harness that gates all AI releases—developing scenario suites, golden traces, and automated quality gates to reduce drift and make behavior predictable
Define and maintain evaluation dimensions including groundedness, accuracy, relevance, safety, and policy adherence
Build and curate versioned reference datasets (golden sets) covering common usage patterns and known failure modes
Implement LLM-as-judge evaluation pipelines and rationale validation frameworks
Develop and operate human-in-the-loop (HITL) tooling and signal capture systems
Build drift detection and regression tracking capabilities to monitor AI behavioral stability over time
Design quality gates that enforce measurable thresholds before AI capabilities are promoted to production (a minimal sketch of such a gate follows this list)
Instrument agent observability—including end-to-end tracing for agent runs (tool-call success rates, failure analysis, latency and cost monitoring)—and use observability to debug and continuously improve
Normalize and associate human review signals with AI interactions for continuous improvement
Collaborate with data scientists and platform engineers to instrument telemetry across AI system components
Produce evaluation reports and quality metrics that support governance, compliance, and leadership review
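To make the quality-gate and LLM-as-judge responsibilities above concrete, here is a minimal Python sketch. It is illustrative only, not NAVEX's actual harness: the `judge` callable, dimension names, and thresholds are all hypothetical, and a real judge would wrap an LLM call with a scoring rubric rather than the stub shown.

```python
"""Minimal sketch of a golden-set quality gate (illustrative assumptions only)."""
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class GoldenCase:
    prompt: str
    reference: str   # expected ("golden") response
    candidate: str   # response produced by the system under test

# Hypothetical per-dimension gate thresholds; the posting names groundedness,
# accuracy, relevance, safety, and policy adherence as evaluation dimensions.
THRESHOLDS = {"groundedness": 0.85, "safety": 0.99, "relevance": 0.80}

def run_gate(
    cases: list[GoldenCase],
    judge: Callable[[str, GoldenCase], float],
) -> tuple[bool, dict[str, float]]:
    """Score every golden case on every dimension; pass only if every
    mean score clears its threshold."""
    scores = {dim: mean(judge(dim, c) for c in cases) for dim in THRESHOLDS}
    passed = all(scores[dim] >= THRESHOLDS[dim] for dim in THRESHOLDS)
    return passed, scores

if __name__ == "__main__":
    # Stub judge so the sketch runs without model access; in practice this
    # would call an LLM with a rubric prompt and parse the returned score.
    def stub_judge(dimension: str, case: GoldenCase) -> float:
        return 1.0 if case.candidate == case.reference else 0.5

    demo = [GoldenCase("Summarize policy X.", "Policy X requires...", "Policy X requires...")]
    ok, report = run_gate(demo, stub_judge)
    print("gate passed:", ok, report)
```

Wired into CI/CD, a gate like this would block promotion of a release whenever any dimension's mean score falls below its threshold.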
What you’ll bring:
Bachelor’s or Master’s degree in Computer Science, Data Science, Statistics, or a related STEM field
5+ years’ experience in ML engineering, AI evaluation, or applied AI quality assurance
Strong experience building evaluation harnesses, regression testing frameworks, and quality gating pipelines for LLM-based systems
Experience with LLM-as-judge methodologies and automated evaluation techniques
Evaluation-first mindset—experience implementing continuous evaluation pipelines that integrate with CI/CD and production monitoring, including stress testing against edge cases and adversarial scenarios
Proficiency in Python with experience in ML/NLP evaluation libraries and frameworks
Knowledge of statistical methods for measuring AI quality, drift, and behavioral stability (a minimal drift-check sketch follows this section)
Observability literacy for agent decisions—ability to implement or use tooling that evaluates agent behaviors like tool selection and tool argument correctness
Experience designing and implementing human-in-the-loop review workflows
Understanding of AI safety, bias detection, and policy compliance evaluation, with practical security awareness for LLM applications
Comfort working in an iterative “build, test, ship, observe, refine” cycle
Culture Agility.
Comfort working in a fast-paced, candid environment that values innovation, healthy debate, and follow-through
Fuel Performance and Outcomes.
Leverage your job competencies and champion NAVEX’s core values
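For the statistical drift point above, one common approach (an assumption here, not a stated NAVEX method) is a two-sample Kolmogorov-Smirnov test comparing per-case judge scores between a baseline release and a candidate. The alpha level and sample numbers below are hypothetical.

```python
"""Minimal sketch of score-drift detection between two evaluation runs."""
from scipy.stats import ks_2samp

def scores_drifted(baseline: list[float], current: list[float],
                   alpha: float = 0.01) -> bool:
    """Flag drift if the current score distribution differs from the
    baseline at significance level alpha."""
    result = ks_2samp(baseline, current)
    return result.pvalue < alpha

if __name__ == "__main__":
    baseline = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]  # prior release's scores
    current  = [0.74, 0.70, 0.78, 0.72, 0.71, 0.75]  # candidate's scores
    print("drift detected:", scores_drifted(baseline, current))
```

In a production monitoring loop, the same comparison could run on rolling windows of live scores to catch behavioral drift between releases, not just at gate time.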
Our side of the deal:
We’ll be clear, we’ll move fast, and we’ll invest in your success. You deserve to be supported, challenged, and rewarded for the impact you make—and we commit to doing that every step of the way.
The starting pay for this role is $155,000+ per annum, plus a 15% MBO bonus. Discover how you can grow, lead, and make an impact by visiting our career page to learn more. NAVEX is an equal opportunity employer committed to welcoming individuals of all backgrounds, including people with disabilities and veterans.