GPU Systems / AI Infra Engineer

Darwin Recruitment, New York, NY, United States

Senior GPU Systems / AI Infrastructure Engineer (NYC)
Location: New York City (Hybrid / On-site preferred)

Comp: Competitive + equity (Series A-C / high-growth AI infra)

About the Role
We’re hiring a senior-level engineer to build and optimise next-generation AI infrastructure powering large-scale model training and inference. This role sits at the intersection of GPU systems, kernel optimisation, distributed compute, and high-performance AI workloads.

You’ll work directly on the performance layer of modern AI stacks, where milliseconds matter, GPUs are saturated, and inefficiencies translate directly into cost and latency at scale.

This is a deeply technical role for engineers who are comfortable working close to the metal and care about squeezing every ounce of performance out of modern accelerators (NVIDIA, AMD, and emerging architectures).

What You’ll Work On

Design and optimise GPU kernels (CUDA / Triton / HIP) for large-scale AI workloads

Build and tune high-performance inference and training pipelines for LLMs and multimodal models

Work on distributed systems for AI training (multi-node, multi-GPU clusters)

Improve memory bandwidth utilisation, kernel fusion, and compute efficiency

Contribute to or extend frameworks like PyTorch, JAX, or custom runtimes

Build tooling for profiling, benchmarking, and performance regression detection

Collaborate closely with ML researchers and infra engineers to remove system bottlenecks

What We’re Looking For (Core Profile / MPC Fit)
You’re likely a strong match if you have:

5-10+ years in systems engineering, HPC, GPU computing, or AI infrastructure

Deep experience with CUDA programming and GPU kernel optimisation

Strong understanding of parallel computing, memory hierarchies, and compute bottlenecks

Experience with distributed systems (Ray, MPI, NCCL, custom cluster orchestration, etc.)

Background in high-performance C++ / Rust / Python systems

Experience working on training or inference stacks for large-scale ML models

Strong intuition for performance profiling (Nsight, perf, flamegraphs, etc.)

Nice to Have

Experience with Triton, TVM, or MLIR-based compiler stacks

Exposure to kernel fusion, graph compilation, or runtime optimisation

Experience at AI infra startups, hyperscalers, or HPC environments

Familiarity with quantisation, KV caching, or inference acceleration techniques

Contributions to open-source ML systems or GPU libraries

Background in CUDA graph execution, stream scheduling, or warp-level optimisation

Why This Role

Work on the critical performance layer of AI systems (not application-level ML)

Direct impact on cost, latency, and scalability of frontier AI models

High autonomy: own entire subsystems (kernel → runtime → distributed execution)

NYC-based team building at the forefront of AI infrastructure and compute optimisation

Opportunity to shape systems used at massive scale in production ML workloads
