
GPU Systems / AI Infra Engineer
Darwin Recruitment, New York, NY, United States
Senior GPU Systems / AI Infrastructure Engineer (NYC)
Location: New York City (Hybrid / On-site preferred)
Comp: Competitive + equity (Series A-C / high-growth AI infra)
About the Role
We’re hiring a senior-level engineer to build and optimise next-generation AI infrastructure powering large-scale model training and inference. This role sits at the intersection of GPU systems, kernel optimisation, distributed compute, and high-performance AI workloads.
You’ll work directly on the performance layer of modern AI stacks, where milliseconds matter, GPUs are saturated, and inefficiencies translate directly into cost and latency at scale.
This is a deeply technical role for engineers who are comfortable working close to the metal and care about squeezing every ounce of performance out of modern accelerators (NVIDIA, AMD, and emerging architectures).
What You’ll Work On
Design and optimise GPU kernels (CUDA / Triton / HIP) for large-scale AI workloads
Build and tune high-performance inference and training pipelines for LLMs and multimodal models
Work on distributed systems for AI training (multi-node, multi-GPU clusters)
Improve memory bandwidth utilisation, kernel fusion, and compute efficiency
Contribute to or extend frameworks like PyTorch, JAX, or custom runtimes
Build tooling for profiling, benchmarking, and performance regression detection
Collaborate closely with ML researchers and infra engineers to remove system bottlenecks
What We’re Looking For (Core Profile / MPC Fit)
You’re likely a strong match if you have:
5-10+ years in systems engineering, HPC, GPU computing, or AI infrastructure
Deep experience with CUDA programming and GPU kernel optimisation
Strong understanding of parallel computing, memory hierarchies, and compute bottlenecks
Experience with distributed systems (Ray, MPI, NCCL, custom cluster orchestration, etc.)
Background in high-performance C++ / Rust / Python systems
Experience working on training or inference stacks for large-scale ML models
Strong intuition for performance profiling (Nsight, perf, flamegraphs, etc.)
Nice to Have
Experience with Triton, TVM, or MLIR-based compiler stacks
Exposure to kernel fusion, graph compilation, or runtime optimisation
Experience at AI infra startups, hyperscalers, or HPC environments
Familiarity with quantisation, KV caching, or inference acceleration techniques
Contributions to open-source ML systems or GPU libraries
Background in CUDA graph execution, stream scheduling, or warp-level optimisation
Why This Role
Work on the critical performance layer of AI systems (not application-level ML)
Direct impact on cost, latency, and scalability of frontier AI models
High autonomy: own entire subsystems (kernel → runtime → distributed execution)
NYC-based team building at the forefront of AI infrastructure and compute optimisation
Opportunity to shape systems used at massive scale in production ML workloads