Senior Systems HPC Engineer

Jobgether, New Bremen, OH, United States

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Systems HPC Engineer in Germany.

This role sits at the core of next-generation AI cloud infrastructure, focusing on the performance, reliability, and scalability of large-scale high-performance computing (HPC) systems. You will work across the full technology stack, from hardware and low-level system software to networking and distributed compute frameworks. Your mission will be to analyze, optimize, and troubleshoot complex GPU cluster environments running demanding AI workloads. Operating in a highly technical and collaborative setting, you will interact with leading infrastructure, hardware, and software experts worldwide. The role combines deep systems engineering with performance tuning and architectural analysis. It is ideal for an engineer passionate about understanding how large‑scale computing systems behave under real‑world AI demands and improving them at every layer.

Accountabilities

Analyze and optimize the performance of large-scale GPU clusters across hardware and software layers

Investigate and troubleshoot complex system and workload performance issues, including AI training and inference workloads

Evaluate, test, and integrate new hardware components, system configurations, and tuning strategies

Support escalations related to performance issues from internal teams and external stakeholders

Collaborate closely with infrastructure, software engineering, and hardware vendors to improve system performance and reliability

Contribute to cluster qualification and validation processes, ensuring systems meet defined performance benchmarks

Identify bottlenecks across networking, compute, and storage layers and drive technical improvements

Enhance system observability and performance monitoring practices across HPC environments

Requirements

5+ years of experience in system-level software engineering with a focus on performance optimization

3+ years of hands‑on experience with Linux systems administration, troubleshooting, and performance tuning

Strong understanding of server architecture, including PCIe, NICs, kernel‑level components, and HPC systems

Experience with high‑performance networking technologies such as InfiniBand or RoCE

Solid programming skills in C/C++, Go, or Python, with a performance‑oriented mindset

Strong analytical and debugging skills across distributed and low‑level systems

Experience working in complex, large‑scale infrastructure environments is highly desirable

Ability to collaborate with cross‑functional engineering teams and hardware vendors

Comfortable working in environments where coding assessments are part of the hiring process

Benefits

Competitive salary and comprehensive benefits package

Flexible working arrangements, including remote options

Opportunities for professional growth in a cutting‑edge AI infrastructure environment

Exposure to large‑scale GPU clusters and advanced HPC technologies

Collaborative, engineering‑driven culture focused on innovation and technical excellence

Work alongside global experts in AI, systems engineering, and hardware development

High‑impact role contributing directly to next‑generation AI compute platforms

#J-18808-Ljbffr