
Senior Systems HPC Engineer
Jobgether, New Bremen, OH, United States
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Systems HPC Engineer in Germany.
This role sits at the core of next-generation AI cloud infrastructure, focusing on the performance, reliability, and scalability of large-scale high-performance computing (HPC) systems. You will work across the full technology stack, from hardware and low-level system software to networking and distributed compute frameworks. Your mission will be to analyze, optimize, and troubleshoot complex GPU cluster environments running demanding AI workloads. Operating in a highly technical and collaborative setting, you will interact with leading infrastructure, hardware, and software experts worldwide. The role combines deep systems engineering with performance tuning and architectural analysis. It is ideal for an engineer passionate about understanding how large‑scale computing systems behave under real‑world AI demands and improving them at every layer.
Accountabilities
Analyze and optimize the performance of large-scale GPU clusters across hardware and software layers
Investigate and troubleshoot complex system and workload performance issues, including AI training and inference workloads
Evaluate, test, and integrate new hardware components, system configurations, and tuning strategies
Support escalations related to performance issues from internal teams and external stakeholders
Collaborate closely with infrastructure, software engineering, and hardware vendors to improve system performance and reliability
Contribute to cluster qualification and validation processes, ensuring systems meet defined performance benchmarks
Identify bottlenecks across networking, compute, and storage layers and drive technical improvements
Enhance system observability and performance monitoring practices across HPC environments
Requirements
5+ years of experience in system-level software engineering with a focus on performance optimization
3+ years of hands‑on experience with Linux systems administration, troubleshooting, and performance tuning
Strong understanding of server architecture, including PCIe, NICs, kernel‑level components, and HPC systems
Experience with high‑performance networking technologies such as InfiniBand or RoCE
Solid programming skills in C/C++, Go, or Python, with a performance‑oriented mindset
Strong analytical and debugging skills across distributed and low‑level systems
Experience working in complex, large‑scale infrastructure environments is highly desirable
Ability to collaborate with cross‑functional engineering teams and hardware vendors
Comfortable working in environments where coding assessments are part of the hiring process
Benefits
Competitive salary and comprehensive benefits package
Flexible working arrangements, including remote options
Opportunities for professional growth in a cutting‑edge AI infrastructure environment
Exposure to large‑scale GPU clusters and advanced HPC technologies
Collaborative, engineering‑driven culture focused on innovation and technical excellence
Work alongside global experts in AI, systems engineering, and hardware development
High‑impact role contributing directly to next‑generation AI compute platforms
#J-18808-Ljbffr
This role sits at the core of next-generation AI cloud infrastructure, focusing on the performance, reliability, and scalability of large-scale high-performance computing (HPC) systems. You will work across the full technology stack, from hardware and low-level system software to networking and distributed compute frameworks. Your mission will be to analyze, optimize, and troubleshoot complex GPU cluster environments running demanding AI workloads. Operating in a highly technical and collaborative setting, you will interact with leading infrastructure, hardware, and software experts worldwide. The role combines deep systems engineering with performance tuning and architectural analysis. It is ideal for an engineer passionate about understanding how large‑scale computing systems behave under real‑world AI demands and improving them at every layer.
Accountabilities
Analyze and optimize the performance of large-scale GPU clusters across hardware and software layers
Investigate and troubleshoot complex system and workload performance issues, including AI training and inference workloads
Evaluate, test, and integrate new hardware components, system configurations, and tuning strategies
Support escalations related to performance issues from internal teams and external stakeholders
Collaborate closely with infrastructure, software engineering, and hardware vendors to improve system performance and reliability
Contribute to cluster qualification and validation processes, ensuring systems meet defined performance benchmarks
Identify bottlenecks across networking, compute, and storage layers and drive technical improvements
Enhance system observability and performance monitoring practices across HPC environments
Requirements
5+ years of experience in system-level software engineering with a focus on performance optimization
3+ years of hands‑on experience with Linux systems administration, troubleshooting, and performance tuning
Strong understanding of server architecture, including PCIe, NICs, kernel‑level components, and HPC systems
Experience with high‑performance networking technologies such as InfiniBand or RoCE
Solid programming skills in C/C++, Go, or Python, with a performance‑oriented mindset
Strong analytical and debugging skills across distributed and low‑level systems
Experience working in complex, large‑scale infrastructure environments is highly desirable
Ability to collaborate with cross‑functional engineering teams and hardware vendors
Comfortable working in environments where coding assessments are part of the hiring process
Benefits
Competitive salary and comprehensive benefits package
Flexible working arrangements, including remote options
Opportunities for professional growth in a cutting‑edge AI infrastructure environment
Exposure to large‑scale GPU clusters and advanced HPC technologies
Collaborative, engineering‑driven culture focused on innovation and technical excellence
Work alongside global experts in AI, systems engineering, and hardware development
High‑impact role contributing directly to next‑generation AI compute platforms
#J-18808-Ljbffr