
HPC Systems Administrator
Empire AI, Buffalo, NY, United States
Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions, including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, the University of Rochester, RIT, Mount Sinai, and the Flatiron Institute.
By leveraging the state's rich academic and research institutions, Empire AI is driving innovation in fields such as medicine, education, energy, and climate change, while giving New York's researchers access to computing resources that are often prohibitively expensive and typically available only to big tech companies. In doing so, the initiative fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society's most complex challenges.
The initiative is funded by more than $500 million in public and private investment, including a state capital grant and support from member academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).
Position Summary
The HPC Systems Administrator will administer, optimize, and support the high-performance computing platforms that power Empire AI's AI/ML workloads, scientific research, and large-scale simulation across its statewide consortium. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for day-to-day cluster operations, job scheduling, GPU resource management, and systems reliability across Empire AI's distributed HPC infrastructure.
This role ensures that Empire AI's shared computing environments remain available, performant, and accessible to researchers across partner institutions. The HPC Systems Administrator works at the intersection of systems administration, AI/ML infrastructure support, and research computing, bridging the gap between complex user workloads and the underlying HPC platform.
Duties and Responsibilities
Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
Administer and optimize the Slurm workload manager, including partition design, QOS policies, fair-share accounting, and cross-institutional workload orchestration models
Manage NVIDIA GPU resources (H100/H200/GB200), including driver, CUDA, firmware, and NCCL lifecycle management for AI training and inference workloads
Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
Support containerized and virtualized research environments using Apptainer/Singularity, Pyxis, and Enroot
Troubleshoot performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and AI workloads
Administer parallel file systems such as Lustre and VAST and integrate them with cluster storage workflows
Establish incident alerting and escalation procedures for HPC clusters and infrastructure
Manage detailed monitoring dashboards (Prometheus, Grafana) to track critical metrics: network throughput, GPU utilization, cluster health, and job telemetry
AI/ML Infrastructure Support
Architect and support systems for AI training and inference pipelines, including large language models (LLMs) and multimodal AI workloads
Tune and benchmark systems for GPU-intensive AI/ML frameworks including PyTorch and TensorFlow
Work with research faculty to translate scientific goals into technical configurations and workload requirements
Evaluate emerging HPC hardware and software solutions and propose procurement recommendations aligned with AI/ML workload demands
Security & Compliance
Enforce security baselines, access control policies, and network segmentation across HPC environments
Integrate robust monitoring, alerting, access control, and disaster recovery planning into cluster operations
Partner with the Security & Compliance specialist to ensure security is integrated into system design and workload orchestration
Research Support & Documentation
Consult with research teams across consortium institutions to assess computational needs and advise on workflow optimization
Translate user feedback and researcher requirements into system-level improvements and configuration optimizations
Maintain clear system documentation, configuration guides, runbooks, and architecture diagrams
Minimum Qualifications
Bachelor's degree in Computer Science, Engineering, or a related technical field
5+ years of hands-on experience administering Linux-based HPC clusters in production environments, supporting research or scientific computing projects
Expertise with job schedulers (e.g., Slurm) and GPU computing
Familiarity with AI/ML frameworks, container environments (Apptainer/Singularity, Pyxis, Docker), and distributed storage systems
Working knowledge of InfiniBand networking (subnet management, UFM, OpenSM) and/or RoCEv2/Ethernet HPC fabrics
Proficiency in Bash and Python scripting for automation and systems administration
Experience with monitoring stacks: Prometheus, Grafana, or equivalent
Demonstrated success collaborating with researchers or supporting scientific computing projects
Preferred Qualifications
Experience with NVIDIA Base Command Manager (BCM), NVIDIA UFM, or DGX SuperPOD infrastructure
Familiarity with workload patterns and infrastructure needs for training, tuning, and deploying large-scale AI/ML models
Proficiency with infrastructure automation and configuration management tools such as Ansible and Git
Experience supporting or collaborating within academic or industry research environments focused on artificial intelligence, machine learning, or large-scale data science