
HPC System Administrator
Ieeerusb, Santa Clara, CA, United States
Position Title: HPC System Administrator
Position Type: Regular
Hiring Range: $129,000 - $161,265 annually
Pay Frequency: Annual
POSITION PURPOSE
The High‑Performance Computing (HPC) System Administrator is an expert, hands‑on role responsible for the design, configuration, optimization, and operation of the organization's HPC infrastructure. The role focuses on advanced system optimization, complex troubleshooting, and strategic planning for future enhancements across compute, storage, and high‑speed interconnects (InfiniBand). Key responsibilities include mentoring and cross‑training existing system administrators to build collective HPC expertise, strengthen shared support capabilities, and ensure long‑term operational resilience and efficiency.
ESSENTIAL DUTIES AND RESPONSIBILITIES
1. HPC Infrastructure Management and Optimization
Compute: Manage the entire lifecycle of all compute nodes, including procurement, installation, configuration, and maintenance of hardware, operating systems, and core software to ensure optimal performance and resource utilization for scientific workloads.
Storage: Manage high‑performance parallel file systems (e.g., Lustre, GPFS), NAS, and backup solutions, performing capacity planning, performance tuning, and integrity checks to guarantee secure, high‑speed, and reliable data access for all users.
InfiniBand: Design, deploy, and provide expert‑level troubleshooting and maintenance for the InfiniBand high‑speed interconnect fabric, ensuring low‑latency, high‑bandwidth inter‑node communication essential for scalable HPC application performance.
2. Workload Management and System Deployment
Slurm: Administer, configure, and tune the Slurm Workload Manager, managing job queues, partitions, and resource allocation policies to enforce fair‑share scheduling and maximize cluster utilization.
System Imaging: Develop, maintain, and update standardized, optimized system images for all compute nodes using automation tools to facilitate rapid, consistent deployment, efficient patching, and streamlined upgrades.
Software Licenses: Oversee administration and compliance of all commercial scientific software licenses, ensuring adherence to vendor agreements and strategically managing license servers and usage policies.
3. Team Development and Strategic Planning
Knowledge Transfer: Develop and implement a formal cross‑training program for existing system administrators by creating documentation and delivering hands‑on instruction to enhance team expertise in HPC‑specific technologies.
Operational Resilience: Ensure robust shared support capabilities across the IT team by strategically transferring HPC knowledge, preventing single points of failure, and improving overall efficiency and responsiveness.
Strategic Enhancement: Contribute to the planning and roadmap development for future HPC infrastructure and software enhancements by researching emerging technologies and providing expert recommendations.
4. Coordination and Collaboration
Participate in architecture brainstorming and design discussions with technical team members, providing expertise during planning and implementation phases of new technologies.
Provide technical guidance on complex infrastructure architecture challenges to IS team members and other solution partners.
Coach and develop new team members on customer service and problem‑solving approaches.
Model and support team members in conducting themselves with openness, honesty, and trust.
5. Resource Planning
Provide input on setting Enterprise Systems and CIT goals, objectives, and strategies based on the University's mission.
Contribute to technology planning processes to develop cost‑effective customer‑focused solutions.
Plan and lead projects and working groups using strong technical and organizational knowledge.
6. Service Delivery
Collaborate with the Enterprise Systems Manager in planning, maintenance, and secure expansion of SCU's computing infrastructure.
Apply architecture principles consistently across data center compute and storage services.
Work with the Information Security Office to ensure a secure and compliant enterprise environment.
Ensure appropriate distribution of infrastructure services to faculty, staff, and students.
Create and document standards and practices for data center, compute, and storage services.
Oversee creation and performance of infrastructure production and test environments.
Support assigned systems with on‑call availability and respond within agreed timeframes.
Implement and maintain standard routines for patching, updates, and firmware for physical, virtual, and hosted systems.
Participate in backup operations, disaster recovery, and business continuity planning.
Perform daily system monitoring to verify hardware, server resources, and key processes.
Establish upgrade and update schedules and maintenance windows, and keep systems at current release levels.
Liaise with hosted platform and third‑party providers to monitor service level agreements.
7. Service Optimization
Enhance existing architecture frameworks to define and implement simplified, standards‑based system architectures.
Assist in designing and implementing infrastructure optimization and process improvement projects.
Test and assess infrastructure against benchmarks to ensure optimal performance.
Participate in IT and information security audits and prioritize corrective actions.
Participate in change management to ensure all changes are documented, tested, and deployed with back‑out strategies.
Incorporate industry trends to improve services and reduce costs.
8. Communication
Effectively communicate complex data analyses to provide technical and strategic input during project planning.
Maintain regular communication with Cyberinfrastructure Technologies colleagues regarding initiatives.
Keep the Enterprise Systems Manager informed of current and potential issues, activities, operational outages, and risks.
9. Operations
Suggest operation strategies for major shifts in customer needs.
Determine procedures for maintaining data center servers and systems in reliable operation.
Provide expertise and technical support for faculty and student technology adoption.
Support compute and storage needs of institutional programs, building robust solutions.
Empower end users to use technology successfully.
Interface with vendors and external resources.
Evaluate new software or systems under consideration for adoption.
Maintain asset management procedures.
Automate and streamline procedures within the department.
Be on‑call as required.
Use programming, scripting, and diagnostic tools to support infrastructure.
10. Other Duties
Other duties as assigned by the Manager of Enterprise Systems and IS leadership.
Q. PROVIDES WORK DIRECTION
May supervise student workers.
R. GENERAL GUIDELINES
Recommend initiatives and implement changes to improve quality and services.
Identify and determine causes of problems; develop and present improvement recommendations.
Maintain contact with customers to solicit feedback for improved services.
Maximize productivity through appropriate tools; plan training and performance initiatives.
Research and develop resources for efficient workflow.
Prepare progress reports and inform supervisor of project status and deviations from goals.
Prepare and submit reports as requested.
Develop and implement guidelines to support unit functions.
QUALIFICATIONS
1. Knowledge, Skills and Abilities
General
Knowledge of IT, campus technology, and information security issues and trends in higher education.
Listen and understand customer needs.
Plan, implement, and evaluate customer service initiatives.
Work collaboratively in a team to meet deadlines and achieve goals.
Manage a diverse workforce to provide excellent customer service.
Self‑motivated and shows initiative.
Manage multiple projects simultaneously.
Project planning and management experience.
Exercise independent judgment and critical thinking.
Work effectively under pressure in a busy environment.
Explain technical issues and policies to non‑technical people.
Give presentations on technical issues to a broad range of audiences.
Foster and maintain relationships with faculty, administrators, students, and leaders.
Handle sensitive matters with diplomacy and mediation.
Maintain confidentiality and handle confidential information.
Possess impeccable integrity.
Speak truth to power.
Appreciate the University’s mission, vision, values, priorities, procedures, and policies.
Position‑specific
Knowledgeable and experienced in large‑scale computer center operations with Linux and Windows Server systems.
Experience managing SAN storage environments.
Strong proficiency managing multi‑platform hardware and software environments (Microsoft, Linux).
Experience with configuration management tools such as Ansible.
Experience with Slurm and job scheduling.
Proficiency in scripting languages (Python, Shell, Perl).
Experience compiling software packages and managing modules in HPC (EasyBuild, Lmod).
Experience racking servers and adding PCI cards.
Experience with LDAP and DNS.
Experience with parallel file systems.
Experience with InfiniBand networks.
Experience with vSphere, ESXi.
Experience configuring system monitoring tools.
Experience with enterprise backups and cloud providers.
Skilled technical troubleshooter who can analyze and solve complex problems.
Knowledge of personal computer use and standard productivity tools.
Experience interacting and working with others in a customer‑service capacity.
Knowledge of industry trends in enterprise infrastructure.
Experience with Identity and Access Management.
Excellent interpersonal, written, and verbal communication skills.
Demonstrated ability to work in a team environment.
Strong organizational skills and multitasking ability.
Self‑starter who proactively identifies and resolves problems.
Ability to acquire and apply new skills quickly.
Strong customer service orientation.
Understands the role of enterprise computing in University business processes.
Works under limited supervision.
2. Education
Bachelor’s degree in a directly applicable field (Computer/Electrical Engineering, Math/Computer Science, Operations and Management Information Science).
Advanced degree preferred.
3. Experience
8+ years of applicable experience in operation, maintenance, support, and design of enterprise‑wide computer center systems with increasing responsibilities.
2+ years of experience supporting an HPC environment, including experience with Slurm, InfiniBand, and Lustre or similar parallel file systems.
Experience working in higher education or research organizations is desirable.
PHYSICAL DEMANDS
Operate equipment in racks, tall shelves, or cabinets.
Climb ladders and work at heights.
Work in confined spaces.
Spend considerable time at a desk using a computer terminal.
Travel to other campus buildings.
Occasional travel to remote campuses or vendor locations.
Attend conferences or training sessions within the Bay Area or elsewhere.
WORK ENVIRONMENT
Typical office and computer lab environment.
Mostly indoor office environment with windows.
Offices with equipment noise and frequent interruptions.
Data centers with wiring, equipment, tight spaces, and low light.
Raised floor access and above‑ceiling spaces.
High‑wall or basement locations for equipment.
EEO STATEMENT
Equal Opportunity/Notice of Nondiscrimination. Santa Clara University is an equal opportunity employer. All qualified applicants are encouraged to apply and will receive consideration for employment without regard to race, color, ethnicity, national origin, citizenship, ancestry, religion, age, sex, sexual orientation, gender, gender expression, gender identity, marital status, parental status, veteran or military status, physical or mental disability, medical conditions, pregnancy or related conditions, reproductive health decision making, or any other characteristic protected by federal, state, or local laws.