
Senior Platform Engineer
Cadre5, Knoxville, TN, United States
Overview
Cadre5 provides innovative technical solutions to customers locally and nationally. Our Cadre5 Lab Partners division has partnered with the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) to recruit highly qualified individuals who will play a key role in improving the security, performance, and reliability of the NCCS computing infrastructure. That infrastructure supports multiple highly ranked Top500 supercomputers, including Frontier, the world’s first exaflop system.
ORNL delivers the scientific discoveries and technical breakthroughs needed to realize solutions in energy and national security, and it provides economic benefit to the nation. This premier research institution, located near Knoxville in Oak Ridge, TN, addresses national needs through impactful research and world-leading research centers.
Job Responsibilities
Work with the team to define and implement best practices and standards within the organization
Keep the Kubernetes platform reliable, available, and fast
Architect solutions that improve the reliability, scalability, performance, and efficiency of our services
Respond to, investigate, and fix service issues all the way from bare metal through the OS to the application code
Coordinate with vendors to resolve hardware and software problems
Participate in an on-call rotation providing 24-hour, 7-day support and off-hours maintenance windows
Work with users to help them use Kubernetes
Basic Qualifications
Bachelor’s degree in a scientific field and a minimum of 8 years of relevant experience required. Equivalent combinations of education and experience will be considered.
Candidates with a Bachelor’s degree and 5+ years of relevant experience may also be considered based on qualifications
Experience managing and maintaining on-premises cluster infrastructure
Excellent interpersonal/communications skills, and the ability to work as part of a team
Strong working knowledge of Linux systems fundamentals and networked computing environment concepts
Experience with code reviews, code quality, CI/CD tooling, GitOps, and SCM (e.g. GitLab)
Ability to identify requirements and to define, plan, and implement requisite solutions for small and medium projects
Ability to develop and maintain programs and scripts that aid in the operation and automation of tasks using various shell and scripting languages (primarily bash, Python, and Go)
Experience with on-call rotation
The ability to obtain and maintain a Department of Energy "Q" clearance is required. This requires US Citizenship.
Preferred Qualifications
Experience with Kubernetes as a cluster administrator for on-premises deployments
Subject matter expert in Kubernetes as a cluster administrator for bare metal, on-premises deployments
Excellent interpersonal/communications skills, including the ability to communicate effectively with other teams and organizational leadership and to convey technical details to a non- or semi-technical audience.
Ability to identify requirements and to define, plan, and implement requisite solutions for large, organizationally impactful projects.
Self-driven with the ability to work in a dynamic, loosely structured research & development environment.
Experience with RKE2 (nice to have: Red Hat OpenShift and Talos), multi-cluster management tools for Kubernetes (e.g. Fleet), and container security tools (NeuVector, SCC, pod admission control)
Experience managing image registries such as Quay or Harbor
Experience using tools such as Prometheus, Nagios, and Grafana to monitor systems, collect metrics, and create dashboards
Experience designing and implementing highly-available systems/services
Experience with Infrastructure-as-Code tooling such as Terraform, Helm, and Puppet
Experience implementing systems-level security technologies (e.g. SELinux, seccomp, Linux capabilities), along with experience in DevSecOps and general security best practices.
Experience with AIOps and MLOps tooling (e.g. KServe, Kubeflow, vLLM, NVIDIA AI Enterprise, AMD Silo AI, ClearML, MLflow)
Experience using HPC hardware for Kubernetes (e.g. RDMA, DPUs, InfiniBand, many-core CPUs)
Experience with declarative CI/CD tools such as Argo CD
Experience with workflow engines such as Apache Airflow or Argo Workflows
Experience with infrastructure automation
Cloud engineering experience with at least one cloud service provider
Experience with reusable, automated workflows such as PagerDuty playbooks
Cadre5 offers excellent pay and benefits, including full medical, dental, and vision coverage, a 401(k) match, 15 days of PTO, and 10 holidays.
Cadre5 is an equal opportunity employer. All qualified applicants, including individuals with disabilities and protected veterans, are encouraged to apply. Cadre5 is an E-Verify Employer.