Overview
We are looking for a highly skilled High-Performance Computing (HPC) Engineer to join our team on a 6-month contract with potential for extension. The successful candidate will work closely with the scientific community to enhance computational workflows and maintain robust HPC infrastructure, applying DevOps principles and Infrastructure-as-Code methodologies. The role starts on November 24th and is hybrid, with three days on-site and two days remote.
Responsibilities
- Design, implement, and maintain secure and scalable HPC infrastructure using Infrastructure-as-Code tools such as Terraform.
- Develop, deliver, and support advanced research computing services and applications.
- Apply Site Reliability Engineering principles to ensure high availability and performance across HPC environments.
- Troubleshoot and resolve complex technical challenges affecting platform and user workloads.
Requirements
- 10+ years of hands-on experience designing and operating large-scale computing environments (HPC, HTC, or Big Compute).
- Proven ability to drive innovation and integrate emerging technologies into HPC solutions.
- Administration experience with cluster and workload management software (e.g., Slurm, LSF, Grid Engine).
- Strong knowledge of Linux system administration, TCP/IP networking, and storage systems.
- Experience managing parallel file systems (e.g., Weka, GPFS, Lustre).
- Hands-on experience with private cloud platforms (e.g., OpenStack).
- Proficiency with configuration management tools (e.g., Ansible, Salt, Puppet).
- Strong scripting skills in Bash and Python for automation and systems management.