Overview
We are seeking an experienced Site Reliability Engineer (SRE) to ensure the availability and performance of our cross-domain services, which are used within high-profile government organizations. In this role, you will collaborate with development and support teams to enhance our infrastructure and delivery pipelines, improve system observability, and mitigate reliability risks. The ideal candidate will have a strong understanding of cloud hosting services and infrastructure management tools.
Responsibilities
- Design and maintain reliable, scalable physical and virtual infrastructure.
- Monitor system performance and proactively resolve issues.
- Automate processes using tools such as Ansible to improve efficiency and consistency.
- Collaborate with engineers and stakeholders across the business.
- Support continuous improvement of systems, tools, and practices.
- Operate across the full infrastructure stack, from bare-metal systems to virtual deployments.
Requirements
- Experience using modern configuration management tools (Ansible, Chef, or similar).
- Experience working with Terraform.
- Experience with Docker containers and orchestration tools (Kubernetes, OpenShift, or Docker Swarm).
- Experience with CI/CD tools (Jenkins or similar).
- Familiarity with monitoring tools (InfluxDB, Prometheus, or Grafana).
- Good understanding of relational databases and SQL.
- Proficiency with the Linux command line, system administration, and shell scripting.
- Experience with cloud hosting services (ideally AWS, including EC2, RDS, S3, and Lambda).