Overview
As a Site Reliability Engineer, you will play a crucial role in maintaining the availability and performance of secure, high-impact services. You will collaborate with development and support teams to enhance infrastructure, automate processes, and improve system observability. This is a hybrid contract position that offers the chance to work on complex challenges in a technically rich environment.
Responsibilities
- Partner with Software Engineers to enhance reliability and performance across complex systems.
- Collaborate with SysAdmins to automate toil and eliminate manual intervention.
- Build smarter monitoring, logging, and observability pipelines to detect and resolve issues early.
- Support and improve development environments to hit delivery and quality goals.
- Research new tools, services, and architectures to drive scalability and resilience.
- Expand expertise across both cloud and on-prem environments.
Requirements
- Proven experience with Terraform and modern configuration management tools (Ansible, Chef, etc.).
- Skilled with Docker and at least one container orchestrator (Kubernetes, OpenShift, or Docker Swarm).
- Hands-on experience building and maintaining CI/CD pipelines (e.g. Jenkins).
- Deep understanding of monitoring & observability tools (Grafana, Prometheus, InfluxDB).
- Solid grounding in Linux, network security, SQL, and AWS (EC2, S3, RDS, Lambda).
- Comfortable with message queuing systems (RabbitMQ or similar).