Overview
As a Site Reliability Engineer, you will play a key role in ensuring the availability, performance, and resilience of critical infrastructure services. Working in a hybrid environment, you will collaborate with software engineers and system administrators to enhance system observability, automate processes, and address reliability challenges. This position offers the chance to work on high-impact projects within a technically sophisticated team committed to delivering exceptional service.
Responsibilities
- Partner with software engineers to improve system reliability and performance.
- Automate processes in collaboration with system administrators to reduce manual tasks.
- Develop and enhance monitoring, logging, and observability tools to proactively identify issues.
- Support the ongoing improvement of development environments to meet delivery and quality standards.
- Research and implement new tools and architectures to enhance scalability and resilience.
- Build and maintain CI/CD pipelines to streamline deployment processes.
- Expand expertise across both cloud-based and on-premises environments.
Requirements
- Proven experience with Terraform and modern configuration management tools (e.g., Ansible, Chef).
- Strong skills in Docker and container orchestration (e.g., Kubernetes, OpenShift, Docker Swarm).
- Hands-on experience with CI/CD pipeline development (e.g., Jenkins).
- Deep knowledge of monitoring and observability tools (e.g., Grafana, Prometheus, InfluxDB).
- Solid understanding of Linux, network security, SQL, and AWS services (e.g., EC2, S3, RDS, Lambda).
- Familiarity with message queueing systems (e.g., RabbitMQ).
- Experience with Azure environments is a plus.
- Strong programming ability in languages such as Python, Java, or Go is preferred.