Overview
We are seeking an experienced Site Reliability Engineer (SRE) with a focus on Observability to join a leading Wealth/Asset Management firm on a contract basis. The successful candidate will lead the implementation of a new observability solution, working with digital platform and engineering teams to improve system reliability and performance. This remote role involves defining observability standards and promoting data-driven decision-making that aligns with business priorities and digital objectives.
Responsibilities
- Define and drive the observability roadmap in alignment with business and platform objectives.
- Establish and monitor SLIs, SLOs, and error budgets to enhance system reliability.
- Design and implement observability runbooks covering various monitoring metrics.
- Assist engineering teams in implementing observability tools and monitoring systems.
- Promote best practices and governance of observability across teams.
- Collaborate with SRE, DevOps, and engineering teams for seamless integration of observability practices.
Requirements
- Minimum 10 years of engineering experience, with at least 5 years in SRE or Observability roles.
- Demonstrated experience implementing observability solutions in cloud environments (AWS, Azure, GCP).
- Proficiency with observability tools such as Datadog, Grafana, Prometheus, and OpenTelemetry.
- Strong understanding of distributed systems, microservices, and container orchestration.
- Experience with automation tools such as Terraform and Ansible, along with CI/CD pipelines.
- Familiarity with performance engineering and telemetry-based insights.
- Proficiency in a programming or scripting language such as Python or Go.
- Knowledge of secure infrastructure practices, RBAC, and compliance requirements.