Senior Site Reliability Engineer (Closed/Filled)

Location: London
Discipline: Site Reliability Engineer, DevOps, SRE and Platform Engineering
Published: over 1 year ago

Are you fascinated by big numbers and driving performance metrics?

We are working with an online travel booking platform that is obsessed with its numbers and is looking for a Senior Site Reliability Engineer with the same fascination for improving them even further.

Where are they on their journey? They have a million website visitors daily, over 100 microservices, and a p95 search latency of 150ms; they capture 1TB of logs on Loki and take 350,000 samples a second. They also deploy to production 500 times a month. Sexy stuff, right? 

What is the role?

In this newly created role, the Senior Site Reliability Engineer (SRE) will contribute to building reliable, performant, auto-scalable, highly available and low-latency systems with the support of the existing Platform Infrastructure team.

So, how can you help? By balancing reliability with feature delivery: developing tools, improving alerting and observability, and exposing slow-running code paths.

Key areas of focus will be the ongoing evolution of SRE best practices such as incident management, blameless postmortems, and Service Level Objectives (SLOs) and their error budgets.
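For anyone newer to error budgets, the arithmetic behind them is simple. The sketch below is illustrative only: the 99.9% target, the 30-day window and the request counts are assumptions made up for the example, not figures from the platform, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Assumed example: a 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# Assumed example: 250 failures out of 1,000,000 requests against a 99.9% SLO.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

When the remaining budget is healthy, teams can ship features faster; when it is close to zero, reliability work takes priority.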

By improving their performance testing, you'll also help ensure the services can withstand ten times the peak load, which is the goal.

What skills would you need to do this role?
Key skills for this role include using tools like Java Flight Recorder or Go's pprof to expose slow-running code paths in critical applications, improving performance testing, shortening discovery and recovery time, and exposing system weaknesses with Chaos Engineering.

Your role will be to help engineering teams own the entire service lifecycle, from the first commit to high-load operation in production. You will work with teams to build tools, identify bottlenecks, improve observability and alerting, and spot slow-running code paths in critical applications that are causing lag.

What tech will you use?

The cloud is GCP and Terraform handles Infrastructure as Code, plus some favourites from Grafana Labs (Loki, Mimir, Tempo) and good old Kubernetes and Prometheus.

Where will you work?

This team is big on in-person collaboration, and the role is based in West London, with two days a week (Tuesday and Wednesday) spent in the office with the Platform Infrastructure and Developer teams. 

Want to know more? 

All you need to do is send us your CV. If you don't have one ready, send a PDF of your LinkedIn profile. If you are not on LinkedIn, start a Word document and write a few lines about what you are working on now and why you think this might be interesting. We don't care how you get in touch; we just hope you do!