Job Description
Are you passionate about helping to mature technical practices and empowering teams to run resilient and reliable services? At Genomics England we are looking for a Principal Site Reliability Engineer to help lead and refocus our small SRE capability, ultimately growing it to become thought-leaders in system reliability across the organisation.
About the Role
As the Principal Site Reliability Engineer, you will be a hands-on contributor with exemplary platform engineering skills, who can also lead and think strategically to identify and prioritise the areas with the greatest impact. You may have previously worked in a variety of roles and job titles within engineering, and in a variety of organisational contexts. Whatever your past experience, it will have given you a deep understanding of SRE principles and practices and how these are used to build and operate reliable services that exceed customer expectations.
You will be a problem-solver who identifies risks, issues, gaps, and dependencies and brings people together to find solutions. You will do this through your supportive, empathetic and collaborative behaviours - acting as coach, mentor, guide or constructive questioner as the situation demands. You are pragmatic but also mindful of the big picture - consciously balancing the immediate goals of teams against the long-term direction for their products. You will be a great communicator, comfortable not just leading your own team, but also engaging across the engineering community and with non-technical stakeholders.
About the Tech Stack
The SRE team will support squads that run a variety of services: most of these are either user-facing web applications (React), backend APIs (Python), bioinformatics pipelines (Nextflow), or data ETL workflows (Prefect, Dremio). These services increasingly run in AWS, though there is still a significant on-premises presence, and they run in a mixture of compute environments, from ECS/Fargate to HPC clusters to (occasionally) Kubernetes.
Within the SDLC we have a standard toolchain which includes Terraform for infrastructure-as-code, GitLab for source code and CI/CD, Artifactory for software artefacts, and Datadog for observability. We are working to become interoperable with the wider NHS via open standards like FHIR and GA4GH APIs, and are increasingly aiming to integrate with the NHS's own API Management platform.
Responsibilities
Assess the current state of human and technical processes and practices from a site reliability perspective, and identify the areas of greatest concern
Develop a roadmap of initiatives aimed at raising the maturity of live services through SRE practices (SLOs, Critical User Journeys, monitoring with Golden Signals, release-engineering etc.)
Grow and lead a small team of Site Reliability Engineers to roll out those initiatives across the organisation through partnership with product squads
Work with key stakeholders to establish and embed key SRE principles (e.g. Error Budget Policies)
Maintain close contact with Engineering leadership and other "enabling" teams (Test Enablement, Developer Platform) to ensure that SRE work is fully aligned with our direction of travel
Stay abreast of emerging technologies and industry trends, and incorporate them into our software development practices
Contribute to the wider conversation at Genomics England and help mature our technical practices
Qualifications
While we recognise the value of relevant qualifications or certifications, we are primarily interested in your real-world experience.
Essential Skills and Experience
Comprehensive knowledge of SRE principles and practices with significant experience of applying these to real-world situations
Excellent software engineering skills, especially in the context of release automation and other toil-eliminating activities (Python preferred, polyglot ideal)
Strong understanding of how architecture and other factors contribute to the overall resilience of systems
Extensive experience of platform engineering across CI/CD, Infrastructure as Code, operational monitoring and alerting, backup and recovery etc.
Experience in at least one major public cloud (AWS preferred but not essential)
Demonstrable ability to lead and mentor teams, and to plan, manage and direct their work
Strong interpersonal skills with a temperament that builds trust and connection within and across squads through open, honest communication
Comfortable engaging responsively with teams both remotely and in person when required
Ability to navigate rapidly to effective solutions through engaged and inclusive listening, clarity of thought, clear documentation, and succinct presentation
Desirable Experience
The following are not essential, but either may prove useful:
Background in healthcare or bioinformatics
Experience in regulated environments
If you’re an experienced Site Reliability Engineering leader who thrives on working collaboratively to mature engineering practices, we’d love to hear from you. Join us at Genomics England and make a meaningful impact in the world of genomics.