Principal Engineer (Site Reliability / SRE)

at Genomics England

Contractual

Job Description

Are you passionate about helping to mature technical practices and empowering teams to run resilient and reliable services? At Genomics England we are looking for a Principal Site Reliability Engineer to help lead and refocus our small SRE capability, ultimately growing it to become thought leaders in system reliability across the organisation.

About the Role

As the Principal Site Reliability Engineer, you will be a hands-on contributor with exemplary platform engineering skills, who also has an ability to lead and think strategically to identify and prioritise areas with the greatest impact. You may have previously worked in a variety of roles and job titles within engineering, and you may have worked in a variety of organisational contexts. Whatever your past experience, it will have given you a deep understanding of SRE principles and practices and how these are used to build and operate reliable services that exceed customer expectations.

You will be a problem-solver who identifies risks, issues, gaps, and dependencies and brings people together to find solutions. You will do this through your supportive, empathetic and collaborative behaviours - acting as coach, mentor, guide or constructive questioner as the situation demands. You are pragmatic but also mindful of the big picture - consciously balancing the immediate goals of teams against the long-term direction for their products. You will be a great communicator, comfortable not just leading your own team, but also engaging across the engineering community and with non-technical stakeholders.

About the Tech Stack

The SRE team will support squads that run a variety of services: most of these are either user-facing web applications (React), backend APIs (Python), bioinformatics pipelines (NextFlow), or data ETL workflows (Prefect, Dremio).
These services increasingly run in AWS, though there is still a significant on-premise presence, and they run in a mixture of compute environments, from ECS/Fargate to HPC clusters to (occasionally) Kubernetes. Within the SDLC we have a standard toolchain which includes Terraform for infrastructure-as-code, GitLab for source code and CI/CD, Artifactory for software artefacts, and DataDog for observability. We are working to become interoperable with the wider NHS via open standards like FHIR and GA4GH APIs and increasingly aiming to integrate with their own API Management platform.

Job Description

- Assess the current state of human and technical processes and practices from a site reliability perspective, and identify the areas of greatest concern
- Develop a roadmap of initiatives aimed at raising the maturity of live services through SRE practices (SLOs, Critical User Journeys, monitoring with Golden Signals, release-engineering etc.)
- Grow and lead a small team of Site Reliability Engineers to roll out those initiatives across the organisation through partnership with product squads
- Work with key stakeholders to establish and embed key SRE principles (e.g. Error Budget Policies)
- Maintain close contact with Engineering leadership and other "enabling" teams (Test Enablement, Developer Platform) to ensure that SRE work is fully aligned with our direction of travel
- Stay abreast of emerging technologies and industry trends, and incorporate them into our software development practices
- Contribute to the wider conversation at Genomics England and help mature our technical practices

Qualifications

While we recognise the value of relevant qualifications or certifications, we are primarily interested in your real-world experience.
Essential Skills and Experience

- Comprehensive knowledge of SRE principles and practices with significant experience of applying these to real-world situations
- Excellent software engineering skills especially in the context of release automation and other toil-eliminating activities (Python preferred, polyglot ideal)
- Strong understanding of how architecture and other factors contribute to the overall resilience of systems
- Extensive experience of platform engineering across CI/CD, Infrastructure as Code, operational monitoring and alerting, backup and recovery etc.
- Experience in at least one major public cloud (AWS preferred but not essential)
- Demonstrable ability to lead teams, manage, direct, mentor and plan work
- Strong interpersonal skills with a temperament that builds trust and connection within and across squads through open, honest communication
- Comfortable engaging responsively with teams both remotely and in person when required
- Ability to navigate rapidly to effective solutions through engaged and inclusive listening, clarity of thought, clear documentation, and succinct presentation

Desirable Experience

These skills are not essential but if you have either of them they may prove to be useful:

- Background in healthcare or bioinformatics
- Experience in regulated environments

If you’re an experienced Site Reliability Engineering leader who thrives on working collaboratively to mature engineering practices, we’d love to hear from you. Join us at Genomics England and make a meaningful impact in the world of genomics.

Salary

Competitive

Project basis

Remote Job

Worldwide

Job Overview

Job Posted:

2 days ago

Job Expire:

1w 4d

Job Type

Contractual

Job Role

Any

Education

Any

Experience

Any

Location

United Kingdom
