This role is being offered as hybrid working based at any of our core HQs.
We offer great flexible working opportunities at UKHSA and operate using a hybrid working model where business needs allow. This provides us with greater flexibility about how and where we work, to get the best from our workforce. As a hybrid worker, you will be expected to spend a minimum of 60% of your contractual working hours (approximately 3 days a week pro rata, averaged over a month) working at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, and London).
Our core HQ offices are modern and newly refurbished, with excellent city centre transport links, and benefit from co-location with other government departments such as the Department of Health and Social Care (DHSC).
Job Summary
The Digital and Data Directorate has primary responsibility for scientific and research computing services and support. The key functions of the Digital Development and Operations unit are to provide and support the platforms required by the staff of the UK Health Security Agency, and to provide the technical capabilities to enable public health services, both within the organisation and between the organisation and its customers and stakeholders.
As a Specialist Site Reliability Engineer (SRE) you will:
- Remediate infrastructure and operational problems
- Leverage automation and Continuous Integration/Continuous Delivery (CI/CD) to ensure our services run reliably, are scalable, and perform optimally
- Monitor and manage these aspects while taking responsibility for multiple cloud infrastructure services
- Observe systems to prioritise the operational service and performance improvements needed to meet or exceed Service Level Objectives (SLOs)
The role will be responsible to the Principal Specialist Engineer SRE and is part of the High Performance Computing, Site Reliability Engineering, Artificial Intelligence (HPC/SRE/AI) & Research Computing unit, whose remit is to:
- Architect, develop & manage multi-cloud HPC platforms and on-premise infrastructure
- Ensure services are highly available, scalable and resilient
- Manage performance, capability and capacity planning
- Support UKHSA's AI requirements
This role attracts a Market Pay Supplement of up to £5,000.
Working for your organisation
We pride ourselves on being an employer of choice, where Everyone Matters, promoting equality of opportunity and actively encouraging applications from everyone, including groups currently underrepresented in our workforce.
UKHSA's ethos is to be an inclusive organisation for all our staff and stakeholders, and to create, nurture and sustain an inclusive culture where differences drive innovative solutions to meet the needs of our workforce and wider communities. We do this by celebrating and protecting differences, removing barriers, and promoting equity and equality of opportunity for all.
Please visit our careers site for more information https://gov.uk/ukhsa/careers
Job Description
We are seeking a highly motivated and experienced SRE to join our HPC & SRE engineering team. As an SRE, you will play a critical role in ensuring the stability, scalability, and performance of our services. You will combine software engineering and systems engineering to build, improve and run reliable, scalable production systems.
Key Responsibilities
Service Reliability & Performance
- Ensure services are stable, scalable, and performant through engineering best practices and system design.
- Proactively identify and address system bottlenecks using advanced problem-solving and performance tuning techniques.
- Conduct capacity planning and implement solutions to ensure systems can support current and future workloads.
Incident Response & Troubleshooting
- Respond swiftly to production incidents, ensuring minimal downtime and quick restoration of services.
- Perform root cause analysis and postmortems, implementing lessons learned to prevent recurrence.
Monitoring, Alerting & Observability
- Contribute to the design and implementation of effective monitoring and alerting systems using tools and dashboards.
- Improve observability of services, ensuring issues are identified and addressed before impacting users.
- Continuously refine monitoring practices to reduce alert fatigue and improve response times.
Automation & Tooling
- Develop automation to eliminate manual, repetitive tasks and improve operational efficiency.
- Write clear, maintainable, and well-tested code to support automation efforts and system tooling.
- Drive initiatives to reduce operational toil and improve reliability through Infrastructure as Code (IaC).
Service Level Objectives & Operational Improvements
- Contribute to the definition, tracking, and continuous improvement of SLOs, Service Level Indicators (SLIs), and error budgets.
- Identify and prioritise operational improvements that align with business goals and user experience.
SRE Best Practices & Advocacy
- Help to evangelise SRE principles across the organisation.
- Collaborate with stakeholders to integrate reliability practices into the development lifecycle.
Collaboration & Knowledge Sharing
- Work closely with software engineering, DevOps, and infrastructure teams to streamline deployment and operational workflows.
- Improve cross-functional collaboration and promote a culture of shared responsibility for service reliability.
Documentation & Training
- Maintain accurate technical documentation, runbooks, and post-incident reports.
- Provide training and mentorship to engineering teams on best practices and tools.
Main duties of the job
- Ensure services are stable, scalable, performant and automated.
- Respond to incidents, troubleshoot issues, and restore services as quickly as possible.
- Prioritise operational service improvements to meet or exceed SLOs, minimising downtime.
- Ensure that effective monitoring/alerting is in place to proactively identify issues using tools and dashboards, reducing the time taken to respond to issues.
- Leverage automation to streamline tasks, reduce overhead on repeatable operations, reduce manual intervention and improve efficiency. Write code that is maintainable, clear, and concise.
- Optimise system performance using strong problem-solving skills to identify bottlenecks with an engineering mindset.
- Ensure systems can handle current and future workloads through automation and capacity planning.
- Continuously improve services through observability, and identify ways to improve observability practices.
- Follow SRE principles. Guide and educate stakeholders to adopt implemented principles.
- Provide technical documentation for engineers, and provide training where appropriate.
- Work closely with engineering and technology teams to improve operational processes, reduce manual tasks, ensure seamless collaboration/knowledge sharing, reduce risks and adapt to new ways of working.
This list is not exhaustive.
Person specification
Essential Criteria:
- Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer or similar role
- Coding skills in programming/scripting languages such as Python, PowerShell or Bash
- Understanding of Linux/Unix & Windows systems, networking, and distributed systems
- Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and alerting systems
- Understanding of infrastructure automation (e.g., Terraform, Ansible, PowerShell, Helm)
- Excellent communication and collaboration skills
- Experience with security best practices
- Problem-solving skills and the ability to respond to sudden, unexpected demands
Desirable Criteria:
- Experience with CI/CD pipelines, cloud platforms (e.g., Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure) and container orchestration (e.g., Kubernetes)
- Experience with post-incident reviews
- Previous involvement in driving adoption of SRE practices across an organisation
- Experience delivering training or mentoring junior engineers
Alongside your salary of £41,983, UK Health Security Agency contributes £12,162 towards you being a member of the Civil Service Defined Benefit Pension scheme. Find out what benefits a Civil Service Pension provides.
- Learning and development tailored to your role
- An environment with flexible working options
- A culture encouraging inclusion and diversity
- A Civil Service pension with an employer contribution of 28.97%
Selection process details
This vacancy is using Success Profiles and will assess your Behaviours, Experience and Technical Skills.
Stage 1: Application & Sift
Success profiles
You will be required to complete an application form. You will be assessed on the 8 essential criteria listed, by means of:
- Application form (‘Employer/Activity history’ section on the application)
- 1000-word Supporting Statement
This should outline how your skills, experience and knowledge provide evidence of your suitability for the role, with reference to the essential criteria.
The Application form and Supporting Statement will be marked together.
Please note you will not be able to upload your CV. You must complete the application form in as much detail as possible. Please do not email us your CV.
Longlisting:
In the event of a large number of applications we will longlist into 3 piles of:
- Meets all essential criteria
- Meets some essential criteria
- Meets no essential criteria
Those falling into the 'Meets all essential criteria' pile will progress to shortlisting.
Feedback will not be provided at this stage.
Shortlisting:
In the event of a large number of applications we will shortlist on the following essential criteria:
- Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer or similar role.
Desirable criteria may be used in the event of a large number of applications or a large number of successful candidates.
If you are successful at this stage, you will progress to interview.
Please do not exceed 1000 words. We will not consider any words over and above this number.
Feedback will not be provided at this stage.
Stage 2: Interview
Success Profiles
You will be invited to a remote interview.
Candidates will be required to complete a technical test and a presentation, and to pass the interview process successfully. This allows us to set the rate of the Market Pay Supplement (MPS) awarded.
Behaviours and Technical Skills will be tested at interview.
The behaviours tested during the interview stage will be:
- Changing and Improving – Lead Behaviour
- Making effective decisions
- Delivering at pace
- Working Together
You will also be expected to prepare and present a 5-minute presentation during the interview. This will be based on either:
- Designing a highly available and scalable service OR
- Automating a complex operational process
This will be decided and confirmed ahead of interviews.
There will also be a technical test during the interview, where you will be asked technical questions to test your knowledge. This will be based on:
- SRE principles
- Troubleshooting/incident management
- System design
- Automation/coding
- Knowledge of Linux & networking
- Cloud technologies
Interview dates are yet to be confirmed.
Once this job has closed, the job advert will no longer be available. You may want to save a copy for your records.
Eligibility Criteria
Open to all external applicants (anyone) from outside the Civil Service (including by definition internal applicants).
Future location
UKHSA is investing in a new state-of-the-art National Biosecurity Centre in Harlow, Essex, which will eventually bring together teams currently based at Canary Wharf, Colindale and Porton Down. For more details, please see: Huge biosecurity centre investment to boost pandemic protection - GOV.UK.
The new facilities will start becoming operational in the mid-2030s, with full completion by 2038. Staff will move in phases as facilities become available. If you're appointed to a role currently based at Canary Wharf, Colindale or Porton Down, please note that we'll continue investing in these sites for the next decade. As we get closer to the transition, we'll provide full information about relocation support available to staff.
Salary bands
National £41,983 to £48,128
Inner London £46,310 to £52,113
This role attracts a Market Pay Supplement of up to £5,000.
Please note: If you are successful at interview, and are moving from another government department, NHS, or Local Authority, the relevant starting salary principles for level transfers or promotions will apply. Otherwise, roles are offered at the pay scale minimum for the grade, but in exceptional circumstances there may be flexibility if you are able to demonstrate you are already in receipt of an existing, higher salary. Pay increases are through the relevant annual pay award for the role and terms.
Security Clearance Level Requirement
Successful candidates must pass a disclosure and barring security check.
Successful candidates must meet the security requirements before they can be appointed. The level of security needed is Basic Personnel Security Standard.
Reasonable Adjustments
The Civil Service is committed to making sure that our selection methods are fair to everyone. To help you during the recruitment process, we will consider any reasonable adjustments that could help you. An adjustment is a change to the recruitment process or an adjustment at work. This is separate to the Disability Confident Scheme. If you need an adjustment to be made at any point during the recruitment process you should contact the recruitment team in confidence as soon as possible to discuss your needs.
You can find out more information about reasonable adjustments across the Civil Service here: https://www.civil-service-careers.gov.uk/reasonable-adjustments/
International Police check
If you have spent more than 6 months abroad over the last 3 years you may need an International Police Check. This would not necessarily have to be in a single block, and it could be time accrued over that period.
Artificial Intelligence (AI)
Artificial Intelligence can be a useful tool to support your application. However, all examples and statements provided must be truthful, factually accurate and taken directly from your own experience. Where plagiarism has been identified (presenting the ideas and experiences of others, or generated by artificial intelligence, as your own), applications may be withdrawn and internal candidates may be subject to disciplinary action. Please see our candidate guidance for more information on appropriate and inappropriate use.
Link below:
Artificial intelligence and recruitment, Civil Service Careers
Internal Fraud check
If you are successful for this role, as one aspect of pre-employment screening, applicants' personal details (name, National Insurance number and date of birth) will be checked against the Cabinet Office Internal Fraud Hub, and anyone included on the database will be refused employment unless they can show exceptional circumstances. Currently this applies only to candidates external to the Civil Service.
Feedback will only be provided if you attend an interview or assessment.
Security
Successful candidates must undergo a criminal record check.
People working with government assets must complete baseline personnel security standard checks.
Nationality requirements
This job is broadly open to the following groups:
- UK nationals
- nationals of the Republic of Ireland
- nationals of Commonwealth countries who have the right to work in the UK
- nationals of the EU, Switzerland, Norway, Iceland or Liechtenstein and family members of those nationalities with settled or pre-settled status under the European Union Settlement Scheme (EUSS)
- nationals of the EU, Switzerland, Norway, Iceland or Liechtenstein and family members of those nationalities who have made a valid application for settled or pre-settled status under the European Union Settlement Scheme (EUSS)
- individuals with limited leave to remain or indefinite leave to remain who were eligible to apply for EUSS on or before 31 December 2020
- Turkish nationals, and certain family members of Turkish nationals, who have accrued the right to work in the Civil Service
Further information on nationality requirements
Working for the Civil Service
The Civil Service Code sets out the standards of behaviour expected of civil servants.
We recruit by merit on the basis of fair and open competition, as outlined in the Civil Service Commission's recruitment principles.
The Civil Service embraces diversity and promotes equal opportunities. As such, we run a Disability Confident Scheme (DCS) for candidates with disabilities who meet the minimum selection criteria.
Diversity and Inclusion
The Civil Service is committed to attracting, retaining and investing in talent wherever it is found. To learn more please see the Civil Service People Plan and the Civil Service Diversity and Inclusion Strategy.
The Civil Service welcomes applications from people who have recently left prison or have an unspent conviction. Read more about prison leaver recruitment.
Contact point for applicants
Job contact:
- Name: Carmelita Goodwin
- Email: Carmelita.Goodwin@ukhsa.gov.uk
Recruitment team
- Email: recruitment@ukhsa.gov.uk
Further information
The law requires that selection for appointment to the Civil Service is on merit on the basis of fair and open competition as outlined in the Civil Service Commission's Recruitment Principles.
If you feel your application has not been treated in accordance with the Recruitment Principles and you wish to make a complaint, you should in the first instance contact the UKHSA Public Accountability Unit via email: Complaints@ukhsa.gov.uk
If you are not satisfied with the response you receive from the Department, you can contact the Civil Service Commission via its website: https://civilservicecommission.independent.gov.uk