Job Description
The Cloud Operations SRE teams are responsible for the performance and operations of Sage’s online product portfolio, ensuring products and services remain available, secure and performant. This role combines technical leadership and people leadership, working on some of our strategic products and services.
We are seeking a strategic and hands-on Site Reliability Engineering (SRE) Team Leader to take our infrastructure operations team from a reactive, operations-centric model to a proactive, automation-first reliability engineering culture. This is a critical leadership role for someone with a passion for reliability, observability, infrastructure as code, and driving operational excellence through engineering.
You will lead and mentor a team of SREs and Ops engineers, enabling them to grow into true reliability engineers. You’ll partner with application engineering, product, and security teams to design, scale, and operate systems that are highly available, resilient, and self-healing.
This is a hybrid role – three days per week in our Newcastle office.
What You'll Do:
Team Transformation & Leadership
- Lead the cultural and technical shift from traditional operations to a modern SRE approach, further cementing this team as a trailblazer and example of best practice in the business.
- Define and drive the adoption of SRE principles: SLIs/SLOs, error budgets, blameless postmortems, automation, and toil reduction.
- Coach and mentor team members in software engineering best practices, infrastructure as code, CI/CD, and cloud-native tooling.
Reliability & Automation
- Architect and improve system reliability through automated monitoring, alerting, testing, and remediation.
- Drive the reduction of manual processes (toil) by introducing scalable and sustainable automation.
- Establish and enforce performance, reliability, and availability goals.
Collaboration & Partnership
- Act as the liaison between SRE, application development, product, and leadership teams to ensure reliability is embedded across the stack.
- Foster a strong DevOps culture by collaborating on deployment pipelines, incident response processes, and observability enhancements.
- Guide the team in adopting “you build it, you run it” principles across engineering.
- Define "what good looks like" and hold the team to account for delivering it.
- Advocate for the team to senior leadership and stakeholders. We're going to do great things; others need to know about it and seek to emulate it in their teams.
Operational Excellence
- Lead incident response and root cause analysis with a focus on learning and continuous improvement.
- Evolve incident management and on-call practices to align with modern SRE standards.
- Measure and report on operational metrics, including system uptime, MTTR, and error rates.
Technical Background
- Our services run in both AWS and Azure, so an understanding of at least one of these is essential.
- We run services on both Windows and Linux, so you must be comfortable working in both environments. We're not expecting you to be an expert in either, but you will need a basic understanding of both and the skills to troubleshoot them.
- We run our services on a mixture of VMs and containers with the direction of travel being towards the latter. You will need to work with both.
- Coding should feature somewhere in your background. We write tools and automations for ourselves and others; you will oversee their development (and you'll probably contribute to it as well).
- An understanding of the considerations and limitations of SQL Server and Redis would be beneficial, but is not essential.