About This Role
We’re looking for an SRE with strong Kafka experience and a deep understanding of SRE best practices. You’ll combine hands‑on technical improvements with the ability to delegate work effectively to EventBus developers.
You’ll collaborate closely with the EventBus, Kafka, Telemetry, and Incident Response teams, while also working independently to improve monitoring, reduce noise, strengthen alerting, and track remediation progress.
This role sits at the centre of a global platform used by hundreds of developers. You’ll join a fast‑growing, experienced SRE group based in Edinburgh.
About The Team
The Aladdin EventBus is built on Kafka and enables teams to publish and subscribe to distributed events in near real time. As part of the Aladdin Graph group, a core Platform Engineering function, the EventBus team supports developers across the firm in designing, building, and operating event‑driven and API‑based systems.
EventBus is now a critical dependency for key applications, including our release system and API infrastructure. This drives a high bar for availability, incident responsiveness, and operational excellence. The SRE function supports this by improving observability, streamlining incident processes, and identifying gaps whose resolution would meaningfully improve platform reliability.
Key Responsibilities
As the SRE for EventBus, you will drive stability, resiliency, and observability through:
- Staying informed on all EventBus incidents, including impact, root cause, detection, and ongoing remediation
- Responding to incidents calmly and efficiently, communicating clearly with reporters and partner teams, and recommending remediations based on urgency and impact
- Proposing improvements informed by prior incidents, potential risks, and industry standards, such as new metrics, SLOs, and fallback mechanisms
- Leading incident retrospectives and sharing insights with the wider team