The Site Reliability Engineer (SRE) plays a critical role in keeping our AI-driven, cloud-native platform reliable, observable, secure, and able to scale with the organisation’s growth. As we adopt intelligent agents, autonomous workflows, and increasingly complex distributed systems, the SRE ensures that resilience, performance, and operational excellence are built into everything we deliver. By partnering closely with Engineers, Architects, and the Engineering Manager, the SRE defines the patterns, tooling, and automation that enable fast, safe, and repeatable deployments.
This role safeguards our production environment, drives continuous improvement across CI/CD and observability, and establishes the reliability practices that empower autonomous squads to move quickly without compromising stability. The SRE is essential to maintaining customer trust, supporting AI-first innovation, and ensuring our platform remains robust, secure, and highly available at scale.
In this position, the SRE ensures the reliability, scalability, and security of our engineering systems. Working closely with the Engineering Manager and Head of Engineering, the SRE identifies priorities that remove friction for engineering teams, streamline processes, and enhance operational excellence. This role combines software engineering principles with systems administration to deliver robust, automated, cost-effective, and secure-by-design solutions.
Key Responsibilities
Reliability, Performance & Security:
- Design and implement strategies to improve system reliability, availability, and security.
- Ensure all solutions follow secure-by-design principles, incorporating cybersecurity best practices from inception through deployment.
- Conduct regular security reviews and collaborate with security teams to address vulnerabilities.
CI/CD Management:
- Own and optimise Continuous Integration and Continuous Deployment pipelines.
- Embed security checks (e.g., static analysis, dependency scanning) into CI/CD workflows.
- Ensure secure, efficient, and automated deployment processes across environments.
Monitoring & Observability:
- Implement and maintain monitoring solutions for infrastructure and applications.
- Develop dashboards and alerting that surface incidents and security events early, so they can be managed proactively.
- Evaluate and integrate new observability tools as needed.
Automation & Tooling:
- Automate repetitive tasks to improve efficiency and reduce human error.
- Build and maintain internal tools that support engineering productivity and security compliance.
- Champion Infrastructure as Code (IaC) practices using tools like Terraform or Azure Resource Manager (ARM) templates.
Cloud Infrastructure Management:
- Manage and optimise services across AWS and Azure environments.
- Ensure scalability, resilience, and security of service-based architectures.
- Implement cost management strategies to optimise cloud spend without compromising performance or security.
Incident Response & Root Cause Analysis:
- Lead incident response efforts, including security incidents, and conduct post-mortem reviews.
- Drive continuous improvement through lessons learned and preventive measures.