About the Role
We are seeking a highly skilled Platform Automation Engineer with a strong software engineering background to join our Site Reliability Engineering (SRE) team. This role is coding-heavy, focused on developing automation, building resilient services, and ensuring observability and reliability at scale. You will design, code, and deliver automation solutions in Python/Go, create and manage APIs, develop robust observability frameworks, and engineer automated backup and recovery systems. Working closely with Cloud and Application teams, you will drive operational excellence through automation and reliability practices.
Key Responsibilities
Automation & Engineering
Design, code, test, and deploy software to automate manual operational tasks.
Develop APIs and services that enhance reliability, scalability, and observability.
Build and manage software-based infrastructure components across cloud and hybrid environments.
Reliability & Incident Management
Troubleshoot priority incidents, lead post-mortems, and drive permanent resolutions.
Balance operational support with engineering initiatives for optimal efficiency.
Participate in rotational on-call support as needed.
Observability Engineering
Develop best-in-class monitoring frameworks with Prometheus, Grafana, CloudWatch, Azure Monitor, Honeycomb, or similar tools.
Implement noiseless alerting, end-to-end telemetry, and data-driven SLO improvements.
Create automated solutions for upgrades, release management, and change processes.
Backup & Recovery Automation
Engineer cloud-native automated backup and recovery pipelines.
Implement advanced data protection solutions including cyber vault isolation and code-driven recovery.
Safeguard data integrity across cloud and on-premises environments.
Collaboration & Leadership
Work closely with Cloud Centre of Excellence (CCoE) and Development teams across the lifecycle.
Coach and mentor team members; lead delivery on complex engineering tasks.
Contribute to a culture of continuous improvement, reliability, and resilience.
Skills & Experience Required
Software Engineering & Automation
Proficiency in Python and/or Go for automation and service development.
Strong experience with API design, development, and integration.
Hands-on expertise in automation frameworks, CI/CD, and Infrastructure as Code (e.g., Terraform).
SRE & Observability
Experience in designing monitoring and observability solutions (Prometheus, Grafana, CloudWatch, Azure Monitor, etc.).
Knowledge of performance monitoring, capacity management, and telemetry pipelines.
Exposure to system troubleshooting, stability engineering, and incident response.
Backup, Recovery & Data Protection
Proven experience in automated backup and recovery solutions in cloud environments.
Familiarity with data integrity, vaulting mechanisms, and code-driven resilience processes.
Cloud & Infrastructure
Experience with container orchestration, compute, storage, and network services in cloud platforms.
Understanding of security principles and identity technologies: SSO, Kerberos, LDAP, Active Directory, etc.
Business Awareness
Risk-aware mindset, with experience in production environments (financial services background a plus).
Strong understanding of high-availability and resilience principles.
What We're Looking For
A software-minded SRE who codes first, automates everything, and thrives on solving operational challenges through engineering.
Someone who can balance coding, automation, and reliability with hands-on incident troubleshooting.
A proactive engineer who drives improvements, shares knowledge, and helps shape a modern automation-driven SRE practice.