Platform Automation Engineer

Location: Ireland

Work Type: Permanent / Full Time

Job Ref: 1230982

Platform Automation Engineer (SRE - Coding Heavy)
Location: Dublin (Hybrid)

About the Role
We are seeking a highly skilled Platform Automation Engineer with a strong software engineering background to join our Site Reliability Engineering (SRE) team. This role is coding-heavy, focused on developing automation, building resilient services, and ensuring observability and reliability at scale.
You will design, code, and deliver automation solutions in Python/Go, create and manage APIs, develop robust observability frameworks, and engineer automated backup and recovery systems. Working closely with Cloud and Application teams, you will drive operational excellence through automation and reliability practices.

Key Responsibilities

Automation & Engineering
- Design, code, test, and deploy software to automate manual operational tasks.
- Develop APIs and services that enhance reliability, scalability, and observability.
- Build and manage software-based infrastructure components across cloud and hybrid environments.
Reliability & Incident Management
- Troubleshoot high-priority incidents, lead post-mortems, and drive permanent resolutions.
- Balance operational support with engineering initiatives for optimal efficiency.
- Participate in rotational on-call support as needed.
Observability Engineering
- Develop best-in-class monitoring frameworks with Prometheus, Grafana, CloudWatch, Azure Monitor, Honeycomb or similar.
- Implement low-noise alerting, end-to-end telemetry, and data-driven SLO improvements.
- Create automated solutions for upgrades, release management, and change processes.
Backup & Recovery Automation
- Engineer cloud-native automated backup and recovery pipelines.
- Implement advanced data protection solutions including cyber vault isolation and code-driven recovery.
- Safeguard data integrity across cloud and on-premises environments.
Collaboration & Leadership
- Work closely with Cloud Centre of Excellence (CCoE) and Development teams across the lifecycle.
- Coach and mentor team members; lead delivery on complex engineering tasks.
- Contribute to a culture of continuous improvement, reliability, and resilience.

Skills & Experience Required

Software Engineering & Automation
- Proficiency in Python and/or Go for automation and service development.
- Strong experience with API design, development, and integration.
- Hands-on expertise in automation frameworks, CI/CD, and Infrastructure as Code (e.g., Terraform).
SRE & Observability
- Experience in designing monitoring and observability solutions (Prometheus, Grafana, CloudWatch, Azure Monitor, etc.).
- Knowledge of performance monitoring, capacity management, and telemetry pipelines.
- Exposure to system troubleshooting, stability engineering, and incident response.
Backup, Recovery & Data Protection
- Proven experience in automated backup and recovery solutions in cloud environments.
- Familiarity with data integrity, vaulting mechanisms, and code-driven resilience processes.
Cloud & Infrastructure
- Experience with container orchestration, compute, storage, and network services on cloud platforms.
- Understanding of authentication and directory security technologies: SSO, Kerberos, LDAP, Active Directory, etc.
Business Awareness
- Risk-aware mindset, with experience in production environments (financial services background a plus).
- Strong understanding of high-availability and resilience principles.

What We're Looking For

- A software-minded SRE who codes first, automates everything, and thrives on solving operational challenges through engineering.
- Someone who can balance coding, automation, and reliability with hands-on incident troubleshooting.
- A proactive engineer who drives improvements, shares knowledge, and helps shape a modern automation-driven SRE practice.