We are looking for a Site Reliability Engineer to join us at Thales and work with our Payment Solutions. The Site Reliability Engineer empowers product, delivery, and SRE teams to implement a holistic observability approach across AWS and GCP. We design observability standards, build reusable frameworks, and partner with teams to achieve end-to-end visibility, from Node.js and Java services to business outcomes. Our mission: make service performance measurable, detect incidents proactively, and accelerate investigations with trustworthy telemetry.
A day in the life of an SRE:
Build and maintain observability frameworks for AWS/GCP:
Create reusable Datadog instrumentation for Node.js and Java
Provide auto-instrumentation templates and enforce observability quality standards
Publish Terraform modules for Datadog resources and cloud integrations
Own Datadog dashboards and measurement standards:
Define and curate source-of-truth dashboards and KPIs
Establish golden signals and semantic conventions across services
Manage observability-as-code repos in GitLab
Improve monitoring, alerting, and incident readiness:
Design precise, low-noise Datadog monitors and routing
Implement synthetics for critical flows and correlate with traces/logs
Partner with SREs on SLOs, error budgets, and incident triggers
Drive continuous learning and adoption:
Turn post-incident learnings into improved monitors, dashboards, and CI/CD checks
Deliver training, documentation, and hands-on support for developers and SREs
Consult, enable, and optimize:
Coach teams on instrumentation and APM best practices
Strengthen AWS/GCP observability integrations and tagging strategy
Optimize Datadog cost, sampling, retention, and cardinality; rationalize monitors
Typical interactions:
SRE: alert quality, troubleshooting, SLOs, post-incident reviews
Product/Dev: instrumentation, trace propagation, business KPIs
Platform/Infra: cloud integrations, Terraform, RBAC, cost/performance
Security/Compliance: telemetry governance, PII controls, retention policies
Leadership: service health roll-ups, reliability and adoption metrics
Skills & experience:
Strong engineering background in Node.js and/or Java (Datadog dd-trace, async context propagation, middleware patterns)
Cloud expertise in AWS — serverless, containers, managed services, and integrating cloud telemetry with Datadog
Automation skills with GitLab CI/CD and Terraform (Datadog resources, modules, workflows)
Datadog proficiency — APM, logs, metrics, synthetics, monitors, SLOs, and observability-as-code practices
Observability mindset — defining SLIs/SLOs, improving alert quality, and supporting the full incident lifecycle
Strong communication skills — clear documentation, training delivery, and confident English communication with distributed teams
At Thales we provide CAREERS and not only jobs. With Thales employing 80,000 people in 68 countries, our mobility policy enables thousands of employees each year to develop their careers at home and abroad, in their existing areas of expertise or by branching out into new fields. Together we believe that embracing flexibility is a smarter way of working. Great journeys start here. Apply now!