Senior AWS Engineer – Observability & Monitoring (CloudWatch | Python | CI/CD)
UST Global Inc
We are seeking an experienced Senior AWS Engineer to design, implement, and optimize end-to-end observability solutions across multiple AWS-based systems. The ideal candidate will have deep expertise in AWS CloudWatch, AWS X-Ray, Lambda monitoring, and infrastructure-as-code practices, combined with a strong understanding of service-level objectives and agreements (SLOs/SLAs) and alerting integrations such as OpsGenie or PagerDuty.
This role is pivotal in shaping our observability strategy and mentoring junior engineers in building reliable, data-driven monitoring systems.
Key Responsibilities
Design and implement a comprehensive CloudWatch alarm and monitoring strategy across 13 AWS solutions.
Define and operationalize SLOs/SLAs for critical business workflows and services.
Instrument application code with custom metrics using Python and Embedded Metric Format (EMF) where visibility gaps exist.
Validate and enhance the AWS X-Ray implementation to ensure full traceability and performance insight.
Create and maintain alerting and escalation workflows using OpsGenie (or PagerDuty).
Optimize existing CloudWatch dashboards (e.g., doc-manager, state-machine) and build new dashboards for the remaining solutions.
Employ CloudFormation (or equivalent IaC tools) to manage and automate observability resources.
Mentor junior engineers, promoting best practices in observability, monitoring automation, and performance analysis.
Required Skills & Expertise
Advanced proficiency with AWS CloudWatch (alarms, dashboards, metrics, Logs Insights).
Strong understanding of AWS Lambda monitoring and optimization techniques.
Hands-on experience with AWS X-Ray for distributed tracing and performance debugging.
Expertise in CloudFormation or other Infrastructure-as-Code (IaC) frameworks for observability infrastructure.
Proficiency in Python, especially for custom metric instrumentation and automation.
Knowledge of EMF (Embedded Metric Format) for structured metric data.
Proven experience defining and implementing SLOs/SLAs.
Familiarity with OpsGenie or PagerDuty for alerting and incident management workflows.