Hyderabad, Telangana, India
15 hours ago
Associate Manager SRE
Overview

We are seeking a self-driven, inquisitive, and curious Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational excellence and automation.

You will be part of a global SRE practice supporting a portfolio of 260+ modern cloud-native applications across consumer, commercial, supply chain, and enablement functions. Your mission: prevent incidents before they occur, ensure rapid recovery when they do, and build scalable systems that evolve with our growing business.

Responsibilities

Champion reliability, observability, and operational excellence across mission-critical applications.
Develop and maintain service-level indicators (SLIs), objectives (SLOs), and error budgets to measure and improve system performance.
Implement automated monitoring, alerting, and recovery mechanisms to reduce manual intervention and improve response times.
Collaborate closely with software engineering, platform, and operations teams to embed SRE practices across the development lifecycle.
Lead and participate in incident response, root cause analysis, and postmortem reviews to drive long-term improvements.
Identify and eliminate sources of toil through automation, tooling, and process refinement.
Continuously improve resiliency design, capacity planning, and release management in production systems.
Influence engineering teams with best practices on cloud-native architecture, observability, and deployment strategies.

Qualifications

Required Skills:
5+ years of experience in production engineering, DevOps, or SRE roles.
Strong foundation in Linux systems, networking, and cloud platforms (Azure, AWS, or GCP).
Hands-on experience with observability tools (e.g., AppDynamics, Prometheus, Grafana, ELK, FullStory).
Proficiency in scripting or programming (e.g., Python, Bash, Go) and automation frameworks (e.g., Ansible, Terraform).
Deep understanding of CI/CD pipelines, release strategies, and deployment automation.
Experience in managing high-scale, distributed systems in cloud-native environments.
Strong analytical skills and a passion for continuous improvement.

Preferred Skills:
Familiarity with microservices, Kubernetes, containers, and service mesh architecture.
Exposure to incident and problem management frameworks (e.g., ITIL, RCA practices).
Experience working in global teams supporting mission-critical applications.