Manager of Site Reliability Engineering
Insight Global
Job Description
About the role: This is an advanced management-level role in which you will manage multiple SRE teams within a single product group. You will ensure teams work in alignment with the SRE framework, including leading sustainable incident response, blameless post-mortems, and production reliability improvement projects. You will mentor other team members on SRE practices and cultivate innovation and collaboration across multiple teams. You will manage delivery of, and may provide input to, strategy and departmental plans.
About the team: This role is part of the Business Systems SRE team. As an SRE Manager, you will act as a technical and strategic leader, partnering with engineering and business stakeholders to drive cloud reliability, automation, observability, and performance initiatives across critical platforms. This role combines technical depth with managerial acumen, including leading Proof-of-Concept (PoC) initiatives, guiding teams, and aligning SRE outcomes with leadership expectations and business goals.
Responsibilities:
• Managing high-performance SRE teams, ideally in multiple countries. We are not looking for an individual contributor.
• Promoting and implementing Site Reliability Engineering best practices and principles across product and platform teams
• Architecting, implementing, and managing infrastructure using Infrastructure as Code (IaC) and DevOps principles
• Designing and maintaining secure-by-default cloud-native systems with a focus on continuous improvement of security posture
• Defining and enforcing SLA/SLI/SLO standards for production systems
• Developing and maintaining automated frameworks for provisioning, deployment, scaling, and monitoring
• Conducting in-depth troubleshooting of complex production issues across application, infrastructure, and network layers
• Leading proof-of-concept efforts to evaluate and introduce new technologies
• Implementing policy and compliance checks within CI/CD pipelines
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
Essential Skills & Experience:
• Current and extensive experience managing teams of SREs. We are not looking to hire an individual contributor in this role.
• Proficiency with at least one major public cloud provider (Azure or AWS)
• Extensive experience with Terraform, Ansible, and other IaC/orchestration tools
• Expertise in Kubernetes (AKS/EKS/GKE), containerized workloads, and deployment strategies (e.g., blue-green)
• Deep knowledge of Linux and Windows server environments
• Proven experience in building and enforcing automation frameworks for CI/CD and infrastructure provisioning
• Hands-on experience with observability platforms such as Grafana, Kibana, Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), OpenTelemetry, Prometheus, Loki
• Strong knowledge of SLAs, SLIs, and SLOs and their application in production environments
• Experience with monitoring, alerting, and logging best practices
• Solid understanding of cloud-native security, identity management, and secrets management (e.g., HashiCorp Vault)
• Skilled in scripting and programming (e.g., Python, Bash, Golang, PowerShell, C#)
• Strong knowledge of networking, application performance tuning, and troubleshooting
• Familiarity with common CI/CD and version control tools (e.g., Git, GitLab, GitHub, Jenkins)