DE, USA
8 days ago
Director, Site Reliability Engineering (SRE) Shared Services Leader
Job Description

Overview

We are seeking a passionate, innovative, and experienced Director of Site Reliability Engineering (SRE) Shared Services to lead and shape the future of our reliability engineering practices. In this role, you will drive the organization’s initiatives to ensure the scalability, performance, and reliability of our technology systems, while fostering a shared-services approach that serves the needs of development and infrastructure teams across the enterprise.

This is an incredible opportunity to collaborate with a diverse, forward-thinking team while championing growth, inclusivity, and operational excellence. If you’re a strong leader with a proven track record in SRE and a dedication to empowering others through shared services, we encourage you to apply.

We are an organization that values diversity, equity, and inclusion in all its forms, and we are committed to building a team that reflects our core belief that all employees deserve a supportive, accessible, and opportunity-rich working environment.

Responsibilities

Leadership & Strategy: Drive and develop strategic initiatives for SRE shared services that enhance reliability, availability, and performance across all platforms and services. Serve as a key leader in the organization to foster collaboration and alignment around SRE goals.
Infrastructure Excellence: Design, build, and manage resilient shared infrastructure systems while automating processes to improve system reliability, scalability, and operational efficiency.
Collaboration & Enablement: Act as a cross-functional leader to align engineering, IT, and product teams by providing tools, resources, and platforms that standardize reliability practices organization-wide.
Incident Response & Prevention: Oversee and refine incident response systems, offering best-in-class solutions to mitigate risks and minimize downtime through root cause analysis, runbooks, and response optimization.
Talent Management & Development: Build and lead a diverse, high-performing SRE team. Foster an inclusive, growth-oriented culture by mentoring team members, driving innovation, and promoting ongoing professional development.
Continuous Improvement: Develop industry-leading metrics to measure system health and uptime. Identify opportunities to improve processes, deploy resources thoughtfully, and optimize performance through automation and feedback loops.
Tech Roadmap: Collaborate with executive leadership to define and execute the SRE roadmap and technology vision, ensuring world-class operational standards and user experiences.

Qualifications

Basic Qualifications:

Proven experience (10+ years) in Site Reliability Engineering, DevOps, or Infrastructure Engineering, with at least 5 years in leadership or managerial roles.
Strong expertise in site management, field engineering, and system performance monitoring across scalable systems and platforms.
Advanced technical experience in development, automation, and cloud computing (e.g., AWS, Azure, GCP).
Demonstrable track record in leading enterprise-wide shared services initiatives and delivering quantifiable outcomes.
Strong understanding of SRE principles, frameworks, and best practices, including SLAs, SLOs, and error budgets.
Excellent communication and collaboration skills, with the ability to work across technical and non-technical teams at all levels.
Solid ability to handle high-pressure situations, prioritize rapidly evolving needs, and make data-driven decisions.

Preferred Qualifications:

Advanced degree (MBA/MS) in Computer Science, Engineering, or a related field.
Experience driving digital transformation in large and complex organizations.
Familiarity with AIOps, Machine Learning tools, and next-generation monitoring systems to improve reliability and predict performance issues.
Deep commitment to fostering an inclusive workplace and a demonstrated ability to drive cultural change.

Day-to-day

Lead daily stand-ups and regular scrums across the SRE function to ensure objectives and deliverables are on track.
Collaborate with development, product, and IT teams to establish achievable reliability targets.
Iterate on shared-services platforms to improve onboarding and enablement for internal teams.
Monitor system reliability with tools like Prometheus, Grafana, or Splunk, and proactively address bottlenecks or vulnerabilities.
Represent the SRE function in executive leadership meetings, providing key data insights and updates on performance benchmarks and challenges.
Develop and implement training and knowledge-sharing initiatives to empower internal teams to maximize the effectiveness of shared services.
Mentor and coach team members, fostering a culture of accountability, inclusivity, and technological curiosity.

We are committed to recruiting a diverse workforce and providing equal opportunities for all applicants, regardless of race, ethnicity, gender identity, disability, sexual orientation, or background. We encourage candidates with all kinds of experience to apply and bring their unique perspectives to our team!

Ready to join our mission-driven team? Apply today!
