Location: Sunnyvale / Bentonville
Department: Reliability Engineering / Business Reliability Engineering (BRE)
Reports To: VP of Reliability Engineering & Operations
Position Summary
Walmart Global Tech is redefining enterprise resilience and operational continuity at an unprecedented scale. We are seeking a Senior Director, Agentic AI to lead the development and
deployment of intelligent, automated resilience systems that ensure uninterrupted operations across our global omnichannel ecosystem. From safeguarding multi-billion-dollar revenue streams to protecting essential
services for millions of customers, this role will be critical in advancing Walmart's mission to help people save money and live better—through bulletproof operational resilience and proactive disaster recovery capabilities.
This role will be critical in ensuring zero downtime for mission-critical services, advancing AI-driven recovery automation, and redefining how associates, customers, and systems interact across Walmart’s omnichannel ecosystem. From One Click DR Certification to goal-driven AI agents for predictive recovery and operational continuity, you will shape the backbone of Walmart’s global resilience strategy.
What You’ll Do
Strategic Vision & Leadership
• Define Walmart's enterprise-wide strategy and roadmap for Business Continuity and
Disaster Recovery—automated systems capable of predictive failure detection, intelligent
failover orchestration, and multi-region recovery execution across all business domains
• Partner with product, infrastructure, engineering, and operations teams to build end-to-
end resilience solutions that power continuous availability at Fortune 50 scale
• Lead cross-functional initiatives spanning multiple organizations, including Data Systems,
Application, Distributed Systems, Cloud, and Infrastructure engineering teams
• Build a unified roadmap that integrates automated disaster recovery certification with
agentic AI-driven monitoring, prediction, and failover orchestration.
• Define and ensure every critical system meets strict Recovery Time Objective (RTO) and
Recovery Point Objective (RPO) standards.
Technology Execution
• Build and Lead SRE SWAT Team
o Lead Walmart’s elite SRE SWAT Team, a battle-ready force specializing in rapid
response to large-scale incidents, with a relentless focus on real-time mitigation to
protect multi-billion-dollar revenue streams.
o Conduct mock simulations, audits, and readiness drills to validate disaster
recovery capabilities across critical systems.
o Review and optimize existing playbooks, policies, tools, and processes managed
by multiple teams to ensure faster and more effective recovery outcomes.
o Train and enable frontline teams such as the Command Center (CCC) and partner
groups to execute recovery playbooks with precision, reducing downtime during
critical incidents.
o Serve as the center of excellence for enterprise incident readiness, raising
organizational resilience standards across Walmart Global Tech.
• Business Continuity
o Architect and implement Walmart’s Enterprise Business Continuity
Certification System—a comprehensive framework that enforces resilience
standards across all critical applications, infrastructure, and business processes.
o Develop AI-driven business continuity platforms that proactively assess
operational risks, monitor cross-domain dependencies, and recommend automated
safeguards.
o Embed One Click Business Continuity Certification across all business units,
transforming readiness into a repeatable, scalable process.
o Partner with legal, compliance, and enterprise risk teams to ensure continuity
practices align with regulatory and operational risk requirements.
o Establish continuous monitoring and policy-driven governance for business-
critical operations, ensuring alignment with Walmart’s mission to provide
uninterrupted customer service.
• Disaster Recovery
o Architect and implement Walmart’s Agentic AI System, enforcing stringent
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) standards
across all critical systems.
o Build Agentic AI-powered recovery orchestration platforms that
autonomously detect failures, reason across distributed systems, and trigger
recovery workflows.
o Deliver One Click DR Certification, ensuring every critical service can be
recovered in a standardized, automated, and auditable way.
o Design and execute enterprise-scale disaster recovery simulations to validate
failover mechanisms, database integrity, cache synchronization, and system-wide
resilience under real-world stress conditions.
o Provide real-time dashboards for leadership visibility into DR readiness,
failover confidence, and AI-driven decision pathways.
Team Building & Talent Development
• Recruit, mentor, and grow global teams of resilience engineers, AI researchers, and
applied scientists.
• Create career development pathways and specialized skill development programs for
disaster recovery engineering
• Expand the Agentic AI program into a dedicated function, creating 100+
specialized roles that strengthen U.S. workforce expertise.
• Foster a culture of innovation at the intersection of AI, resilience engineering, and large-
scale operations.
Enterprise Risk Management & Governance
• Champion automated, policy-driven validation systems with comprehensive
compliance frameworks and audit capabilities
• Partner with legal, compliance, and enterprise risk teams to ensure AI
solutions align with regulatory standards and operational risk management.
• Establish governance processes ensuring all critical applications meet enterprise
RTO/RPO standards through continuous monitoring and validation
• Work with executive leadership, legal, and compliance teams to ensure regulatory
adherence and risk mitigation across all business units
• Build real-time dashboards providing visibility into disaster recovery readiness and AI-
driven decision pathways.
External & Internal Influence
• Represent Walmart at top-tier AI, cloud, and resilience conferences, shaping global best
practices in enterprise AI automation.
• Establish academic and industry partnerships to pioneer AI-driven resilience research.
• Drive cultural adoption of both disaster recovery discipline and agentic AI innovation
across Walmart’s technology and retail organizations.
Industry Leadership & Innovation
• Represent Walmart at top-tier infrastructure and reliability engineering conferences,
establishing thought leadership in enterprise practices
• Drive industry standards development, with methodologies adopted by NIST and Fortune
500 companies
• Lead internal transformation and change management to integrate resilience engineering
into core development and operations practices
You’ll Make an Impact By
• Protecting Multi-Billion Dollar Revenue Streams: Mitigating risks of systemic outages
that could disrupt millions of dollars in sales per hour during peak sales events.
• Launching the One Click DR Certification Program, creating an industry benchmark for
platform-based recovery automation.
• Deploying AI agents that streamline operational workflows, predict failure conditions,
and autonomously initiate recovery actions.
• Transforming Customer Experience: Ensuring uninterrupted access to mission-critical
services for millions of customers nationwide, including essential groceries,
pharmaceuticals, and household necessities
Qualifications
• 16+ years of experience in Site Reliability Engineering, Production Engineering, and
Infrastructure Reliability, with extensive leadership across Fortune 50 enterprises and
global-scale platforms.
• Proven track record of building and leading enterprise-scale programs—including
automation of disaster recovery certification, resiliency-as-a-service platforms, and large-
scale incident management
• Deep understanding of hybrid cloud architectures (private + public) using OpenStack,
GCP, and Azure, along with networking expertise
• Hands-on expertise with Agentic AI and resilience automation, including design of AI-
powered reasoning systems for predictive risk detection, automated failover
orchestration, and policy-driven continuity validation.
• Deep expertise in:
o Building enterprise platforms and certification frameworks
o Designing agentic AI systems for reasoning, prediction, and multi-step task
execution
o Multi-region recovery, data synchronization, and simulation exercises
o Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
governance
• Recognized leader in SRE and Resilience Engineering, driving critical business
continuity programs and Site Reliability Teams
• Demonstrated impact at Fortune 50 scale, including protecting multi-billion-dollar
eCommerce and retail operations during high-stakes events
Preferred Experience
• Retail/eCommerce Expertise: Experience with high-volume transaction systems, point-
of-sale infrastructure, inventory management, and fulfillment operations
• SRE SWAT Team Development: Building and operationalizing elite rapid-response SRE
teams trained for large-scale incident command, playbook execution, and real-time
mitigation.
• Global Operations: Experience managing disaster recovery across multiple geographic
regions and regulatory environments
• Experience managing programs across hybrid cloud and distributed architectures.
• Knowledge of retail, supply chain, or eCommerce AI applications a strong plus.
At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.
For information about PTO, see https://one.walmart.com/notices.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.
For information about benefits and eligibility, see One.Walmart.
Sunnyvale, California US-11657: The annual salary range for this position is $208,000.00 - $416,000.00
Bentonville, Arkansas US-09050: The annual salary range for this position is $160,000.00 - $320,000.00
Additional compensation includes annual or quarterly performance bonuses.
Additional compensation for certain positions may also include:
- Stock
#GlobalTechPlatform
Minimum Qualifications
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 7 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
Option 2: 9 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
4 years’ supervisory experience.
Preferred Qualifications
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
• Experience in site reliability engineering, site and system administration, infrastructure management, or related area.
• Master's degree in site reliability engineering, site and system administration, infrastructure management, or related area and 5 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
• SRE certification (for example, IBM Cloud Site Reliability Engineer).
• We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.
Primary Location
1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.