Collaborate with and consult Architecture, Security, Privacy, and other subject matter experts within Nike
Act as the subject matter expert for resilience and high availability of consumer-facing (e-commerce) systems/applications
Communicate to leadership the status of recovery time objectives (RTO) and recovery point objectives (RPO) for critical systems
WHO WE ARE LOOKING FOR
Within Reliability Engineering, our goal is to provide technical solutions to complex production problems, with a focus on reducing incident and problem toil, speeding detection and recovery of critical incidents through observability, and driving continuous improvement through operational health measurement and sharing.
Ability to lead assessments of applications and infrastructure components to identify gaps related to high availability/disaster recovery
5+ years of software development experience
5 years’ experience in building cloud-based enterprise systems, ideally on AWS.
Proficient in Java 8 and newer
Proficient with JavaScript on frontend (React, Angular, etc.) and backend (Node.js) components.
Demonstrable knowledge of Linux operating system internals, TCP/IP, filesystems, disk/storage technologies
Basic understanding of DNS, Networking, Virtualization
Experience with Docker and/or serverless patterns.
Expertise in designing and building scalable microservices
Expertise in web and web-app patterns
Expertise in SQL and NoSQL datastore systems to build highly scalable solutions
Experience with or expertise in other modern enterprise languages (functional or otherwise, such as Python or Golang)
Experience with securing RESTful APIs and apps using OAuth, OpenID Connect, and JWT
Good understanding of async/non-blocking RESTful API approaches and frameworks
Experience with messaging (pub-sub) patterns
Demonstrated negotiation and influencing skills
Basic understanding of most of the following: ServiceNow, Jira, Jenkins, GitHub, Splunk, New Relic, or an equivalent application performance monitoring tool
WHAT YOU’LL WORK ON
A High Availability Engineer’s responsibilities in this role include, but are not limited to, the following:
Plan and lead chaos engineering exercises to expose weaknesses in Nike production systems, uncover any gaps in monitoring and observability, and put in place solutions to lower the effects of any failures on consumers
Partner with infrastructure and application engineering teams to ensure solutions meet high availability/disaster recovery requirements, including gap identification, assessment, and remediation
Assess current disaster recovery strategy, impacts, and risks including business, legal, and IT perspectives
Propose various application design patterns and develop disaster recovery scenarios/resilience requirements
Act as a subject matter expert to system owners on industry standards and best practices for disaster recovery
Lead technical recovery efforts in the event of a site outage