System Development Manager, Cloud Compute/GPU/Storage Server Team

Cupertino, CA, US

3 days ago

Amazon.com

We have two distinct System Development Manager positions open: one leading the storage server team and one leading the AI/ML (GPU-based) accelerator server team. Because the core responsibilities, technical depth, and leadership expectations overlap significantly across both roles, we are accepting applications through this single posting. During the interview process, we will assess fit for both positions and align candidates to the team where their experience and interests are the strongest match.

We are looking for a forward-thinking technical leader to manage a diverse, cross-functional team of Hardware Design Engineers, Systems Development Engineers, and Technical Program Managers responsible for developing storage or accelerated (AI/ML/GPU) server platforms for AWS.

This is not a role for someone who manages from a distance. You will set the technical vision and architectural direction for next-generation server platforms — making bold bets on where storage or accelerated compute infrastructure needs to go — and then build and lead the team that delivers it. Your success is measured not just by launching hardware, but by driving fast instance adoption because you built the right thing for the customer.

You will own the full lifecycle: design, build, test, deploy to the data center, launch, and fleet health beyond launch. You will lead a team of architects defining what we build next, an NPI team delivering it through build and test to the data center, and an operations-focused engineering team ensuring it runs reliably at scale long after launch. You will connect these functions into a single, cohesive organization that moves fast and delivers high-quality server platforms that customers want to adopt.

You will work across organizational boundaries with other AWS service teams to deeply understand customer workloads and translate that understanding into hardware architecture decisions. You will lead relationships with ODMs and design partners to develop and manufacture your products at scale. When complex technical problems arise (across hardware, firmware, software, thermal, power, or signal integrity), you will have the technical depth to engage meaningfully and the judgment to drive the right trade-offs.

Key job responsibilities
Vision & Architecture
- Set the technical vision and multi-generational roadmap for storage or accelerator (AI/ML/GPU-based) server platforms
- Make architectural bets that differentiate AWS — anticipating customer needs and industry shifts before they become obvious
- Manage a team of hardware architects in defining server platform architectures that optimize for performance, reliability, cost, and speed of customer adoption
- Translate deep understanding of customer workloads (storage, AI/ML training, inference) into hardware design decisions
- Influence the broader AWS hardware strategy through data, conviction, and results

Design, Build & Test
- Own server platform development from architecture through detailed design, prototype, build, and qualification
- Manage a team of engineers responsible for design, build and launch of systems
- Lead ODM/JDM and design partner relationships, ensuring our requirements for performance, quality, testability, and diagnostics are met
- Drive design verification, system validation, and qualification — ensuring platforms meet reliability, performance, and cost targets before deployment
- Ensure systems are designed for operational excellence from day one — testability, diagnosability, and serviceability are built in, not bolted on

Deploy, Launch & Fleet Health
- Own deployment to the data center, launch readiness, and successful ramp into production
- Drive qualification and readiness milestones, removing technical and organizational blockers to get servers into the fleet
- Own fleet health beyond launch — your responsibility never ends. Monitor quality, reliability, and customer experience for the life of the platform
- Drive toward zero-touch operations — building automation infrastructure that detects, diagnoses, and remediates faults before customer impact
- Build predictive failure detection capabilities using telemetry, error trending, and log correlation
- Establish and track fleet health metrics (failure rates, MTTD, MTTR, first-time fix rate, predictive accuracy)
- Close the loop between field failures and design improvements in next-generation platforms

Team Leadership & Development
- Manage and grow a diverse team spanning hardware engineering, systems development, and technical program management
- Hire, develop, and retain top talent across multiple engineering disciplines
- Create an environment where engineers with fundamentally different expertise (hardware, firmware, software, program management) collaborate effectively and challenge each other
- Set clear goals, remove obstacles, and hold the team to high standards on delivery and quality
- Coach and develop senior technical leaders — help architects think bigger and help execution-focused engineers see the strategic picture

Cross-Organization Collaboration
- Partner with AWS service teams to ensure server platforms meet data path and control path requirements and drive fast adoption
- Work with supply chain, manufacturing, and datacenter operations teams to deliver at scale
- Influence peer teams and senior leadership on technical direction, investment priorities, and trade-offs
- Represent your team's work and roadmap to VP-level and above

About the team
This organization is responsible for designing, building, testing, launching and maintaining a fleet of AI/ML (GPU-based) servers and storage servers for Amazon's web services. Our engineers work with leading-edge technologies, solve challenging problems, influence the industry's roadmaps, and develop unique solutions that are ahead of the pack. We work in an environment that fosters innovation and creativity — we encourage and invest in new directions and ideas that serve our customers better.

The organization comprises Hardware Design Engineers, Systems Development Engineers, and Technical Program Managers, all with the common goal of delivering the best storage and accelerator server fleet possible to our customers. We are located in Seattle and Cupertino, and we work with ODMs and Design Partners globally.

We own the full lifecycle of our server platforms: design, build, test, deploy to the data center, launch, and fleet health beyond launch. There is no hand-off — we are accountable from first architecture decision through every day the server runs in production.
