Lead Engineer for Manufacturing and Datacenter Lab, Trainium Manufacturing, Quality and Reliability
Amazon.com
Within the Trainium Manufacturing Quality & Reliability (TRN MQR) organization, we are establishing a critical new function that bridges manufacturing outcomes with datacenter operational performance. We are seeking a talented and motivated Manufacturing & Datacenter Preparedness Lab Leader to build and lead this strategic capability in Austin, Texas.
This role will report to the leader of Trainium Manufacturing Quality & Reliability and serve as the essential feedback loop between our ODM/JDM/CM manufacturing operations and AWS datacenter fleet performance. You will establish and operate a specialized preparedness lab focused on analyzing datacenter performance of manufactured Trainium systems to identify root causes of field rework and repairs, feeding critical insights back into manufacturing processes, test strategies, and design improvements.
You will participate in the early phase of manufacturing line development for our next-generation servers and racks, improving manufacturing flows and informing system design, manufacturing, and fleet operations. You will manage early lifecycle changes, identify initial product quality improvements, and drive supplier quality activities to technical root cause. The ideal candidate has experience in design or manufacturing and is capable of making wide-ranging business decisions on behalf of the organization.
You'll join a diverse team working across Manufacturing Engineering, Manufacturing Test Engineering, and Quality & Reliability Engineering. You'll collaborate with people across AWS Data Center Engineering, Hardware Design, ODM/JDM/CM partners, and datacenter operations teams to help us deliver the highest standards for safety and reliability while providing seemingly infinite capacity at the lowest possible cost for our customers. And you'll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.
Key job responsibilities
- Own operational production performance of Trainium systems across the entire product lifecycle, from manufacturing through datacenter deployment and fleet operations
- Design and build a preparedness lab that replicates datacenter conditions for assembly, repair, and system testing
- Define and drive assembly and repair recipes in the manufacturing lab as the baseline prior to high-volume manufacturing and datacenter deployment
- Ensure all manufacturing and datacenter test flows are regressed in the manufacturing lab prior to deployment
- Influence hardware design strategy for Design for Manufacturing (DFM), Design for Reliability (DFR), and Design for Test (DFT) based on field failure analysis
- Establish data-driven analytics frameworks connecting manufacturing test data to datacenter performance, leveraging ML techniques to predict field failures
- Build and mentor a cross-functional team spanning manufacturing, test, quality, and reliability engineering; act as a force multiplier by performing technical promotion assessments
- Collaborate with AWS datacenter operations teams to understand failure modes, repair patterns, and operational challenges firsthand; translate operator insights and field learnings into actionable manufacturing process improvements and design changes
- Drive continuous improvement to reduce failure rates and lifecycle degradation through rapid feedback loops
- Develop or adapt manufacturing processes at the ODM and CM, including defining fixture requirements, critical assembly requirements, test methodology, and signal integrity, power, and heat management requirements
About the team
Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro(K2), Graviton, Inferentia, and Trainium families of processors.
Machine Learning Annapurna functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization.
We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability.
This position is in the Manufacturing, Quality and Reliability team.