Sr. Technical Product Manager AI/ML Training, Annapurna Labs
Amazon.com
AWS Neuron is looking for an experienced Technical Product Manager to define and drive product strategy for ML training software. You will be part of the AWS Neuron Product Management team, driving innovation in machine learning training acceleration. AWS Neuron is the software stack for Trainium and Inferentia, the AWS Machine Learning chips, delivering best-in-class ML training performance in the cloud. You will lead training software requirements working backward from customer needs, drive training frameworks, and collaborate with open source communities and ML ecosystem partners, enabling customers to successfully develop and optimize ML training workloads on AWS Trainium through deep understanding of distributed training, compilation systems, and hardware acceleration.
The ideal candidate will have a solid understanding of AI/ML model training, distributed training architectures, and performance optimization techniques. They should be able to assess the technical implications of training software stack decisions, understand customer needs, and drive developer experience. Experience with large-scale distributed training, model parallelism strategies, and hardware acceleration is valuable.
Additionally, the ideal candidate should have:
* Proven track record of driving training product strategy and owning roadmap definition in complex technical environments
* Experience delivering training features with deep understanding of technical trade-offs and product implications
* Strong ability to contribute to and influence engineering discussions around training technology decisions and strategy
* Demonstrated success in representing customer training needs and driving alignment during executive-level prioritization
* Track record of delivering results in fast-paced, ambiguous environments, particularly in early-stage programs
* Experience with modern ML training workflows and collaborative open-source projects
* Experience with model parallel training techniques including tensor, pipeline, and sequence parallelism
Key job responsibilities
* Drive and execute training product strategy and roadmap working backwards from customer requirements in collaboration with engineering technical leadership
* Assess technical implications of training architecture and optimization decisions
* Drive technical alignment across Neuron training components, distributed training workflows and dependencies
* Work directly with software engineering teams to define and execute on new training features
* Produce clear and concise documents, such as PRFAQs and PRDs, for training capabilities
* Write user stories and validate that training features meet developer needs
* Drive feature discussions with customers, engineering, and other stakeholders around training use cases
* Anticipate bottlenecks in training workflows, manage risk and escalations, and balance technical constraints
* Find opportunities to innovate on behalf of our training customers and design features that address those opportunities
* Build ecosystem partnerships focused on training and stay connected with industry trends
* Represent the training product in relevant industry events
About the team
About AWS Neuron:
AWS Neuron is the software stack for Trainium and Inferentia, the AWS Machine Learning chips. Inferentia delivers best-in-class ML inference performance at the lowest cost in the cloud to our AWS customers. Trainium is designed to deliver best-in-class ML training performance at the lowest training cost in the cloud, and it is all enabled by AWS Neuron. Neuron includes an ML compiler and native integration into popular ML frameworks. Our products are used at scale by external customers like Anthropic and Databricks as well as internal customers like Alexa, Amazon Bedrock, Amazon's Rufus AI assistant, Amazon Robotics, Amazon Ads, Amazon Rekognition, and many more.