SDET - Software Eng Test, Alexa AI Developer Tech
Alexa+: An AI assistant that gets things done, she's smarter, more conversational, and more capable. With 600 million Alexa devices now out in the world, the latest advancements in generative AI have unlocked new possibilities enabling us to reimagine the experience in our pursuit of making customers' lives better and easier every day. Experts enables Alexa+ to complete complex tasks using Amazon and third-party (3P) APIs and websites. Some of these include: playing your favorite music, control your smart devices, reserving a table at a restaurant, setting your morning alarms, kitchen timers, booking event tickets, scheduling an appointment and planning a trip. No more jumping between apps just say what you want, and it's done. Developers and partners have been core to our vision since the beginning. With Alexa+, we reimagined how developers can build for Alexa. Alexa AI developer Technology charter is to enable creation of experts at scale and make it seamless for 1P and 3P developers to integrate their businesses with Alexa+. We are measurably making Alexa+ smarter, and we need your help to define and build the next generation of capabilities using GenerativeAI.
Key job responsibilities
As an SDET driving the creation of next-generation LLM-based evaluation systems for Alexa+, you'll design and build the frameworks that define how conversational intelligence is measured, determining whether millions of daily interactions feel accurate, natural, and human-centered. Your work goes far beyond binary pass/fail tests: you'll engineer automated systems that assess accuracy, reasoning depth, tone, and responsiveness across multimodal, context-rich conversations.
In this role, traditional testing boundaries dissolve. You'll evaluate not just functional correctness, but whether the AI's responses are contextually relevant, emotionally aligned, and conversationally fluent. Your systems will measure everything from factual accuracy and task completion to subtler attributes like dialog flow, personality coherence, and graceful curtailment. Partnering closely with scientists and engineers, you'll automate the detection of conversational regressions, identifying hallucinations, degraded reasoning, or misaligned tone before they reach customers. You'll leverage prompt-driven evaluation pipelines, LLM-as-a-Judge (LLMaaJ) frameworks, and reference-based validation to ensure assessments remain consistent, explainable, and scalable across model versions and releases.
You'll also collaborate with prompt engineers, model developers, and product teams to establish robust, category-specific testing methodologies, from quick one-shot actions and task fulfillment to multi-turn dialogues and creative, open-ended interactions. Through your work, evaluation evolves from a quality gate into a self-improving assessment framework, one that learns, adapts, and ensures every voice interaction feels naturally conversational.
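For illustration only, a minimal sketch in Python of what a reference-based LLM-as-a-Judge check of this kind might look like. The call_llm helper, the rubric dimensions, and the 1-to-5 scale are assumptions made for the sketch, not an existing Alexa+ API or framework.

    import json

    # Hypothetical rubric prompt: the judge model grades one assistant reply
    # against a reference answer and returns structured JSON scores.
    JUDGE_PROMPT = """You are evaluating an AI assistant's reply.
    User request: {request}
    Reference answer: {reference}
    Assistant reply: {reply}

    Score the reply from 1 to 5 on accuracy, completion, and tone, then return
    JSON like {{"accuracy": 4, "completion": 5, "tone": 4, "reason": "..."}}."""


    def call_llm(prompt: str) -> str:
        """Placeholder for the real judge-model client; returns its raw completion."""
        raise NotImplementedError("Wire this to your model endpoint of choice.")


    def judge_reply(request: str, reference: str, reply: str) -> dict:
        """Ask the judge model to grade one reply against the reference answer."""
        raw = call_llm(JUDGE_PROMPT.format(request=request, reference=reference, reply=reply))
        # In practice: validate the schema and retry or flag on malformed JSON.
        return json.loads(raw)


    def passes_release_bar(scores: dict, threshold: int = 4) -> bool:
        """Simple regression gate: every rubric dimension must meet the threshold."""
        return all(scores[k] >= threshold for k in ("accuracy", "completion", "tone"))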