Bangalore, Karnataka
18 hours ago
Engineer

Company: Qualcomm India Private Limited

Job Area: Engineering Group, Engineering Group > Software Test Engineering

General Summary:

Site Reliability Engineer - AI Infrastructure

About the Role:

Site Reliability Engineering (SRE) is a specialized engineering discipline that combines software development and systems engineering to build and maintain large-scale, highly reliable production systems. At Qualcomm, our SRE team plays a critical role in ensuring our AI infrastructure services deliver maximum reliability, performance, and uptime while enabling rapid innovation and deployment of cutting-edge AI solutions.

As an SRE in our AI Infrastructure team, you will be at the forefront of deploying and managing cloud-based AI inference systems. You'll work with state-of-the-art hardware accelerators and container orchestration platforms to deliver high-performance, scalable AI services. This role demands a unique blend of strong software engineering skills, particularly in Python, deep systems knowledge, and expertise in modern cloud-native technologies.

Our SRE culture emphasizes automation, proactive problem-solving, triaging activities, and continuous improvement. We value intellectual curiosity, collaboration, and a systematic approach to tackling complex distributed systems challenges. 

You'll work with a diverse team of engineers who bring varied backgrounds and perspectives to solve some of the most interesting problems in AI infrastructure.

What You'll Be Doing:

• Design, implement, and maintain large-scale Kubernetes clusters optimized for AI inference workloads, with a focus on performance, reliability, and scalability across cloud environments
• Deploy and manage containerized AI services using Docker, Kubernetes, and KServe (or similar ML serving platforms), ensuring high availability and optimal resource utilization
• Write production-quality Python code to build automation tools, frameworks, and infrastructure management solutions that eliminate manual processes and improve operational efficiency
• Lead triaging efforts for complex production incidents, performing deep-dive analysis to identify root causes and implement permanent fixes
• Debug sophisticated deployment scenarios at multiple levels - from application layer through container orchestration to Linux OS and hardware interfaces
• Support the full lifecycle of AI inference services - from design and capacity planning through deployment, operation, optimization, and continuous refinement
• Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies to ensure reproducible and version-controlled infrastructure
• Collaborate with ML engineers, software developers, and infrastructure teams to optimize AI workload deployment and performance

What We Need to See:

• Bachelor's or Master's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
• 1+ years of experience in Site Reliability Engineering, DevOps, Systems Engineering, or Software Development with a focus on production systems
• Certifications in Kubernetes (CKA, CKAD), cloud platforms (AWS/Azure/GCP), or related technologies will be an added advantage
• Strong proficiency in Python programming with demonstrated ability to write clean, maintainable, and efficient code for automation and tooling
• Deep expertise in Linux/Unix systems administration, including kernel concepts, system calls, networking stack, storage systems, and performance tuning
• Hands-on experience with Kubernetes in production environments, including cluster management, workload orchestration, networking (CNI), storage (CSI), and troubleshooting
• Solid understanding of containerization technologies (Docker, containerd) and container orchestration patterns
• Experience with cloud platforms (AWS, Azure, GCP, or private cloud) and cloud-native architectures
• Proven track record of triaging and resolving complex production issues under pressure, with strong analytical and debugging skills
• Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar platforms
• Strong understanding of networking concepts including TCP/IP, DNS, load balancing, and service mesh architectures
• Excellent problem-solving abilities with a systematic approach to root cause analysis
• Strong communication skills with ability to explain technical concepts to both technical and non-technical audiences

Ways to Stand Out from the Crowd:

• Experience deploying and managing AI/ML inference systems or model serving platforms (KServe, TorchServe, TensorFlow Serving, Triton Inference Server)
• Knowledge of AI hardware accelerators (GPUs, TPUs, or specialized AI chips) and their integration in cloud environments
• Familiarity with AI/ML frameworks such as PyTorch, TensorFlow, or ONNX
• Knowledge of security best practices for containerized environments and cloud infrastructure
• Contributions to open-source projects related to Kubernetes, cloud-native technologies, or SRE tooling
• Experience with capacity planning and performance optimization for high-throughput systems
• Background in implementing and maintaining disaster recovery and business continuity solutions

Key Competencies:

• Automation-First Mindset: Passion for eliminating repetitive manual work through intelligent automation
• Systems Thinking: Ability to understand how complex distributed systems interact and impact each other
• Ownership and Accountability: Taking end-to-end responsibility for services and their reliability
• Continuous Learning: Growth mindset with eagerness to learn new technologies and methodologies
• Collaboration: Ability to work effectively in global, cross-functional teams
• Resilience Under Pressure: Maintaining composure and effectiveness during critical incidents
• Attention to Detail: Thoroughness in implementation, testing, and documentation

What We Offer:

• Opportunity to work on cutting-edge AI infrastructure at scale
• Collaborative environment with world-class engineers
• Continuous learning and professional development opportunities
• Exposure to latest technologies in cloud computing, AI/ML, and distributed systems
• Competitive compensation and benefits package

Qualcomm is an equal opportunity employer committed to diversity and inclusion in the workplace.

Minimum Qualifications:

• Bachelor's degree in Engineering, Information Systems, Computer Science, or related field.

Applicants: Qualcomm is an equal opportunity employer. If you are an individual with a disability and need an accommodation during the application/hiring process, rest assured that Qualcomm is committed to providing an accessible process. You may e-mail disability-accomodations@qualcomm.com or call Qualcomm's toll-free number found here. Upon request, Qualcomm will provide reasonable accommodations to support individuals with disabilities to be able to participate in the hiring process. Qualcomm is also committed to making our workplace accessible for individuals with disabilities. (Keep in mind that this email address is used to provide reasonable accommodations for individuals with disabilities. We will not respond here to requests for updates on applications or resume inquiries.)

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

If you would like more information about this role, please contact Qualcomm Careers.
