Chennai, India
11 days ago
ML Ops & Observability Engineer

Use Your Power for Purpose

At Pfizer, technology drives everything we do. You will play a pivotal role in implementing impactful and innovative technology solutions across all functions, from research to manufacturing. Whether you are digitizing drug discovery and development, identifying innovative solutions, or streamlining our processes, you will be making a significant impact on countless lives. 

What You Will Achieve

MLOps Platform Execution & Model Operations

Lead the design, implementation, and operation of MLOps platforms supporting model development, deployment, monitoring, and lifecycle management.

Own production workflows for:

Model packaging and deployment

Versioning and rollback

Promotion across environments (dev/test/prod)

Implement standardized CI/CD pipelines for ML workloads, integrating with enterprise DevOps and infrastructure platforms.

Partner with infrastructure and DataOps teams to ensure ML workloads run on secure, scalable, and cost-effective cloud-native environments (AWS/Azure).

Translate Director-level AI platform strategy into reliable, repeatable ML operational capabilities.

Model, Data & System Observability

Own end-to-end observability for ML systems, spanning:

Model performance and behavior

Data quality and drift

Pipeline health and system reliability

Implement and operate observability tooling using:

OpenTelemetry for distributed tracing

Metrics and dashboards (Prometheus, Grafana)

Logs and analytics (ELK or equivalent)

Define and track ML-specific reliability signals, such as:

Model performance degradation

Data drift and feature anomalies

Prediction latency and failure rates

Establish SLOs and alerting strategies for ML services in production.

Testing, Validation & Responsible AI Enablement

Ensure testing and validation are embedded throughout the ML lifecycle, including:

Model validation and regression testing

Data and feature consistency checks

Deployment verification and rollback testing

Integrate automated ML testing and quality gates into CI/CD pipelines.

Support non-functional testing for ML systems, including:

Performance and scalability testing

Reliability and resilience testing

Security and access validation

Partner with AI, data, and compliance teams to support responsible and compliant AI operations, including auditability, traceability, and explainability hooks (where required).

AI Platform Enablement & Cross‑Team Collaboration

Enable data scientists and ML engineers to move models from experimentation to production efficiently and safely.

Provide reusable tooling, templates, and paved paths for:

Experiment tracking

Model registry usage

Deployment and monitoring patterns

Collaborate closely with:

Infrastructure Engineering (runtime, scaling, security)

DataOps Engineering (data pipelines, feature stores, data quality)

Product and analytics leaders to align ML capabilities to business outcomes.

Reliability, Incident Management & Continuous Improvement

Own operational reliability for ML platforms and services.

Lead response to ML-related production incidents, including:

Model failures or degradations

Data drift–driven issues

Pipeline or inference outages

Conduct post-incident reviews and drive systemic improvements.

Continuously improve MLOps maturity using SRE-inspired practices and metrics.

People Leadership & Engineering Ways of Working

Set clear expectations for operational ownership, quality, and delivery.

Coach engineers on:

MLOps best practices

Observability and reliability mindset

Secure and compliant AI operations

Establish strong engineering discipline through design reviews, runbooks, documentation, and continuous learning.

Act as the primary execution partner to the Director-level Commercial AI Analytics Solutions & Engineering Lead for ML operations and observability.

Here Is What You Need (Minimum Requirements)

8+ years of experience in ML engineering, MLOps, platform engineering, or related roles, with 3+ years of people leadership.

Strong hands-on experience operationalizing ML systems in AWS or Azure environments.

Proven expertise in:

MLOps pipelines and tooling (experiment tracking, model registry, deployment, monitoring)

CI/CD for ML workloads (e.g., GitHub Actions or equivalent)

Containerized and cloud-native ML runtimes

Solid understanding of testing and validation for ML systems, including:

Model regression and performance testing

Data and feature validation

Deployment and rollback verification

Strong experience implementing observability and reliability practices using tools such as OpenTelemetry, Prometheus, Grafana, and ELK.

Demonstrated experience with DevSecOps and secure SDLC for AI/ML systems, including secrets management and access controls.

Proficiency in programming and scripting (e.g., Python, Bash, SQL; familiarity with ML frameworks).

Strong communication and collaboration skills; ability to deliver outcomes through teams and influence cross-functionally.

Bonus Points If You Have (Preferred Requirements)

Master's degree in Computer Science, Data Science, AI/ML, or related field.

Experience with MLOps platforms and tools (e.g., MLflow, Kubeflow, feature stores).

Background in data drift detection, model monitoring, and ML reliability engineering.

Familiarity with responsible AI, governance, or regulated environments.

Relevant certifications:

AWS or Azure professional-level certification

Kubernetes (CKA/CKAD)

Cloud security or data/AI platform certifications

Work Location Assignment: Hybrid

Pfizer is an equal opportunity employer and complies with all applicable equal employment opportunity legislation in each jurisdiction in which it operates.

Information & Business Tech
