MLOps Engineer (Remote)
CDM Smith
Trinnex, a wholly owned subsidiary of CDM Smith, is seeking an MLOps Engineer specializing in AI platforms to join our growing team. Trinnex is building next-generation tools that integrate sensor/IoT data, models, geospatial data, and machine learning to solve unique engineering and environmental issues.
**This position is based in Toronto, Ontario; candidates located in Vancouver, BC or Edmonton, Alberta may also be considered.**
In this role, you will own the operational backbone for our AI and Data Engineering products. You will be responsible for the end-to-end production lifecycle of our ML models, from helping build the application services that wrap them to creating the automated systems that deploy them. Your ultimate goal is to ensure the overall health, scalability, and reliability of these machine learning systems in production. This requires close collaboration with internal teams to research and implement MLOps best practices, driving continuous improvement and automation across our platforms.
Responsibilities:
• Design, build, and maintain scalable and reliable infrastructure to support the entire machine learning lifecycle, from experimentation and training to deployment and monitoring.
• Develop and manage robust CI/CD pipelines for ML models and associated software services, ensuring automated, high-quality releases.
• Collaborate closely with Data Scientists to containerize, deploy, and operationalize machine learning models, implementing solutions for both batch prediction and real-time inference use cases.
• Collaborate with teams to architect generative AI applications, providing expert guidance on connecting LLMs to proprietary data sources and enabling them to execute tasks on behalf of users.
• Champion MLOps best practices and empower the Data Science team by providing guidance, training, and support for new tools and automated workflows.
• Partner with Software Engineers to define and implement modern service architectures, including microservices and APIs, for ML-powered applications.
• Implement and manage cloud infrastructure using Infrastructure as Code (IaC) principles to ensure environments are reproducible, secure, and auditable.
• Establish and maintain comprehensive monitoring, logging, and alerting systems to track model performance, data drift, and infrastructure health, and aid in incident response.
• Work with cybersecurity and architecture teams to design and enforce security best practices across our cloud environment, including network configuration, identity management, and data protection.
• Maintain clear and detailed documentation for MLOps processes, infrastructure, and best practices.
Skills and Abilities:
• Excellent software engineering fundamentals, with a solid understanding of modern software service architecture (e.g., microservices, APIs) and CI/CD principles.
• Deep, hands-on expertise with containerization (Docker) and container orchestration (Kubernetes).
• Proven experience designing, building, and securing infrastructure on a major cloud platform (e.g., GCP, AWS, Azure), with a firm grasp of core concepts such as identity and access management (IAM) and secure network architecture, including VPCs, firewall policies, and network segmentation.
• Demonstrable understanding of the end-to-end machine learning lifecycle and experience deploying models for both batch and real-time/live inference workloads.
• Experience working with and understanding the trade-offs between different data storage paradigms, such as relational databases (e.g., PostgreSQL), analytical data warehouses (e.g., BigQuery), and cloud object storage (e.g., GCS, S3).
• Solid understanding of Python.
• Excellent communication, interpersonal, and organizational skills, with a demonstrated ability to manage and prioritize multiple tasks effectively, both independently and as part of a team.
#LI-LP1
#LI-REMOTE