Within the Oracle Health (OHAI) organization, the new EHR and Clinical AI Agent (CAA) cloud services are at the forefront of generative AI services for healthcare organizations. Building on the success of the established Oracle Digital Assistant (ODA) product, EHR and CAA let healthcare providers use advanced AI technologies, together with voice commands, to reduce manual work and focus on patient care.
Oracle Health EHR and CAA are expanding their Oracle Cloud Infrastructure (OCI) Operations teams and are looking to bring in new Site Reliability Engineers (SREs). As an SRE, you will solve technical challenges on an advanced OCI cloud service platform, focusing on areas such as reliability, scalability, resilience, security, and performance.
You will define how to use the latest technologies to optimize the operational efficiency of the service. You will gain a deep understanding of chatbots, cognitive services, machine learning, and analytics. You will work with a team pushing the boundaries of a scalable, self-healing, autonomous platform built on Kubernetes, Docker, Prometheus, and Grafana. You will be exposed to a wide range of OCI cloud services and learn how we interact with many dependent services across the organization.
Areas of responsibility
Service Ownership
As part of the EHR/CAA team, you will:
Be responsible for all operational aspects of the OCI services included in our portfolio.
Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of the EHR and CAA products.
Own the end-to-end availability, reliability, and performance of a cloud service.
Participate in LiveSite operations, working rapidly to mitigate issues that may arise.

Service Design
Design and implement solutions for rolling out software and security updates with zero downtime.
Partner with development and product management to build and maintain platform and automation frameworks that ensure maximum uptime and predictability, preventing outages and service interruptions or degradation.
Analyze system failures and develop rapid response processes.

Operations Engineering
Evaluate the operation of cloud service deployments across commercial and government datacenters.
Monitor the degradation of the service and its dependencies under load, and implement solutions to ensure high availability for our customers.
Analyze resource utilization and scaling requirements in a high-end production system.
Resolve security vulnerabilities to conform to corporate and government security standards.

Automation
Building on your understanding of automation and orchestration principles, identify opportunities to automate SRE procedures in production environments.
Design the solutions you implement to minimize the possibility of errors being introduced into the system.

Technical Expertise
Handle complex, critical issues encountered in production environments, drawing on your accumulated technical knowledge to rapidly identify the issues and apply mitigation steps.
Develop an understanding of the underlying AI technologies used to implement the EHR and CAA services.

We're looking for an SRE/DevOps/Platform Engineer to work with the Developer Experience team.
You will get a chance to work with cutting-edge technology and facilitate the development of AI/ML platforms. Qualified candidates will have these skills: Cloud Computing (OCI, Azure, AWS), IaC (Terraform), Kubernetes, Coding (Python, Java). You will work with the Developer Experience (DevEx) team on the shared full-stack ownership of a collection of services across a variety of technology areas. You will be responsible for understanding the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. You will help engineers deploy their software to production (in datacenters spanning the globe) using best practices for CI/CD, Monitoring/Alerting, and HA/DR.

Requirements
Minimum 5 years of hands-on Platform Engineering, DevOps, or SRE experience.
BS or MS in Computer Science, Computer Engineering, or equivalent.
Excellent team skills, can-do attitude, and focus on quality.
Technical role with a history of embracing automated processes, cloud-native application design principles, and a CI/CD DevOps model.
Strong troubleshooting capabilities targeting complicated problems in remote systems.
Experience with production operations and best practices for deploying quality code in production and troubleshooting issues when they arise.
Experience with public cloud (OCI, AWS, GCP, Azure).
Knowledge of Infrastructure as Code (IaC), Configuration as Code (CaC), GitOps, and tools such as Terraform, Argo CD, Flux, etc.
Experience and working knowledge in languages like Python or Java.
Experience deploying, configuring, managing, and debugging cloud infrastructure and platform software such as OpenStack, Kubernetes, etc.
Experience with public cloud managed Kubernetes (such as OCI/OKE, AWS/EKS, GCP/GKE, Azure/AKS).
Experience with cloud-native administration and monitoring/alerting technologies such as Docker, Helm, Prometheus, Grafana, EFK/ELK, Jaeger, or similar technologies.
Experience designing and implementing CI/CD pipelines, platforms, and components such as Jenkins and Argo CD.
Knowledge of version control using Git.
Experience in a Linux/Unix environment.
Experience with application frameworks such as Spring, Helidon, Micronaut, etc. is a plus.
Experience developing or designing healthcare software is a plus.
Experience working in an Agile/Scrum development process is a plus.
Experience working with MLOps/AIOps tooling is a plus.
Must be eligible to obtain and maintain a US government security clearance appropriate for this role, which requires you to be a US Citizen.