There’s nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within the AIML Data Platforms and Chief Data and Analytics Team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities
Assists in operating and maintaining the managed AWS and Data platforms; provides day-to-day engineering and operational support to SRE and application teams under guidance.
Supports platform design, setup, and configuration; performs workspace administration, resource monitoring, and basic troubleshooting for data engineering, Data Science/ML, and application/integration teams.
Participates in evaluation activities with external vendors, startups, and internal teams; documents findings and recommendations for senior review.
Contributes to improvements in system observability, alerting, and capacity planning by building dashboards, updating runbooks, and implementing basic automation.
Collaborates with engineering and data teams to optimize infrastructure and deployment processes, focusing on automation and operational excellence; writes and maintains scripts or pipelines following standards.
Implements and troubleshoots software solutions; contributes to design and development tasks and escalates complex issues appropriately.
Writes secure, high-quality production code for features and fixes; performs basic peer reviews and debugs own code when needed.
Identifies recurring issues and proposes or implements automation and remediation steps to improve operational stability of applications and systems.
Contributes to a team culture of inclusion, respect, and continuous learning.
Applies Site Reliability Engineering best practices (e.g., SLIs/SLOs, error budgets, incident response) with direction from senior engineers to support reliability, scalability, and performance of data platforms.
Participates in incident response following established procedures; assists with root-cause analysis, postmortem documentation, and implementation of corrective actions.
Required qualifications, capabilities, and skills
Formal training or certification on software engineering concepts and applied experience.
Proficiency in site reliability culture and principles, and familiarity with how to implement site reliability within an application or platform.
Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
Understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
Experience with monitoring tools, automation frameworks, and CI/CD pipelines.
Experience writing Python applications or scripts and using automated unit testing frameworks.
Experience with Terraform development and understanding of Terraform Enterprise.
Experience contributing to system design discussions, application development, testing, and supporting operational stability.
Familiarity with big data distributed compute frameworks such as Apache Spark, AWS Glue, and MapReduce.
Strong troubleshooting, analytical, and communication skills.
Preferred qualifications, capabilities, and skills
Familiarity with distributed systems and large-scale data processing.
Experience with AWS and Python.
Knowledge of containerization (Docker, Kubernetes) and orchestration.