Plano, TX, United States
Site Reliability Engineer III

There’s nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.

Take on a critical role in defining the future of a globally recognized firm, with direct and significant impact in a field built for top achievers in site reliability.
As a Site Reliability Engineer at JPMorgan Chase within the Corporate Sector, Enterprise Technology team, you will be instrumental in enhancing intelligent and resilient platform operations for a global financial institution. You will drive the integration of traditional support with modern Site Reliability Engineering (SRE) principles, using agentic AI as a core capability to achieve our vision of a proactive, automated, and customer-centric reliability function. This role demands deep technical expertise, a growth-oriented mindset, and a strong dedication to operational excellence. You will excel in modern infrastructure and observability, promoting AI-powered incident management, autonomous runbooks, and support intelligence initiatives.
 

Job responsibilities

Advocate and embody site reliability principles, fostering a culture of excellence and technical influence within your team.
Leverage AI tools to enhance operational effectiveness and automate processes, ensuring high-quality customer service.
Spearhead projects aimed at enhancing the reliability and stability of applications and platforms.
Utilize data-driven analytics and AI technologies to automate detection, diagnosis, and resolution processes, elevate service levels, and drive continuous improvement.
Engage stakeholders to establish realistic service level objectives and error budgets, ensuring alignment with customer expectations.
Exhibit technical proficiency in one or more domains, proactively addressing technology-related bottlenecks.
Employ AI-driven solutions to streamline processes and enhance operational efficiency.
Participate in troubleshooting during incidents, demonstrating the ability to swiftly identify and resolve issues to prevent financial losses.
Act as a culture carrier by documenting learnings and disseminating knowledge through internal forums and communities of practice.
Mentor team members, guiding them in the strategic adoption of AI technologies to enhance operational effectiveness and customer service.

Required qualifications, capabilities, and skills

Formal training or certification on site reliability engineering concepts and 2+ years applied experience in areas such as resiliency, scalability, performance and security.
Proven success in an SRE or DevOps role, with knowledge of service level indicators/objectives (SLIs/SLOs), incident management, blameless postmortem analysis, and systems reliability.
Expert with observability stacks (e.g., Prometheus, Grafana, Splunk, OpenTelemetry), including deep experience correlating telemetry across services and time.
Hands-on skills in coding (at least one high-level programming language), cloud platforms (AWS or GCP), container orchestration (Kubernetes), infrastructure as code (Terraform), and resilient CI/CD pipelines.
Active experience or deep curiosity in applying AI to operations, such as LLM-based copilots, anomaly detection, automated runbooks, autonomous agents (e.g. CrewAI, LangGraph), or Retrieval-Augmented Generation (RAG) workflows for support.
A track record of delivering under pressure. You finish what you start, adapt to uncertainty, and thrive in high-accountability environments.
You deconstruct complexity, organize effectively, and drive clarity into ambiguous operational environments. Documentation and design are second nature.
Outstanding communication, empathy, and professionalism, especially during incidents. You recognize that great systems serve real people.

Preferred qualifications, capabilities, and skills

Experience with operational and compliance rigor in banking, fintech, or similar.
Practical use of LLM frameworks (e.g. LangChain, Semantic Kernel), AI orchestration tools, vector databases, or custom agents supporting reliability workflows.
Experience with game days, chaos experiments, or failure-mode analysis to improve service robustness.
A background in mentoring engineers or leading technical knowledge-sharing, especially around AI and SRE best practices.