Azcapotzalco, Ciudad de México, Mexico
23 hours ago
Platform Site Reliability Engineer

If you’re looking for a career where you can make a real impression, join Global Service Center (GSC) HSBC and discover how valued you’ll be. HSBC is one of the largest banking and financial services organisations in the world, with operations in 64 countries and territories. We aim to be where the growth is, enabling businesses to thrive and economies to prosper, and, ultimately, helping people to fulfil their hopes and realise their ambitions.

We are currently seeking an experienced professional to join our team in the role of Platform Site Reliability Engineer.

Role Purpose:

This role sits within the Wealth and Personal Banking CTO Engineering Foundation group. We are driving innovation using Cloud technologies by designing, building, and operating mission-critical shared service platforms hosting the APIs and Microservices which underpin the bank's digital products and services.


Main Activities:

As an SRE you will have responsibility to:

Ensure the availability and maintainability of our large-scale API and Microservices platform located across three points of presence in HK, UK, and the US.
Continuously improve the reliability, capacity, and performance of our platforms by applying SRE principles and practices to drive scale, enhance observability, reduce toil, more accurately measure risk, and more safely enable business-driven change.
Elevate our expertise and maturity in safely managing our core technology stack underpinned by AWS, Kubernetes, Kong API gateway, Istio Service Mesh, and a host of supporting services in a hybrid hosting environment (i.e., private/public cloud on-prem).
Develop best-in-class observability tools and techniques enabling monitoring and alerting capabilities that facilitate not only incident detection and response, but also capacity management, improved release safety, and greater resource efficiency.
Investigate, triage, and resolve production incidents and use data to articulate impact, with relentless attention to the technical signals and underlying root causes that enable remediation and future avoidance/mitigation.
Contribute to the design and engineering of auto- and self-healing capability for known failure modes across our platforms.
Contribute code to our platform repositories enabling not only our reliability agenda (e.g., monitoring-as-code), but also higher release speed and safety, simpler tenant onboarding, and improved controls.
Author, contribute to, and maintain our evolving knowledge base, including support and operational runbooks, platform tenant guides, and onboarding and release documentation, with an underlying goal of driving as much best practice and self-service as possible.
Participate in a regular SRE on-call rota supporting a 24/7/365 support model across our mission-critical platforms within a large banking ecosystem of front-end, middleware, and back-end fulfilment systems.

To be a successful SRE in our team you will:

Be fluent in written and spoken English and be comfortable working in a multi-cultural and diverse organisation with team members across the globe.
Value effective and continual communication, honesty, transparency, and accountability.
Value failure as an opportunity and an investment in more reliable systems (blameless post-mortem culture).
Possess fundamental, evidence-based problem-solving skills; drive decision-making by function, with a first-principles mindset.
Demonstrate a bias to action and avoid analysis paralysis; maintain a sense of ownership as you drive actions to the finish line with high quality and on time.
Be ego-less when searching for the best ideas and contribute effectively outside of your specialty; think about solving problems from the standpoint of the best outcome for the team.
Have strong fundamental knowledge of distributed systems and networking.
Possess programming experience in at least one of the following languages: Python, Java, Go, Ruby, Bash scripting.
Have the ability to debug and optimise code, while automating routine tasks (i.e., toil reduction).
Have a strong background in the setup, use, and optimisation of a variety of observability tools including Splunk, Datadog, AppDynamics, and CloudWatch.
Understand the concepts of quantifying failure and availability in a prescriptive manner using SLOs, SLIs, and Error Budgets.