Hyderabad, Telangana, India
1 day ago
Sustain Engineer
Overview

This role is responsible for ensuring the overall stability of production applications, including the reliability, availability, scalability, and efficiency of our production systems and platforms. The Operations Engineer will collaborate with cross-functional teams (including Software Engineering, Service Reliability, Infrastructure, and Business Operations) to streamline processes, manage day-to-day operations, monitor system health, and quickly resolve incidents. The ideal candidate is skilled in problem-solving, process automation, and root cause analysis, with a passion for operational excellence and continuous improvement.

Responsibilities

- Monitor production systems, applications, and infrastructure to ensure high availability and performance.
- Troubleshoot and resolve operational issues, providing timely escalation and communication to stakeholders.
- Perform root cause analysis (RCA) and drive permanent fixes to recurring problems.
- Manage RTS & TTS, configuration changes, and production rollouts with minimal impact.
- Develop and maintain runbooks, standard operating procedures, and technical documentation.
- Automate operational workflows, monitoring, and reporting using scripts and tools.
- Collaborate with engineering teams to design for reliability, scalability, and operability.
- Support incident response, disaster recovery, and business continuity processes.
- Drive continuous improvement initiatives around system monitoring, alerting, and incident response.
- Ensure compliance with IT controls, security policies, and audit requirements.

Qualifications

- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
- 5+ years of experience in operations engineering, site reliability engineering, or systems administration.
- Strong knowledge of Linux/Unix and/or Windows server environments.
- Experience with monitoring and alerting tools (e.g., Grafana, Datadog, Splunk, Nagios).
- Proficiency in at least one scripting/programming language (e.g., Python, Bash, PowerShell).
- Familiarity with CI/CD pipelines, deployment automation, and configuration management (e.g., Jenkins).
- Understanding of networking fundamentals (DNS, TCP/IP, load balancing, firewalls).
- Hands-on experience with cloud platforms (AWS, Azure, GCP).