Nashua, NH, 03061, USA
10 days ago
Senior Site Reliability Engineer
**Job Description**

**Job Summary**

Thank you for your interest in BAE Systems! We're looking for a seasoned Senior Site Reliability Engineer (SRE) to ensure the reliable deployment, operation, and continuous improvement of our digital engineering software tools across BAE Systems factories in North America. The role blends deep technical expertise with strong leadership, guiding cross-functional teams to keep our mission-critical microservices, monitoring stacks, and data stores healthy and performant.

**Key Responsibilities**

+ Monitor, troubleshoot, and resolve production incidents, ensuring rapid root-cause analysis and long-term fixes.
+ Design, build, and maintain automated deployment pipelines for the digital engineering software suite using asset/inventory management tools.
+ Deploy, configure, and operate the observability stack (Prometheus, Grafana, FluentBit, Loki) to provide real-time metrics, logs, and tracing for all services.
+ Monitor and troubleshoot PostgreSQL database health, performance, and replication issues; implement automated alerts and remediation.
+ Use Consul to service-discover and health-check gRPC microservices; ensure service mesh reliability and failover handling.
+ Define and track SLIs/SLOs, error budgets, and reliability targets for each factory site; drive root-cause analysis and postmortems for incidents.
+ Lead incident response, on-call rotations, and runbooks; mentor junior engineers in debugging distributed systems.
+ Collaborate with software developers, factory operations, and external vendors to embed reliability into the software development lifecycle.
+ Evaluate emerging tools and technologies that can improve observability, automation, or performance while staying aligned with our on-premise strategy (no public cloud platforms).
+ Automate operational tasks and create self-service tooling to reduce manual overhead.

**Hybrid:** Because reliable operation often requires on-site collaboration with factory teams and access to physical infrastructure, the role will be primarily on-site at key manufacturing locations, with the flexibility to work remotely for tasks that do not require direct interaction with factory hardware.

**Required Education, Experience, & Skills**

+ Bachelor's degree in Computer Science, Electrical Engineering, or a related field.
+ Minimum 4 years of experience in site reliability, DevOps, or systems engineering within a high-volume, multi-site manufacturing or industrial environment.
+ Deep expertise in Windows systems, networking, and version-control workflows.
+ Experience with observability tools: Prometheus, Grafana, FluentBit, Loki.
+ Proficiency in automation/orchestration tools such as Ansible (or equivalent inventory-management solutions).
+ Strong scripting/programming skills (Python or similar) for building custom monitoring and remediation logic.
+ Excellent communication, problem-solving, and documentation abilities; comfortable working in a fast-paced, deadline-driven environment.

**Preferred Education, Experience, & Skills**

+ Experience with Industry 4.0 and digital transformation initiatives in manufacturing.
+ Prior work integrating on-premise monitoring stacks with microservice architectures.
+ Experience monitoring and maintaining PostgreSQL databases in production.
+ Familiarity with service discovery and health checking using Consul, especially for gRPC services.
+ Strong grasp of data collection, management, and analysis, including:
  + Data collection and integration from various sources
  + Data management and storage solutions
  + Data analysis and visualization techniques
  + Data-driven decision-making and problem-solving

**Pay Information**

Full-Time Salary Range: $97,008 - $164,914

Please note: This range is based on our market pay structures. However, individual salaries are determined by a variety of factors including, but not limited to: business considerations, local market conditions, and internal equity, as well as candidate qualifications, such as skills, education, and experience.

Employee Benefits: At BAE Systems, we support our employees in all aspects of their life, including their health and financial well-being. Regular employees scheduled to work 20 hours per week are offered: health, dental, and vision insurance; health savings accounts; a 401(k) savings plan; disability coverage; and life and accident insurance. We also have an employee assistance program, a legal plan, and other perks including discounts on things like home, auto, and pet insurance. Our leave programs include paid time off, paid holidays, as well as other types of leave, including paid parental, military, bereavement, and any applicable federal and state sick leave. Employees may participate in the company recognition program to receive monetary or non-monetary recognition awards. Other incentives may be available based on position level and/or job specifics.

**Senior Site Reliability Engineer**

**120557BR**

EEO Career Site Equal Opportunity Employer. Minorities / females / veterans / individuals with disabilities / sexual orientation / gender identity / gender expression