With a career at The Home Depot, you can be yourself and also be part of something bigger.
Position Purpose:
The Staff Reliability Engineer – Observability is responsible for leading the design, implementation, and evolution of observability solutions that ensure the reliability, performance, and efficiency of our systems. As a Staff Reliability Engineer, you will be part of a dynamic team with engineers of all experience levels who help each other build and grow technical and leadership skills while creating, deploying, and supporting production applications.
As a Staff Reliability Engineer, you are expected to build and grow the skill sets of more junior engineers.
Key Responsibilities:
- 50% Delivery and Execution - Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide; Takes a broad view when approaching issues, using a global lens; Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production; Takes on new opportunities and tough challenges with a sense of urgency, high energy, and enthusiasm; Consistently achieves results, even under tough circumstances
- 10% Learns and Grows - Actively seeks ways to grow and be challenged using both formal and informal development channels; Learns through successful and failed experiments when tackling new problems
- 20% Plans and Aligns - Creates new and better ways for the organization to be successful; Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences; Works with the Product Team to ensure user stories are developer-ready, easy to understand, and testable; Collaborates with other team members in agile processes; Relates openly and comfortably with diverse groups of people; Adapts approach and demeanor in real time to match the shifting demands of different situations
- 20% Supports and Enables - Fields questions from product and engineering teams; Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions; Notes gaps on the team and suggests changes to make the team more productive
Direct Manager/Direct Reports:
- This position typically reports to Software Engineer Manager or Sr. Manager
- This position typically has 0 Direct Reports
Travel Requirements:
Physical Requirements:
- Most of the time is spent sitting in a comfortable position and there is frequent opportunity to move about. On rare occasions there may be a need to move or lift light articles.
Working Conditions:
- Located in a comfortable indoor area. Any unpleasant conditions would be infrequent and not objectionable.
Minimum Qualifications:
- Must be eighteen years of age or older.
- Must be legally permitted to work in the United States.
Preferred Qualifications:
- 3-5 years of relevant work experience in site reliability engineering or a related field
- Experience in monitoring and observability, including designing and implementing observability solutions using OpenTelemetry, Prometheus, and distributed tracing
- Proficiency in cloud platforms (GCP preferred) and infrastructure as code (Terraform, Ansible)
- Experience in programming languages such as Go, Python, and Java
- Experience with creating and executing unit, functional, destructive, and performance tests
- Experience with modern debugging and root cause analysis techniques
- Experience in designing systems for High Availability, Disaster Recovery, Performance, Efficiency, and Security
- Experience in leading observability initiatives, including defining instrumentation standards and building monitoring dashboards
- Hands-on experience implementing alerting thresholds and automated responses based on service level objectives (SLOs)
- Strong experience with Kubernetes cluster management, optimization, and scaling
- Expertise in container orchestration, including best practices for containerized application deployments and resource optimization
- Experience designing, building, and maintaining scalable cloud infrastructure on GCP
- Proficiency in automating routine operational tasks to reduce toil and improve efficiency
- Familiarity with integrating observability-driven alerts with incident management systems and leading incident response efforts
- Experience optimizing system performance, identifying and resolving bottlenecks, and conducting capacity planning
- Knowledge of database performance tuning, query optimization, and designing application stress testing methodologies
- Familiarity with service mesh technologies (Istio, Linkerd)
Minimum Education:
- The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job.
Preferred Education:
Minimum Years of Work Experience:
Preferred Years of Work Experience:
- No additional years of experience
Minimum Leadership Experience:
Preferred Leadership Experience:
Certifications:
Competencies:
- Global Perspective
- Manages Ambiguity
- Nimble Learning
- Self-Development
- Collaborates
- Cultivates Innovation
- Situational Adaptability
- Communicates Effectively
- Drives Results
- Interpersonal Savvy
For California, Colorado, Connecticut, Rhode Island, Nevada, New York City, Ithaca (NY), Westchester County (NY), and Washington residents:
The pay range for this position is between $120,000 and $190,000