With a career at The Home Depot, you can be yourself and also be part of something bigger.

Position Purpose:

The Staff Reliability Engineer – Observability is responsible for leading the design, implementation, and evolution of observability solutions that ensure the reliability, performance, and efficiency of our systems. As a Staff Reliability Engineer, you will be part of a dynamic team with engineers of all experience levels who help each other build and grow technical and leadership skills while creating, deploying, and supporting production applications.

As a Staff Reliability Engineer, you are expected to build and grow the skill sets of more junior engineers.


Key Responsibilities:

  • 50% Delivery and Execution - Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide; Takes a broad view when approaching issues, using a global lens; Consistently achieves results, even under tough circumstances; Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production; Takes on new opportunities and tough challenges with a sense of urgency, high energy, and enthusiasm
  • 10% Learns and Grows - Actively seeks ways to grow and be challenged using both formal and informal development channels; Learns through successful and failed experiments when tackling new problems
  • 20% Plans and Aligns - Creates new and better ways for the organization to be successful; Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences; Works with the Product Team to ensure user stories are developer ready, easy to understand, and testable; Collaborates with other team members in agile processes; Relates openly and comfortably with diverse groups of people; Adapts approach and demeanor in real time to match the shifting demands of different situations
  • 20% Supports and Enables - Fields questions from product and engineering teams; Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions; Notes gaps on the team and provides suggestions for changes to make the team more productive


Direct Manager/Direct Reports:

  • This position typically reports to Software Engineer Manager or Sr. Manager
  • This position typically has 0 Direct Reports


Travel Requirements:

  • No travel required.


Physical Requirements:

  • Most of the time is spent sitting in a comfortable position and there is frequent opportunity to move about. On rare occasions there may be a need to move or lift light articles.


Working Conditions:

  • Located in a comfortable indoor area. Any unpleasant conditions would be infrequent and not objectionable.


Minimum Qualifications:

  • Must be eighteen years of age or older.
  • Must be legally permitted to work in the United States.


Preferred Qualifications:

  • 3-5 years of relevant work experience in site reliability engineering or related field
  • Experience in monitoring and observability, including designing and implementing observability solutions using OpenTelemetry, Prometheus, and distributed tracing
  • Proficiency in cloud platforms (GCP preferred) and infrastructure as code (Terraform, Ansible)
  • Experience in programming languages such as Go, Python, and Java
  • Experience with creating and executing unit, functional, destructive, and performance tests
  • Experience with modern debugging and root cause analysis techniques
  • Experience in designing systems for High Availability, Disaster Recovery, Performance, Efficiency, and Security
  • Experience in leading observability initiatives, including defining instrumentation standards and building monitoring dashboards
  • Hands-on experience implementing alerting thresholds and automated responses based on service level objectives (SLOs)
  • Strong experience with Kubernetes cluster management, optimization, and scaling
  • Expertise in container orchestration, including best practices for containerized application deployments and resource optimization
  • Experience designing, building, and maintaining scalable cloud infrastructure on GCP
  • Proficiency in automating routine operational tasks to reduce toil and improve efficiency
  • Familiarity with integrating observability-driven alerts with incident management systems and leading incident response efforts
  • Experience optimizing system performance, identifying and resolving bottlenecks, and conducting capacity planning
  • Knowledge of database performance tuning, query optimization, and designing application stress testing methodologies
  • Familiarity with service mesh technologies (Istio, Linkerd)


Minimum Education:

  • The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job.


Preferred Education:

  • No additional education


Minimum Years of Work Experience:

  • 3


Preferred Years of Work Experience:

  • No additional years of experience


Minimum Leadership Experience:

  • None


Preferred Leadership Experience:

  • None


Certifications:

  • None


Competencies:

  • Global Perspective
  • Manages Ambiguity
  • Nimble Learning
  • Self-Development
  • Collaborates
  • Cultivates Innovation
  • Situational Adaptability
  • Communicates Effectively
  • Drives Results
  • Interpersonal Savvy

For California, Colorado, Connecticut, Rhode Island, Nevada, New York City, Ithaca (NY), Westchester County (NY), and Washington residents:

The pay range for this position is between $120,000 and $190,000.

Location

United States