Job Description

As our Site Reliability Engineer, you will design, build, and maintain the systems and infrastructure that power our applications, ensuring their reliability, scalability, and performance. You will bring a software engineering approach to operations, automating processes and continuously improving the infrastructure and tools that support our business needs.


What you’ll do:

  • Infrastructure Management: Design, implement, and maintain scalable and resilient infrastructure using Terraform for infrastructure as code, ensuring high availability and performance.
  • Kubernetes and Containers: Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker. Implement best practices for container orchestration and management.
  • Systems and Application Monitoring/Observability: Develop and maintain comprehensive monitoring and observability solutions using Datadog. Ensure detailed visibility into system performance and application health.
  • SLO and SLA Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure reliable and consistent service delivery.
  • Incident Response and Troubleshooting: Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence. Participate in post-incident reviews and contribute to blameless postmortems.
  • Reliability and Production Environment Management: Ensure the reliability and stability of our production environments. Continuously assess and improve system reliability, identifying and addressing potential points of failure.
  • Automation and Scripting: Develop automation scripts and tools in Python, Bash, or Go to reduce manual intervention and improve system reliability.
  • CI/CD Pipeline Management: Enhance and maintain continuous integration and continuous deployment pipelines using GitLab CI. Ensure seamless and reliable deployment processes.
  • Capacity Planning and Scaling: Assist in capacity planning and ensure that systems are scalable to meet future demands. Implement auto-scaling strategies where applicable.
  • Security and Compliance: Implement security best practices and ensure compliance with industry standards. Regularly review and update security policies and procedures.
  • Collaboration and Support: Work closely with development teams to ensure reliability and scalability of new features and services. Provide technical support and guidance on infrastructure-related issues.
  • Software Engineering for Operations: Develop and maintain internal tools and services that enhance the efficiency and reliability of our operations.
  • On-Call Rotation: Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

Salary

Competitive

Project-based

Remote Job

Worldwide

Job Overview
Job Posted:
7 months ago
Job Type
Contractual
Job Role
Any
Education
Any
Experience
Any
Total Vacancies
-

Location

United States