Role: Site Reliability Engineer (SRE)
Role Type: Full Time
Location: Remote or Southern California
Key Competencies: Maintaining production systems and services, designing systems and solutions, incident monitoring, Infrastructure as Code (IaC), cloud technology, scripting, monitoring, containerization, and DevOps.
Job Summary:
As a Site Reliability Engineer, you will play a critical role in ensuring the stability, reliability, and performance of our production systems and services. You will work closely with software engineering, DevOps, and IT operations teams to design and build scalable, reliable, and efficient systems that support our business operations. This role combines software development, systems engineering, and operational expertise to keep our applications running smoothly.
Responsibilities:
System Reliability & Performance: Design, implement, and maintain solutions to improve the reliability, scalability, and performance of production systems.
Monitoring & Incident Response: Set up monitoring, alerting, and incident response systems to detect, troubleshoot, and resolve production issues proactively.
Automation & Infrastructure as Code (IaC): Develop and maintain automation scripts to manage infrastructure, deployment, and routine tasks, minimizing human intervention.
Capacity Planning & Scaling: Collaborate with cross-functional teams to manage system capacity planning and scaling, ensuring our systems meet current and future demands.
System Health & Troubleshooting: Monitor system health, troubleshoot issues, and address service failures, latency issues, and performance bottlenecks.
On-Call Support: Participate in the on-call rotation for monitoring and support of production systems.
Documentation: Maintain detailed documentation for system designs, processes, and procedures to support team knowledge sharing and continuity.
Requirements:
Educational Background: Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience: Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
Technical Skills:
Proficiency with cloud platforms (e.g., AWS, GCP, Azure).
Strong scripting and automation skills (e.g., Python, Bash, Ansible).
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Splunk).
Knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
Familiarity with Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation).
Soft Skills: Strong problem-solving skills, excellent communication skills, and a team-oriented mindset.
Preferred Qualifications:
Experience with CI/CD pipelines and DevOps best practices.
Familiarity with security best practices for system reliability.
Certifications in cloud technologies or DevOps practices are a plus.
Job Type: Full-time
Pay: $110,000.00 - $140,000.00 per year
Benefits:
401(k)
Dental insurance
Health insurance
Paid time off
Vision insurance
Compensation Package:
Yearly bonus
Schedule:
8-hour shift
Monday to Friday
Education:
Bachelor's (Required)
Experience:
Site Reliability Engineer (SRE): 2 years (Required)
DevOps: 1 year (Required)
Cloud platforms: 2 years (Required)
IT: 4 years (Required)
License/Certification:
Cloud Certification (Preferred)
Work Location: Remote