Description:
A Site Reliability Engineer (SRE) is an advanced DevOps role that combines software engineering and systems administration to ensure the scalability, performance, and reliability of large-scale, cloud-based applications and infrastructure.
An SRE is responsible for proactively detecting issues, handling failures automatically, preparing disaster recovery plans, keeping systems up and reliable, and repairing broken systems so that they do not cause future disruptions.
Primary Responsibilities:
Ensure system reliability and availability
- Monitor system issues.
- Create strategies to detect issues.
- Address those issues.
- Design systems to troubleshoot automatically.
- Write and review post-mortems.
Mitigate operational risks
- Collaborate with development teams and other stakeholders to identify potential risks.
- Once risks are identified, analyze and evaluate their potential impact and likelihood of occurrence.
- Based on the risk assessment, implement appropriate risk mitigation strategies.
- Continuously monitor and review the effectiveness of risk mitigation strategies.
Monitor system health
- Study historical performance trends using metrics, charts, and graphs.
- Trace problems using system monitoring tools.
- Monitor log files to manage infrastructure at scale.
- Minimize the need for emergency response.
Maintain internal tooling
- GitHub workflows
- AWS
- Jenkins
- Jira
Other tasks as assigned
Requirements:
- Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience).
- Proven experience in designing, building, and operating large-scale distributed systems or cloud-based infrastructure.
- Proficiency in scripting and programming languages such as Python, Go, or Shell scripting.
- Deep understanding of networking and distributed systems.
- Experience working in cloud computing environments (e.g., AWS, GCP).
- Hands-on experience with containerization technologies (e.g., Docker, Kubernetes) and microservices architecture.
- Strong knowledge of infrastructure as code principles and tools (e.g., Terraform, Ansible, Chef).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Excellent problem-solving skills and a proactive approach to troubleshooting complex issues.
- Effective communication skills and the ability to collaborate with cross-functional teams in a fast-paced environment.
Must reside in one of the following states to qualify:
AR, AZ, CA, CO, FL, GA, ID, IL, KS, MN, SC, TN, WA