Join the Sustainable Talent team as a Senior Site Reliability Engineer supporting NVIDIA's Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with hybrid work options. We offer competitive pay of $75 - $90/hr, based on factors like experience, education, and location, and provide full benefits, PTO, and an amazing company culture!
As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.
What you'll be doing:
- Working on systems deployed in NVIDIA's internal cloud, keeping them available and reliable for our end users.
- Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
- Providing high-quality user support.
- Monitoring KPIs and making sure that the team's SLAs are met.
- Managing and maintaining production Kubernetes clusters.
- Driving automation of monitoring to gain more insight into application and system health.
- Crafting and implementing critical metrics using various analytics methods and dashboards.
- Applying AI techniques to extract useful signals about machines and jobs from the data generated.
What we need to see:
- Experience working with on-premises infrastructure.
- Experience managing and troubleshooting Linux systems.
- Experience managing systems installed in data centers. Proficient with BMC (Redfish), KVM, and IPMI tools.
- Background in SQL databases like MySQL and time-series databases like Prometheus.
- Strong knowledge of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs.
- Experience with data analytics/visualization tools like Kibana, Grafana, Splunk, etc.
- Strong Ansible or Jenkins skills.
- Proficient with Kubernetes, Docker, and virtualization.
- Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce, etc.
- Advanced knowledge of standard methodologies related to security.
- 5+ years of proven SRE experience.
- Experience with Python or Bash scripting.
- Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.
Ways to stand out from the crowd:
- Working knowledge of OpenStack.
- Previous experience with SRE teams managing on-prem infrastructure.
- Experience managing NVIDIA hardware like GPUs and Tegras.
- Thrives in a multi-tasking environment with constantly evolving priorities.
- Prior experience with a large-scale operations team.
- Experience with Windows server infrastructure.
- Outstanding interpersonal skills and communication with all levels of management.
- Experience using and improving data centers.
- Ability to break complex problems down into simple subproblems and then reuse available solutions to solve most of them.
- Ability to design simple systems that can work efficiently without needing much support.
Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.