Title: HPC System Administrator
Location: Los Angeles, CA (2 days/week Onsite)
Duties:
Serve as a technical expert and assume full responsibility for the OARC HPC backup environment.
- Design, implement, enhance, and maintain a backup system supporting OARC's high performance research cluster storage.
- Project needs for hardware and software upgrades required to maintain reliability in the backup environment.
- Monitor the state of the backup system, ensure proper backup coverage, and ensure backup targets (RTO and RPO) are being met.
- Analyze file-level metadata pertaining to the Hoffman2 storage system, and develop reports that communicate the current state of the backup system.
- Write scripts for management and reporting as necessary to integrate the backup system into the OARC HPC environment.
- Write SQL queries and construct well-designed database and table structures and schema to support the backup system, as needed.
- Document the backup system and train others on its use.
Assist in administration of multiple components of HPC cluster systems running Unix-like operating systems.
- Write and support scripts to automate routine system administration duties. (E)
- Install operating systems, system components, and applications in bare-metal, VM, and cloud-based environments. (E)
- Monitor systems for security problems, perform security audits, and take preemptive or corrective action as necessary. (E)
- Maintain and upgrade technical and operational documentation as it pertains to the HPC environment. (E)
- Address system- and user-related client issues submitted through support tickets. (E)
- Perform tasks related to physical infrastructure including computer servers and cabling. (E)
Maintain Knowledge:
- Maintain current knowledge of new programming languages and techniques, hardware and software architectures, network fabrics, storage systems, and other technologies that will or could impact the OARC high performance research computing environment. Propose ways to apply new technology and techniques to the OARC environment. (E)
- Maintain current knowledge of the OARC technical environment, including software tools, operating systems, system software, and the OARC network.
- Maintain current knowledge of security issues which might impact OARC systems. (E)
- Continue professional development, through self-directed study and special assignments, to maintain proficiency and currency in high performance computing applications at a level sufficient to provide for the continued advance of research computing at UCLA. (E)
Required Skill Set:
- Bachelor's or Master's degree in computer science, software engineering, or a related field.
- Minimum of three years of experience with software and applications development, Linux system administration, and two or more modern programming languages (e.g., Python, C++, Java).
- Expert knowledge of Python, SQL, bash, and git. Working knowledge of other common build systems, languages, and development tools is preferred.
- Demonstrated ability to create secure, technically sound, high-quality system scripts using recognized programming methodologies and practices that result in maintainable and usable software. Ability to utilize third-party and/or open-source libraries or tools where applicable.
- Detailed knowledge of Red Hat Enterprise Linux and related distributions.
- Demonstrated ability to troubleshoot and debug computing problems including, but not limited to: corrupted data, file management, application software, and operating system problems.
- Skill in responding to production problems in a storage backup environment quickly, accurately, independently, and with adequate follow-up.
- Demonstrated working knowledge of network file systems (NFS versions 3 and 4) and object storage (e.g., S3).
- Skill in developing and performing tests and evaluations of multiple complex software components including validation, verification, and disaster recovery capabilities.
- Ability to carry out benchmark, debugging, and testing processes in a clear, complete, and technically sound manner, make valid comparisons, and develop reports or analysis summarizing the process.
- Demonstrated skill in writing well-organized, complete, and technically and grammatically correct documents and procedures to be used by technical and non-technical personnel.
- Demonstrated oral communication and presentation skills sufficient to effectively obtain and impart technical information and explain concepts on a one-to-one basis as well as in meetings with or presentations to multiple clients.
- Demonstrated problem-solving skills and the ability to break down and define complex problems, formulate solutions, identify cause and effect relationships, make appropriate decisions, and communicate concepts clearly and appropriately.
- Ability to prioritize tasks, effectively manage time, estimate time and effort required for software tasks and projects, prepare project plans and schedules, and ensure tasks are completed on time.
- Demonstrated ability to work effectively both independently and as part of a multi-disciplinary team, and to follow through on assignments with minimal direction under the stress of frequent interruptions and distractions.
- Demonstrated skill in establishing and maintaining cooperative working relationships with staff, students and vendors. Ability to communicate and interact effectively with persons of diverse backgrounds.
- Ability to use, and demonstrated willingness to review, systems resources (e.g., documents, manuals, industry publications, vendor support resources) to maintain current knowledge of the information systems profession.
- Ability to work an alternate work schedule on short notice to resolve critical problems or comply with testing or maintenance schedules.
- Demonstrated working knowledge of HPC cluster architectures and concepts.
- Ability to lift objects weighing 50 to 60 pounds, such as computer servers.