HPC System Administrator
Job Description
We are seeking a skilled HPC System Administrator to manage and maintain high-performance computing (HPC) systems. The ideal candidate will be responsible for system administration, user support, software integration, and collaboration with research teams to optimize computational workflows.
Key Responsibilities:
1. HPC System Management and Maintenance
Install, configure, integrate, and maintain high-performance compute clusters and associated hardware
Monitor system performance, troubleshoot issues, and ensure security compliance
Process and document change management procedures
2. User Support and Consultation
Assist users with computational jobs and optimize workflows for efficient resource utilization
Provide training sessions and resolve user issues related to HPC environments
3. Software and Application Support
Install, configure, and maintain scientific and engineering HPC software solutions
Support software development for parallel computing and performance optimization
4. Collaboration with Research Teams
Understand research project requirements and recommend appropriate HPC solutions
Assist in designing and optimizing computational workflows for researchers
5. Resource Allocation and Scheduling
Manage resource allocation and job scheduling within the HPC environment
Implement policies for job queuing, resource limits, and workload balancing
Enforce operational best practices and implementation plans
6. System and Network Optimization
Configure and maintain high-speed networks for optimal data transfer within the HPC infrastructure
Conduct performance benchmarking and optimization efforts
7. Documentation and Reporting
Maintain detailed system documentation, configuration guides, and user manuals
Generate reports on system performance, resource utilization, and operational efficiency
Qualifications and Skills:
Strong experience with HPC system administration, Linux-based environments, and cluster management tools.
Proficiency in job scheduling and resource management frameworks (e.g., Slurm, PBS, Grid Engine).
Hands-on experience with networking protocols, security policies, and data transfer optimizations.
Familiarity with scientific computing software and parallel programming techniques.
Ability to troubleshoot complex system and application issues effectively.
Strong communication skills to collaborate with researchers and support teams.