
HPC System Engineer (System), NSCC

Salary undisclosed

Job Summary

The HPC System Engineer will be responsible for managing, monitoring and optimizing the operation of the supercomputing system. This role involves collaborating with various research and technical teams to optimize HPC resource utilization. Candidates with demonstrated experience in the HPC field may be considered for a Senior position.

Roles And Responsibilities

System administration and optimization

  • Work with Managed Services teams in managing and administering HPC systems, including servers, storage, and internal network components.
  • Ensure the reliability and availability of HPC infrastructure.
  • Provide support on technical queries and troubleshoot HPC-related problems.
  • Implement best practices for system monitoring and reporting.
  • Develop utility tools to support monitoring, tuning, and troubleshooting activities.
  • Document incident details, resolution, and lessons learned to enhance future problem-solving.
  • Implement security measures and monitoring to protect HPC systems.
  • Conduct regular security checks and assessments within the HPC system infrastructure.
  • Monitor system performance and optimize it through tuning and troubleshooting.

Resource and workload management

  • Monitor HPC resource utilization.
  • Develop and evaluate policies for allocating HPC resources.
  • Optimize job scheduling to maximize resource utilization.

Designing and planning

  • Assess future computational requirements and plan for system expansion.
  • Assist in designing future HPC system acquisitions.
  • Study and evaluate emerging technologies and trends, including but not limited to:
      • processors and accelerators
      • interconnect technology
      • storage solutions
      • programming models

Qualifications

  • Degree in Computer Science, Engineering, IT or another relevant area.
  • At least 3 years of experience in managing HPC systems.
  • Highly proficient in UNIX/Linux environments and the command line interface (CLI).
  • Experience with cluster management software (xCAT, BCM, PHPC, HPCM).
  • Experience with job scheduling and workload management software (Slurm or PBS Pro).
  • Strong knowledge of HPC storage principles and experience in managing parallel file systems (Lustre, GPFS, BeeGFS).
  • Strong knowledge of RDMA-based interconnects (InfiniBand, RoCE).
  • Understanding of basic network protocols such as DHCP, DNS, TFTP and SMTP.
  • Good knowledge of scripting languages like Python, Bash or Perl.
  • Demonstrated ability to analyse complex issues and develop effective solutions.
  • To be considered for a Senior position, candidates should have at least 5 years of experience in roles that involve the deployment of HPC systems, covering key areas such as designing, installing, configuring, documenting and providing admin/user training.