AI/HPC Infrastructure Engineer
Responsibilities
DSO National Laboratories (DSO) is Singapore’s largest defence research and development (R&D) organisation, with the critical mission to develop technological solutions to sharpen the cutting edge of Singapore's national security. At DSO, you will develop more than just a career. This is where you will make a real impact and shape the future of defence across the spectrum of air, land, sea, space and cyberspace.
The Digital Division leads the digital transformation of DSO through master planning and policies, delivering digital capabilities through IT infrastructure, and providing a one-stop service to the corporate and R&D Divisions. The Digital Division will transform the way we work, our workplace, and the capabilities we deliver to MINDEF/SAF and for the security of Singapore.
People are DSO’s greatest asset. You will get to realise your career aspirations and develop your own niche either as a deep technical expert or a leader in the team. With frequent career dialogues and a robust training and development framework, we will provide you with the necessary development tools for you to reach your potential. You will also be recognised and rewarded through competitive remuneration packages and scholarship opportunities.
AI/HPC Infrastructure Engineer
We are seeking an experienced AI/HPC Infrastructure Engineer to join our dynamic team. As an AI/HPC Infrastructure Engineer, you will play a crucial role in designing, implementing, and managing the infrastructure that supports our AI initiatives. Your expertise will contribute to the development, deployment, and scaling of AI models, ensuring their optimal performance and reliability.
In this role, you will be involved in:
- Infrastructure Design: Collaborate with cross-functional teams, including AI R&D engineers and software engineers, to design and continually enhance scalable and efficient on-premises AI infrastructure solutions to train and serve large AI models. Create, evolve and maintain the infrastructure roadmap aligned with the organisation's AI strategy.
- Scalability and Performance: Identify and address performance bottlenecks, latency issues, and scalability challenges in AI infrastructure. Leverage your expertise to optimise resource allocation and improve data processing pipelines.
- Monitoring and Maintenance: Establish robust monitoring systems to track the health, performance, and utilisation of AI infrastructure components. Proactively identify and resolve issues, ensuring high availability and reliability of AI systems.
- Security and Compliance: Implement security measures and best practices to protect AI infrastructure and data. Ensure compliance with relevant regulations, privacy standards, and industry best practices.
- Collaboration and Documentation: Work closely with cross-functional teams to understand their requirements and provide technical guidance. Document infrastructure configurations, processes, and troubleshooting procedures to enable efficient knowledge sharing and onboarding.
Requirements
- Degree in Computer Engineering, Computer Science or Artificial Intelligence
- Familiarity with cluster management tools (e.g., Bright), data processing frameworks (e.g., Apache Spark, Apache Beam), machine learning frameworks (e.g., TensorFlow, PyTorch), networking for HPC applications, containerisation technologies (e.g., Docker, Kubernetes) and HPC scheduling
- Infrastructure Optimisation: Experience in optimising infrastructure for performance, scalability, and cost-efficiency. Knowledge of distributed systems, network architecture, and storage technologies for AI and/or HPC.
- Problem-Solving Abilities: Demonstrated ability to analyse complex problems, propose innovative solutions, and implement them effectively. Strong troubleshooting and debugging skills to resolve infrastructure-related issues.
- Collaboration and Communication: Excellent interpersonal skills with the ability to collaborate effectively in a team environment. Strong verbal and written communication skills to convey technical concepts to both technical and non-technical stakeholders.