Be a Part of Something BIG!
Make an Impact by
- Team Management:
  - Build and lead a high-performance engineering and operations team that fosters a culture of innovation, collaboration, and continuous improvement.
  - Set clear goals and objectives, mentor team members, and drive professional development initiatives.
- Operational Excellence:
  - Develop and implement operational strategies to ensure the reliability, scalability, and efficiency of our GPU Cloud services.
  - Collaborate with other departments to streamline processes, enhance customer experience, and meet service level agreements.
  - Support services across the GPU Cloud lifecycle (deployment, operation, and refinement) with monitoring, logging, and alerting.
  - Establish operations systems and processes (SOPs, EOPs, etc.) and manage daily operational issues.
  - Apply strong operational management skills to organise the entire Operations team and external vendors into an efficient and resilient ops setup.
- Infrastructure and Resource Management:
  - Manage the deployment, configuration, and maintenance of GPU clusters and associated infrastructure.
  - Optimise resource allocation to meet performance requirements and cost-effectiveness goals.
  - Build high-performance storage that complements the GPU Cloud so customers can submit and run large AI workloads.
  - Build a roadmap of software solutions that complement the GPU Cloud and reduce the overhead of AI job creation and execution for customers.
  - Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Security and Compliance:
  - Enforce best practices for security and compliance within the GPU Cloud environment.
  - Stay abreast of industry security trends and implement measures to safeguard customer data and platform integrity.
Skills for Success
- Experience administering Linux cluster systems (Ubuntu, CentOS/Red Hat) or hypervisors.
- Knowledge of GPU technologies and their integration into accelerated computing (GPU architectures, parallel distributed computation, and networking).
- Knowledge of RDMA network technology for GPUDirect RDMA (InfiniBand, kernel bypass, protocols, and topology).
- Ability to solve complex technical problems, with a proactive approach to system operation and optimisation.
- Experience in crafting, analysing, and fixing large-scale distributed systems.
- Good understanding of AI/ML software frameworks (libraries, NCCL, CUDA, open-source tools).
- Understanding of collective communication on GPU systems (intra-node and inter-node).
- Experience in system benchmarking and profiling for GPU clusters.
- Knowledge of storage systems (parallel distributed file systems, NFS, object storage).
Rewards that Go Beyond
- Flexible work arrangements
- Full suite of health and wellness benefits
- Ongoing training and development programs
- Internal mobility opportunities
We are committed to a safe and healthy environment for our employees & customers and will require all prospective employees to be fully vaccinated.
About Singtel
Size | More than 250 |
Industry | Integrated Telecommunication Services |
Location | Singapore |
Founded | 1 January 1879 |