Director, Operations (Cloud)

Salary undisclosed

Be a Part of Something BIG!

Make an Impact by

  • Team Management:
      • Build and lead a high-performance engineering and operations team to foster a culture of innovation, collaboration, and continuous improvement.
      • Set clear goals and objectives, mentor team members, and drive professional development initiatives.
  • Operational Excellence:
      • Develop and implement operational strategies to ensure the reliability, scalability, and efficiency of our GPU Cloud services.
      • Collaborate with other departments to streamline processes, enhance customer experience, and meet service level agreements.
      • Support services and improve the GPU Cloud lifecycle with monitoring, logging, and alerting across deployment, operation, and refinement.
      • Establish operations systems and processes (SOPs, EOPs, etc.) and manage daily operational issues.
      • Apply strong operational management skills to organise the entire Operations team and external vendors for an efficient and resilient ops setup.
  • Infrastructure and Resource Management:
      • Manage the deployment, configuration, and maintenance of GPU clusters and associated infrastructure.
      • Optimize resource allocation to meet performance requirements and cost-effectiveness goals.
      • Build high-performance storage that complements the GPU Cloud so customers can submit and run large AI workloads.
      • Build a roadmap of software solutions that complement the GPU Cloud and remove the overhead of AI job creation and execution for customers.
      • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Security and Compliance:
      • Enforce best practices for security and compliance within the GPU Cloud environment.
      • Stay abreast of industry security trends and implement measures to safeguard customer data and platform integrity.

Skills For Success

  • Experience administering Linux cluster systems (Ubuntu, CentOS/Red Hat) or hypervisors.
  • Knowledge of GPU technologies and their integration into accelerated computing (GPU architectures, parallel distributed computation, and networking).
  • Knowledge of RDMA networking for GPUDirect RDMA (InfiniBand, kernel bypass, protocols, topology).
  • Ability to solve complex technical problems, with a proactive approach to system operation and optimization.
  • Experience designing, analysing, and troubleshooting large-scale distributed systems.
  • Good understanding of AI/ML software frameworks (libraries, NCCL, CUDA, open-source software).
  • Understanding of collective communication on GPU systems (intra-node and inter-node).
  • Experience in system benchmarking and profiling for GPU clusters.
  • Knowledge of storage systems (parallel distributed file systems, NFS, object storage).

Rewards that Go Beyond

  • Flexible work arrangements
  • Full suite of health and wellness benefits
  • Ongoing training and development programs
  • Internal mobility opportunities

Your Career Growth Starts Here. Apply Now!

We are committed to a safe and healthy environment for our employees and customers, and require all prospective employees to be fully vaccinated.