Design a GPU cluster manager that schedules and manages ML workloads across a heterogeneous fleet of GPU nodes. The system supports multi-tenant fair-share scheduling, spot instance management, gang scheduling for distributed training, and efficient GPU utilization through fractional allocation.

Key features:
- Submit training and inference jobs with GPU requirements.
- Multi-tenant fair-share scheduling with per-team quotas.
GPU nodes: 5K+
GPU types: A100, H100, L40S
Concurrent jobs: 10K
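
To make the scheduling requirements concrete, here is a minimal sketch of how fair-share ordering, per-team quotas, gang placement, and fractional GPU requests could fit together. Everything in it is an assumption for illustration, not part of the problem statement: the `Node` and `Job` classes, the `place_gang` and `schedule` functions, the usage-to-quota ratio as the fairness metric, and the best-fit node choice are all hypothetical. Spot instance handling and preemption are left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpu_type: str      # e.g. "A100", "H100", "L40S"
    free_gpus: float   # fractional capacity is allowed (e.g. 0.5 GPU)


@dataclass
class Job:
    name: str
    team: str
    gpu_type: str
    workers: int            # gang size: all workers must start together
    gpus_per_worker: float  # may be fractional for small inference jobs


def place_gang(job, nodes):
    """Return one node name per worker, or None if the whole gang cannot fit.

    Works on a copy of the free capacity, so a failed attempt leaves the
    nodes untouched (all-or-nothing placement).
    """
    free = {n.name: n.free_gpus for n in nodes if n.gpu_type == job.gpu_type}
    placement = []
    for _ in range(job.workers):
        candidates = [name for name, f in free.items() if f >= job.gpus_per_worker]
        if not candidates:
            return None
        # Best fit: use the node with the least free capacity that still fits,
        # which keeps larger holes available for larger workers.
        chosen = min(candidates, key=lambda name: (free[name], name))
        free[chosen] -= job.gpus_per_worker
        placement.append(chosen)
    return placement


def schedule(jobs, nodes, quotas, usage):
    """Greedy fair-share pass; returns {job name: [node name per worker]}.

    quotas: GPUs each team may hold at once (hypothetical hard cap).
    usage:  GPUs each team currently holds; updated in place.
    """
    by_name = {n.name: n for n in nodes}
    scheduled = {}
    pending = list(jobs)
    while pending:
        # Fair share: teams furthest below their quota go first.
        # The sort is stable, so ties keep submission order.
        pending.sort(key=lambda j: usage.get(j.team, 0.0) / quotas[j.team])
        progressed = False
        for job in pending:
            demand = job.workers * job.gpus_per_worker
            if usage.get(job.team, 0.0) + demand > quotas[job.team]:
                continue  # would exceed the team's quota
            placement = place_gang(job, nodes)
            if placement is None:
                continue  # gang does not fit right now
            # Commit the placement only after every worker has a slot.
            for node_name in placement:
                by_name[node_name].free_gpus -= job.gpus_per_worker
            usage[job.team] = usage.get(job.team, 0.0) + demand
            scheduled[job.name] = placement
            pending.remove(job)
            progressed = True
            break  # re-sort, since the team's usage ratio changed
        if not progressed:
            break
    return scheduled
```

A real scheduler at the stated scale (5K+ nodes, 10K concurrent jobs) would need indexed capacity lookups rather than a linear scan per worker, and would likely treat quotas as soft limits with borrowing and preemption. The sketch only shows how the four requirements interact in one scheduling pass.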