TL;DR — GPU Cluster Manager

Design a GPU cluster manager that schedules and manages ML workloads across a heterogeneous fleet of GPU nodes. The system supports multi-tenant fair-share scheduling, spot instance management, gang scheduling for distributed training, and efficient GPU utilization through fractional allocation. Key features include submitting training and inference jobs with GPU requirements, and multi-tenant fair-share scheduling with per-team quotas.
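Fractional allocation can be illustrated with a small sketch. This is one possible approach rather than a prescribed design: it assumes each GPU is tracked as an exact fraction of one device and uses a best-fit rule. The names `Gpu` and `allocate_fraction` are hypothetical, and real isolation between co-located jobs (memory, compute) is out of scope here.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Gpu:
    """One physical GPU; free capacity is an exact fraction of the device."""
    gpu_id: str
    free: Fraction = Fraction(1)
    tenants: list = field(default_factory=list)


def allocate_fraction(gpus, job_id, share):
    """Best-fit placement (an assumed policy, not the only option).

    Picks the GPU with the least free capacity that still fits `share`,
    which tends to keep whole GPUs available for larger jobs.
    Returns the chosen gpu_id, or None if no GPU has enough room.
    """
    share = Fraction(share)
    candidates = [g for g in gpus if g.free >= share]
    if not candidates:
        return None
    best = min(candidates, key=lambda g: g.free)
    best.free -= share
    best.tenants.append(job_id)
    return best.gpu_id
```

With two empty GPUs, a half-GPU job followed by a quarter-GPU job would both land on the same device, leaving the second GPU whole for a later full-GPU request.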

HARD, 60 min

GPU Cluster Manager

Tags: job scheduling, multi-tenancy, spot instances, GPU management

Key Points

Submit training and inference jobs with GPU requirements
Multi-tenant fair-share scheduling with per-team quotas
Gang scheduling for distributed multi-node training
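The key points above can be sketched as a single scheduling pass. This is a minimal illustration under stated assumptions, not a reference implementation: fair share is approximated by serving the team with the lowest usage-to-quota ratio first, quotas are hard caps greater than zero, each worker occupies its own node, and a gang is placed only if every worker fits at once. `Job` and `schedule` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    team: str
    workers: int          # nodes the job needs simultaneously
    gpus_per_worker: int  # GPUs required on each of those nodes


def schedule(pending, free_gpus, quota, usage):
    """Return {job_id: [node, ...]} for jobs admitted in this pass.

    pending   : list of Job in submission order
    free_gpus : dict node -> free GPU count (mutated on placement)
    quota     : dict team -> GPU quota (assumed > 0)
    usage     : dict team -> GPUs currently held (mutated on placement)
    """
    placements = {}
    remaining = list(pending)
    while remaining:
        # Fair share: the team furthest below its quota goes first.
        # The sort is stable, so ties keep submission order.
        remaining.sort(key=lambda j: usage.get(j.team, 0) / quota[j.team])
        job = remaining.pop(0)
        need = job.workers * job.gpus_per_worker
        if usage.get(job.team, 0) + need > quota[job.team]:
            continue  # would exceed the team quota; skip for this pass
        # Gang scheduling: find nodes for all workers before committing any.
        nodes = [n for n, free in free_gpus.items()
                 if free >= job.gpus_per_worker]
        if len(nodes) < job.workers:
            continue  # a partial gang would hold GPUs idle; place nothing
        chosen = nodes[:job.workers]
        for n in chosen:
            free_gpus[n] -= job.gpus_per_worker
        usage[job.team] = usage.get(job.team, 0) + need
        placements[job.job_id] = chosen
    return placements
```

Skipped jobs simply wait for the next pass. A fuller design would also need preemption for spot capacity and backfilling so that small jobs do not starve large gangs, both of which this sketch leaves out.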

Key Constraints

GPU nodes: 5K+
GPU types: A100, H100, L40S
Concurrent jobs: 10K

