Design an LLM serving infrastructure that hosts multiple large language models and serves inference requests at scale. The system must support continuous batching, KV-cache management, and efficient GPU scheduling to maximize throughput while meeting latency SLAs (a minimal scheduler sketch follows below).

Key features:
- Serve inference requests for multiple LLM variants concurrently.
- Stream tokens back to clients as they are generated.
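To make the continuous-batching and streaming requirements concrete, here is a minimal sketch of the scheduler core for a single GPU worker. All names (`ContinuousBatcher`, `_forward_one_token`, the 64-request batch budget) are hypothetical illustrations, not the API of any real engine such as vLLM or TensorRT-LLM; the model call is a stand-in. The point is the scheduling shape: new requests join the running batch between decode steps rather than waiting for the whole batch to drain, and each generated token is pushed to a per-request queue so it can stream to the client immediately.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)
    # Tokens are pushed here as they are produced, so a handler
    # can stream them to the client (e.g. over SSE or gRPC).
    out_queue: asyncio.Queue = field(default_factory=asyncio.Queue)

class ContinuousBatcher:
    """Admits new requests between decode steps instead of waiting for
    the whole batch to finish -- the core idea of continuous batching."""

    def __init__(self, max_batch_size: int = 64):
        self.waiting: asyncio.Queue[Request] = asyncio.Queue()
        self.running: list[Request] = []
        self.max_batch_size = max_batch_size

    async def submit(self, req: Request) -> None:
        await self.waiting.put(req)

    async def step_loop(self) -> None:
        while True:
            # Admit queued requests up to the batch-size budget.
            while len(self.running) < self.max_batch_size and not self.waiting.empty():
                self.running.append(self.waiting.get_nowait())
            if not self.running:
                await asyncio.sleep(0.001)
                continue
            # One decode step yields one token per in-flight request.
            for req in self.running:
                token = self._forward_one_token(req)
                req.generated.append(token)
                await req.out_queue.put(token)  # stream immediately
            # Retire finished requests; their slots free up for waiting ones.
            self.running = [r for r in self.running
                            if len(r.generated) < r.max_new_tokens]

    def _forward_one_token(self, req: Request) -> int:
        return 0  # stand-in for a batched GPU forward pass
```

A real scheduler would also track KV-cache blocks per request and refuse admission when the cache pool is exhausted; that constraint is sized in the sketch after the scale figures below.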
Scale requirements:
- Concurrent requests: 50K
- GPU fleet: 1K H100 GPUs
- Models hosted: 10-20 variants
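These figures invite a back-of-envelope KV-cache memory check. The sketch below assumes a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16) and an average of 4,096 cached tokens per request; none of these model dimensions come from the challenge statement, they are illustrative assumptions.

```python
# Back-of-envelope KV-cache sizing under assumed model dimensions.
LAYERS = 80          # assumed transformer layers (70B-class model)
KV_HEADS = 8         # assumed KV heads (grouped-query attention)
HEAD_DIM = 128       # assumed head dimension
DTYPE_BYTES = 2      # fp16/bf16
AVG_CONTEXT = 4096   # assumed average cached tokens per request

CONCURRENT = 50_000  # from the challenge statement
GPUS = 1_000         # from the challenge statement
HBM_PER_GPU_GIB = 80 # H100 80GB

# K and V tensors, per layer, per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
kv_gib_per_request = kv_bytes_per_token * AVG_CONTEXT / 2**30

requests_per_gpu = CONCURRENT / GPUS
kv_gib_per_gpu = requests_per_gpu * kv_gib_per_request

print(f"{kv_bytes_per_token / 2**10:.0f} KiB per cached token")   # ~320 KiB
print(f"{kv_gib_per_request:.2f} GiB KV cache per request")       # ~1.25 GiB
print(f"{requests_per_gpu:.0f} requests per GPU on average")      # 50
print(f"{kv_gib_per_gpu:.1f} GiB KV cache per GPU "
      f"vs {HBM_PER_GPU_GIB} GiB HBM")                             # ~62.5 GiB
```

Under these assumptions, KV cache alone averages roughly 62.5 GiB of each 80 GiB H100 before any model weights are placed, which is why paged KV-cache allocation and cache-aware admission control feature in most serving designs at this scale.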