Design an LLM serving infrastructure that hosts multiple large language models and serves inference requests at scale. The system must support continuous batching, KV-cache management, and efficient GPU scheduling to maximize throughput while meeting latency SLAs (a minimal scheduler sketch follows below).

Key features:
- Serve inference requests for multiple LLM variants concurrently.
- Stream tokens back to clients as they are generated.
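To make the continuous-batching and streaming requirements concrete, here is a minimal sketch of the scheduler core for a single GPU worker. All names (`ContinuousBatcher`, `_forward_one_token`, the 64-request batch budget) are hypothetical illustrations, not the API of any real engine such as vLLM or TensorRT-LLM; the model call is a stand-in. The point is the scheduling shape: new requests join the running batch between decode steps rather than waiting for the whole batch to drain, and each generated token is pushed to a per-request queue so it can stream to the client immediately.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)
    # Tokens are pushed here as they are produced, so a handler
    # can stream them to the client (e.g. over SSE or gRPC).
    out_queue: asyncio.Queue = field(default_factory=asyncio.Queue)

class ContinuousBatcher:
    """Admits new requests between decode steps instead of waiting for
    the whole batch to finish -- the core idea of continuous batching."""

    def __init__(self, max_batch_size: int = 64):
        self.waiting: asyncio.Queue[Request] = asyncio.Queue()
        self.running: list[Request] = []
        self.max_batch_size = max_batch_size

    async def submit(self, req: Request) -> None:
        await self.waiting.put(req)

    async def step_loop(self) -> None:
        while True:
            # Admit queued requests up to the batch-size budget.
            while len(self.running) < self.max_batch_size and not self.waiting.empty():
                self.running.append(self.waiting.get_nowait())
            if not self.running:
                await asyncio.sleep(0.001)
                continue
            # One decode step yields one token per in-flight request.
            for req in self.running:
                token = self._forward_one_token(req)
                req.generated.append(token)
                await req.out_queue.put(token)  # stream immediately
            # Retire finished requests; their slots free up for waiting ones.
            self.running = [r for r in self.running
                            if len(r.generated) < r.max_new_tokens]

    def _forward_one_token(self, req: Request) -> int:
        return 0  # stand-in for a batched GPU forward pass
```

A real scheduler would also track KV-cache blocks per request and refuse admission when the cache pool is exhausted; that constraint is sized in the sketch after the scale figures below.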
Scale requirements:
- Concurrent requests: 50K
- GPU fleet: 1K H100 GPUs
- Models hosted: 10-20 variants
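These figures invite a back-of-envelope KV-cache memory check. The sketch below assumes a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16) and an average of 4,096 cached tokens per request; none of these model dimensions come from the challenge statement, they are illustrative assumptions.

```python
# Back-of-envelope KV-cache sizing under assumed model dimensions.
LAYERS = 80          # assumed transformer layers (70B-class model)
KV_HEADS = 8         # assumed KV heads (grouped-query attention)
HEAD_DIM = 128       # assumed head dimension
DTYPE_BYTES = 2      # fp16/bf16
AVG_CONTEXT = 4096   # assumed average cached tokens per request

CONCURRENT = 50_000  # from the challenge statement
GPUS = 1_000         # from the challenge statement
HBM_PER_GPU_GIB = 80 # H100 80GB

# K and V tensors, per layer, per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
kv_gib_per_request = kv_bytes_per_token * AVG_CONTEXT / 2**30

requests_per_gpu = CONCURRENT / GPUS
kv_gib_per_gpu = requests_per_gpu * kv_gib_per_request

print(f"{kv_bytes_per_token / 2**10:.0f} KiB per cached token")   # ~320 KiB
print(f"{kv_gib_per_request:.2f} GiB KV cache per request")       # ~1.25 GiB
print(f"{requests_per_gpu:.0f} requests per GPU on average")      # 50
print(f"{kv_gib_per_gpu:.1f} GiB KV cache per GPU "
      f"vs {HBM_PER_GPU_GIB} GiB HBM")                             # ~62.5 GiB
```

Under these assumptions, KV cache alone averages roughly 62.5 GiB of each 80 GiB H100 before any model weights are placed, which is why paged KV-cache allocation and cache-aware admission control feature in most serving designs at this scale.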