Design a distributed web crawler that can fetch billions of web pages efficiently. The crawler must respect robots.txt rules, enforce politeness policies so that no single host is overloaded, deduplicate content, and prioritize URLs by importance and freshness. Key features: crawl web pages starting from a seed set of URLs, then extract and follow hyperlinks to discover new pages.
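A minimal, single-node sketch of the core crawl loop helps make these requirements concrete. It uses only Python's standard library; the user-agent string, politeness delay, and in-memory frontier/dedup structures are illustrative stand-ins for what would be sharded services at real scale. Robots.txt handling relies on urllib.robotparser, and content dedup hashes the response body with SHA-256.

```python
import hashlib
import heapq
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-crawler/0.1"   # illustrative user-agent string
POLITENESS_DELAY_S = 1.0             # assumed minimum gap between requests to one host

seen_urls = set()      # URL-level dedup; a real crawler would shard this or use a Bloom filter
seen_content = set()   # content-level dedup via SHA-256 of the response body
last_fetch = {}        # host -> monotonic timestamp of the last request
robots_cache = {}      # host -> parsed robots.txt rules
frontier = []          # min-heap of (priority, url); lower priority value = crawl sooner


def allowed_by_robots(url):
    """Fetch and cache robots.txt for the URL's host, then check permission."""
    parts = urlsplit(url)
    host = parts.netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable: can_fetch() then conservatively reports disallow
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(USER_AGENT, url)


def polite_wait(url):
    """Block until at least POLITENESS_DELAY_S has passed since the host was last hit."""
    host = urlsplit(url).netloc
    elapsed = time.monotonic() - last_fetch.get(host, 0.0)
    if elapsed < POLITENESS_DELAY_S:
        time.sleep(POLITENESS_DELAY_S - elapsed)
    last_fetch[host] = time.monotonic()


def enqueue(url, priority=1.0):
    """Add a URL to the frontier unless it has been seen before."""
    if url not in seen_urls:
        seen_urls.add(url)
        heapq.heappush(frontier, (priority, url))


def crawl(seed_urls, max_pages=10):
    for url in seed_urls:
        enqueue(url, priority=0.0)   # seeds get the highest priority
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if not allowed_by_robots(url):
            continue
        polite_wait(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue                 # fetch failed; a real crawler would schedule a retry
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_content:
            continue                 # exact-duplicate body: skip link extraction
        seen_content.add(digest)
        fetched += 1
        # Link extraction is elided: discovered hrefs would be normalized with
        # urllib.parse.urljoin(url, href) and passed to enqueue() with a priority score.


if __name__ == "__main__":
    crawl(["https://example.com/"], max_pages=3)
```

In a distributed deployment, the frontier, the seen-URL set, and the per-host politeness state would typically be partitioned by host across crawler nodes, so each host is owned by exactly one node and politeness can be enforced locally.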
Scale targets:
- Pages crawled per day: 1B
- URL frontier size: 10B URLs
- Crawler nodes: 5,000
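These targets imply a modest per-node fetch rate but substantial aggregate bandwidth. A quick back-of-envelope check (the average page size is an assumed figure, not part of the problem statement):

```python
# Back-of-envelope check on the scale targets above.
PAGES_PER_DAY = 1_000_000_000      # 1B pages/day target
CRAWLER_NODES = 5_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_KB = 500                  # assumed average response size; not given in the problem

fleet_pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY             # ~11,600 pages/s fleet-wide
node_pages_per_sec = fleet_pages_per_sec / CRAWLER_NODES          # ~2.3 pages/s per node
ingress_gbps = fleet_pages_per_sec * AVG_PAGE_KB * 8 / 1_000_000  # ~46 Gbps aggregate download

print(f"{fleet_pages_per_sec:,.0f} pages/s fleet-wide, "
      f"{node_pages_per_sec:.1f} pages/s per node, "
      f"~{ingress_gbps:.0f} Gbps ingress")
```

At roughly 2-3 pages per second per node, the per-node fetch rate is easy to sustain; the harder problems are the aggregate bandwidth, the 10B-entry frontier (on the order of a terabyte, assuming ~100 bytes per stored URL), and coordinating politeness and dedup across 5,000 nodes.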