Design a distributed web crawler that can fetch billions of web pages efficiently. The crawler must respect robots.txt rules, enforce politeness policies so that no single host is overloaded, deduplicate content, and prioritize URLs by importance and freshness. Key features: crawl web pages starting from a seed set of URLs, then extract and follow hyperlinks to discover new pages.
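A minimal, single-node sketch of the core crawl loop helps make these requirements concrete. It uses only Python's standard library; the user-agent string, politeness delay, and in-memory frontier/dedup structures are illustrative stand-ins for what would be sharded services at real scale. Robots.txt handling relies on urllib.robotparser, and content dedup hashes the response body with SHA-256.

```python
import hashlib
import heapq
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-crawler/0.1"   # illustrative user-agent string
POLITENESS_DELAY_S = 1.0             # assumed minimum gap between requests to one host

seen_urls = set()      # URL-level dedup; a real crawler would shard this or use a Bloom filter
seen_content = set()   # content-level dedup via SHA-256 of the response body
last_fetch = {}        # host -> monotonic timestamp of the last request
robots_cache = {}      # host -> parsed robots.txt rules
frontier = []          # min-heap of (priority, url); lower priority value = crawl sooner


def allowed_by_robots(url):
    """Fetch and cache robots.txt for the URL's host, then check permission."""
    parts = urlsplit(url)
    host = parts.netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable: can_fetch() then conservatively reports disallow
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(USER_AGENT, url)


def polite_wait(url):
    """Block until at least POLITENESS_DELAY_S has passed since the host was last hit."""
    host = urlsplit(url).netloc
    elapsed = time.monotonic() - last_fetch.get(host, 0.0)
    if elapsed < POLITENESS_DELAY_S:
        time.sleep(POLITENESS_DELAY_S - elapsed)
    last_fetch[host] = time.monotonic()


def enqueue(url, priority=1.0):
    """Add a URL to the frontier unless it has been seen before."""
    if url not in seen_urls:
        seen_urls.add(url)
        heapq.heappush(frontier, (priority, url))


def crawl(seed_urls, max_pages=10):
    for url in seed_urls:
        enqueue(url, priority=0.0)   # seeds get the highest priority
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if not allowed_by_robots(url):
            continue
        polite_wait(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue                 # fetch failed; a real crawler would schedule a retry
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_content:
            continue                 # exact-duplicate body: skip link extraction
        seen_content.add(digest)
        fetched += 1
        # Link extraction is elided: discovered hrefs would be normalized with
        # urllib.parse.urljoin(url, href) and passed to enqueue() with a priority score.


if __name__ == "__main__":
    crawl(["https://example.com/"], max_pages=3)
```

In a distributed deployment, the frontier, the seen-URL set, and the per-host politeness state would typically be partitioned by host across crawler nodes, so each host is owned by exactly one node and politeness can be enforced locally.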
Scale targets:
- Pages crawled per day: 1B
- URL frontier size: 10B URLs
- Crawler nodes: 5,000
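These targets imply a modest per-node fetch rate but substantial aggregate bandwidth. A quick back-of-envelope check (the average page size is an assumed figure, not part of the problem statement):

```python
# Back-of-envelope check on the scale targets above.
PAGES_PER_DAY = 1_000_000_000      # 1B pages/day target
CRAWLER_NODES = 5_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_KB = 500                  # assumed average response size; not given in the problem

fleet_pages_per_sec = PAGES_PER_DAY / SECONDS_PER_DAY             # ~11,600 pages/s fleet-wide
node_pages_per_sec = fleet_pages_per_sec / CRAWLER_NODES          # ~2.3 pages/s per node
ingress_gbps = fleet_pages_per_sec * AVG_PAGE_KB * 8 / 1_000_000  # ~46 Gbps aggregate download

print(f"{fleet_pages_per_sec:,.0f} pages/s fleet-wide, "
      f"{node_pages_per_sec:.1f} pages/s per node, "
      f"~{ingress_gbps:.0f} Gbps ingress")
```

At roughly 2-3 pages per second per node, the per-node fetch rate is easy to sustain; the harder problems are the aggregate bandwidth, the 10B-entry frontier (on the order of a terabyte, assuming ~100 bytes per stored URL), and coordinating politeness and dedup across 5,000 nodes.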