BigTable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
See UW presentation by Jeff Dean from the link on the seminar page, or just
google for “google bigtable”
Data Storage: BigTable
What is it, really?
• 10-ft view: row & column abstraction for storing data
• Reality: a sparse, distributed, persistent, multi-dimensional sorted map
– (row, column, timestamp) → cell contents (sketched in code below)
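To make the "sorted map" bullet concrete, here is a minimal single-machine sketch in Go. The types and names are illustrative assumptions, not Google's API; it only shows the logical model the paper describes: every cell is addressed by (row, column, timestamp), values are uninterpreted byte strings, and a scan visits keys in sorted order with the newest version of a cell first.

package main

import (
	"fmt"
	"sort"
)

// Key addresses one cell in the BigTable model:
// (row, column, timestamp) -> uninterpreted string of bytes.
type Key struct {
	Row       string
	Column    string // written "family:qualifier" in BigTable
	Timestamp int64
}

// less orders keys by row, then column, then *descending*
// timestamp, so the newest version of a cell sorts first.
func less(a, b Key) bool {
	if a.Row != b.Row {
		return a.Row < b.Row
	}
	if a.Column != b.Column {
		return a.Column < b.Column
	}
	return a.Timestamp > b.Timestamp
}

func main() {
	// A toy, in-memory stand-in for the distributed map.
	cells := map[Key][]byte{
		{"com.cnn.www", "contents:", 2}:        []byte("<html>... v2"),
		{"com.cnn.www", "contents:", 1}:        []byte("<html>... v1"),
		{"com.cnn.www", "anchor:cnnsi.com", 1}: []byte("CNN"),
	}
	// The "sorted" part: scan cells in key order.
	keys := make([]Key, 0, len(cells))
	for k := range cells {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return less(keys[i], keys[j]) })
	for _, k := range keys {
		fmt.Printf("%-12s %-18s t=%d  %s\n", k.Row, k.Column, k.Timestamp, cells[k])
	}
}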
• Design/initial implementation started beginning of 2004
– Took about 12 people roughly 8 months to put it into active use
• By Nov. 2005, there were ~100 BigTable cells
• Production use or active development for many internal projects:
– Google Print
– My Search History
– Crawling/indexing pipeline
– Google Maps/Google Earth
• Largest BigTable cell manages ~200 TB of data spread over several thousand machines (larger cells planned)
Sample Problem Domains
• Offline batch jobs (contrasted with online access in the sketch after this list)
– Large datasets (PBs), bulk reads/writes (MB chunks)
– Short outages acceptable
– Web indexing, log processing, satellite imagery, etc.
• Online applications
– Smaller datasets (TBs), small reads/writes (KBs)
– Outages immediately visible to users, low latency vital
– Web search, Orkut, GMail, Google Docs, etc.
• Many areas: IR, machine learning, image/video
processing, NLP, machine translation, ...
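A toy sketch of how differently these two domains touch the store, assuming a hypothetical read interface (Get/Scan are stand-in names, not BigTable's real client API): online applications issue small, latency-critical point reads, while batch jobs stream over whole row ranges.

package main

import (
	"fmt"
	"strings"
)

// Store is a hypothetical read interface over a BigTable-like system.
type Store interface {
	// Get: a small (KB-sized), latency-critical point read (online apps).
	Get(row, column string) ([]byte, bool)
	// Scan: a bulk sequential read over a row range (offline batch jobs),
	// typically streamed from disk in large (MB-sized) chunks.
	Scan(startRow, endRow string, fn func(row string, value []byte))
}

// mem is a toy in-memory Store for demonstration only.
type mem map[string][]byte

func (m mem) Get(row, column string) ([]byte, bool) {
	v, ok := m[row+"/"+column]
	return v, ok
}

func (m mem) Scan(startRow, endRow string, fn func(string, []byte)) {
	for k, v := range m { // a real system iterates in sorted key order
		row := strings.SplitN(k, "/", 2)[0]
		if row >= startRow && row < endRow {
			fn(row, v)
		}
	}
}

func main() {
	var s Store = mem{"com.cnn.www/contents:": []byte("<html>...")}

	// Online pattern: one low-latency point lookup.
	if v, ok := s.Get("com.cnn.www", "contents:"); ok {
		fmt.Printf("point read: %d bytes\n", len(v))
	}

	// Batch pattern: sweep an entire row range.
	s.Scan("com.", "com/", func(row string, v []byte) { // '/' sorts just after '.'
		fmt.Printf("scanned %s: %d bytes\n", row, len(v))
	})
}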
Typical New Engineer
• Never seen a petabyte
• Never used a cluster of thousands of machines
• Never really dealt with constant machine failures
How do we design software to make them successful?
• Lots of (semi-)structured data at Google
– URLs:
• Contents, crawl metadata, links, anchors, PageRank, … (row-key sketch after this list)
– Per-user data:
• User preference settings, recent queries/search results, …
– Geographic locations:
• Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large
– Billions of URLs, many versions/page (~20 KB/version)
– Hundreds of millions of users, thousands of queries/sec
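Following the paper's Webtable example, one way crawl data like this maps onto the row space: the row key is the URL with its hostname reversed, so pages from the same domain sort next to each other and can be read with a single range scan. The helper below is an illustrative sketch under that scheme, not production crawl code.

package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// rowKey builds a BigTable-style row key from a URL by reversing
// the hostname: http://www.cnn.com/world -> com.cnn.www/world.
func rowKey(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	parts := strings.Split(u.Hostname(), ".")
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return strings.Join(parts, ".") + u.EscapedPath()
}

func main() {
	urls := []string{
		"http://www.cnn.com/world",
		"http://maps.google.com/",
		"http://www.cnn.com/sports",
	}
	keys := make([]string, 0, len(urls))
	for _, u := range urls {
		keys = append(keys, rowKey(u))
	}
	sort.Strings(keys) // both cnn.com pages now sort adjacently
	fmt.Println(strings.Join(keys, "\n"))
}

Per-version page contents (~20 KB per crawl) would then live in timestamped cells under each row, which is how billions of URLs times many versions add up to petabytes.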