This post is a summary of the paper:
The anatomy of a large-scale hypertextual Web search engine, Sergey Brin and Lawrence Page, 1998
Introduction
This is a well-known paper. Published in 1998, it describes an early prototype of Google, a large-scale search engine.
According to the authors, this paper was the first in-depth public description of a large-scale web search engine.
The authors describe how to build a practical large-scale search system that exploits information present in hypertext, thus producing better search results than existing search systems. While reading this paper, keep in mind that it was written in 1998, when popular search engines like Yahoo were built on mostly human-maintained indices.
Google was designed to scale well to extremely large data sets from the very beginning. This included fast crawling to gather web documents and to keep them up to date in the index, efficient use of storage space to store the index, highly optimized data structures for fast and efficient access, and low-latency query handling.
Architecture
Google downloads web pages in a process called crawling using several distributed crawlers. The crawlers receive URLs to fetch from a URL Server and send the downloaded web pages or documents to a Store Server.
The Store Server compresses the documents and stores them in a repository.
The Indexer reads the repository, decompresses and parses the documents. In doing so, it creates a distributed forward index recording word occurrences in each document, the position of these words in the document, approximate font size and capitalization. The Indexer also parses out all links and stores information about these links in an anchors file.
The URL Resolver reads the anchors file, converts URLs to docIDs and puts anchor text in the forward index associated with the docID that the anchor points to. It also generates the link database recording pairs of docIDs connected by links. This database is used to compute PageRanks for all documents.
The Sorter sorts the forward index by wordID to generate the inverted index and produces a list of wordIDs and offsets into the inverted index.
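The forward-to-inverted transformation can be sketched with toy in-memory structures (a simplification: the real barrels were large on-disk structures, and the data here is purely illustrative):

```python
from collections import defaultdict

# Toy forward index: docID -> list of (wordID, position) hits.
forward_index = {
    1: [(10, 0), (42, 1), (10, 2)],
    2: [(42, 0), (7, 1)],
}

def invert(forward):
    """Regroup hits by wordID to produce an inverted index:
    wordID -> sorted list of (docID, position) postings."""
    inverted = defaultdict(list)
    for doc_id, hits in forward.items():
        for word_id, pos in hits:
            inverted[word_id].append((doc_id, pos))
    # Sort words and, within each word, the postings by docID then position.
    return {w: sorted(p) for w, p in sorted(inverted.items())}

inv = invert(forward_index)
# inv[42] now lists every (docID, position) at which wordID 42 occurs,
# which is exactly the shape a query-time lookup needs.
```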
DumpLexicon combines this list with the lexicon generated by the Indexer to build a lexicon to be used by Searcher.
The Searcher uses this lexicon, inverted index and PageRank to serve queries.
Crawling
The authors describe interesting problems in this part of the system that go beyond the performance and reliability concerns we typically see in a large-scale distributed system.
At the time when Google was built, few website owners knew about the robots exclusion protocol. Some tried to protect their pages from being indexed by writing a statement in plain English that the page was copyrighted and should not be indexed. For obvious reasons, this was difficult for web crawlers to understand.
There were times when the crawlers would encounter special websites like online games and try to crawl them. This resulted in unexpected behaviors in the applications as well as in the crawlers. This is an example of obscure problems that may occur on a few pages out of the whole web. Problems like these do not surface until the crawlers have downloaded millions of pages.
Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet.
Google harnessed a fast distributed crawling system to scale to hundreds of millions of pages. Each crawler kept roughly 300 connections open at once and maintained its own DNS cache so that it did not need to perform a fresh DNS lookup before crawling each document. Typically, about three crawlers ran at any point in time.
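The per-crawler DNS cache can be sketched as a simple memoizing wrapper. This is a minimal illustration, not the paper's implementation; the resolver is injected as a plain function so the sketch needs no network access, and the hostnames and IP address are made up:

```python
class DNSCache:
    """Per-crawler DNS cache: resolve each hostname at most once,
    then reuse the cached answer for every later URL on that host."""

    def __init__(self, resolver):
        self._resolver = resolver  # function: hostname -> IP address
        self._cache = {}
        self.lookups = 0  # number of actual resolver calls made

    def resolve(self, hostname):
        if hostname not in self._cache:
            self._cache[hostname] = self._resolver(hostname)
            self.lookups += 1  # only cache misses hit the real resolver
        return self._cache[hostname]

# Stub resolver standing in for a real (slow) DNS lookup.
dns = DNSCache(lambda host: "93.184.216.34")
for url_host in ["example.com", "example.com", "example.org"]:
    dns.resolve(url_host)
# Three URLs but only two distinct hostnames, so only two real lookups.
```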
Indexing
Parsing is the core and one of the most failure-prone components of indexing. A parser that is supposed to run on the entire Web must handle a huge array of corner cases and error scenarios.
Each document is parsed and encoded into one of several shards that together comprise the forward index (the authors call these shards barrels).
To make the indexers run in parallel without sharing state, Google maintained a base lexicon with 14 million words to convert each word into a wordID. Words that were not contained in the base lexicon were written in a log file that was processed in the end by a final indexer.
Fixing the size of the base lexicon at 14 million words had an additional benefit: the lexicon could be held entirely in memory on a single machine with 256 MB of main memory. This is one of the ways disk seeks were avoided.
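The lookup-or-log behavior can be sketched as follows. The three-word base lexicon stands in for the real 14-million-word one, and the class and method names are invented for the illustration:

```python
class Lexicon:
    """Toy stand-in for Google's fixed base lexicon. Unknown words are
    appended to a log for a final indexing pass, so parallel indexers
    never have to mutate shared state."""

    def __init__(self, base_words):
        self.word_ids = {w: i for i, w in enumerate(base_words)}
        self.overflow_log = []

    def lookup(self, word):
        wid = self.word_ids.get(word)
        if wid is None:
            self.overflow_log.append(word)  # deferred to the final indexer
        return wid

lex = Lexicon(["web", "search", "engine"])
ids = [lex.lookup(w) for w in ["web", "hypertext", "engine"]]
# "hypertext" is not in the base lexicon, so it gets no wordID yet
# and is logged for the final indexer to process.
```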
To create the inverted index, the Sorter takes each barrel and sorts it by wordID.
Since each barrel cannot fit in memory, the sorter subdivides barrels into smaller parts based on wordID and docID, sorting each part and writing it out to form the complete inverted index.
Multiple sorters are run to parallelize the sorting phase.
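The subdivide-then-sort step can be illustrated with a small bucket sort. This is a sketch under toy assumptions (tiny in-memory lists standing in for on-disk barrels; the bucket count and wordID range are arbitrary):

```python
def sort_barrel(hits, num_buckets=4, max_word_id=100):
    """Sort (wordID, docID) hits that notionally don't fit in memory:
    partition by wordID range, sort each small bucket independently,
    then concatenate the buckets in order."""
    width = (max_word_id + num_buckets - 1) // num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for hit in hits:
        buckets[hit[0] // width].append(hit)  # route by wordID range
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))  # each bucket fits in memory on its own
    return out

barrel = [(90, 2), (5, 1), (40, 3), (5, 7), (77, 1)]
sorted_hits = sort_barrel(barrel)
# The concatenated buckets equal a full sort of the barrel, but no
# single sort ever had to hold the whole barrel at once.
```

Because each bucket is independent, the buckets are also a natural unit of parallelism, matching the paper's use of multiple sorters.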
Anchor Text
Handling anchor text differently was one of the distinguishing features of Google.
At the time, most search engines associated the text of a link with the page the link was found on. Google went a step further and associated the anchor text with the page the link points to. This is called anchor propagation, and it helps produce better-quality search results.
Anchor propagation has two advantages: anchors often provide more accurate descriptions of web pages than the pages themselves, and anchors exist for documents that cannot be indexed by a text-based search engine, such as images, programs, and databases.
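The core of anchor propagation is a regrouping step: index each anchor's text under the target docID rather than the source. A minimal sketch, with made-up link data:

```python
from collections import defaultdict

# Toy link data extracted by the parser:
# (source docID, target docID, anchor text).
links = [
    (1, 3, "great image gallery"),
    (2, 3, "photo collection"),
    (2, 1, "home page"),
]

def propagate_anchors(links):
    """Associate each anchor's text with the page it points TO,
    not the page it appears on."""
    anchor_index = defaultdict(list)
    for _source, target, text in links:
        anchor_index[target].append(text)
    return dict(anchor_index)

anchors = propagate_anchors(links)
# Doc 3 (say, an image page with no indexable text of its own) is now
# searchable through the words other pages use to describe it.
```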
Searching
At the time when Google was built, there were already many commercial search engines in the market. However, they relied on human-curated lists of web pages and/or produced low-quality search results.
Google focused more on quality of search by exploiting information present in the web page hypertext (like the position of a word, font, capitalization) and developing PageRank, a measure of a web page’s importance based on citations (or links that point to it).
PageRank
The concept of counting citations to a given page to estimate a page’s importance or quality had already been explored before Google was built.
PageRank extended the idea by not counting links from all pages equally and by normalizing the number of links on a page.
The PageRank of a page A is given as follows:
PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where:
- T1, T2, ..., Tn are the pages that point to page A
- d is a damping factor, usually set to 0.85
- C(A) is the number of links going out of page A
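The formula can be computed by fixed-point iteration over a link graph. The sketch below implements the paper's unnormalized form exactly as written, on a made-up three-page web:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T
    that link to A, starting from uniform ranks, until it converges."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(
                pr[q] / len(outs)  # q's rank split over its C(q) out-links
                for q, outs in links.items() if p in outs
            )
            for p in pages
        }
    return pr

# Toy web: A and B link to each other; C links only to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
# A is pointed to by both B and C, so it earns the highest rank;
# nothing links to C, so its rank settles at exactly 1 - d = 0.15.
```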
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back”, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability, at each page, that the random surfer will get bored and request another random page.
Ranking
Google maintained much more information about web documents than typical search engines of its time. Combining all of this information into a rank became a challenging task.
The ranking function was designed so that no single feature, such as position, font, capitalization, PageRank, or anchor text, could have too much influence.
Each type of hit (title, anchor, URL, plain text, font, etc.) was assigned its own type-weight, and the type-weights form a vector indexed by type. A vector of count-weights was similarly produced by counting the number of hits of each type. The IR score was computed as the dot product of the count-weight and type-weight vectors.
PageRank was finally combined with the IR score to arrive at the final rank of the document.
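This scoring scheme can be sketched in a few lines. Note the hedges: the paper does not publish its actual weights, its count-weight taper, or the exact way PageRank is folded in, so every numeric value and the combination function below are assumptions chosen only to make the structure concrete:

```python
# Illustrative type-weights; the paper's real values were not published.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def count_weight(n, cap=4):
    """Crude taper: counts beyond `cap` add nothing, roughly mirroring
    the paper's note that count-weights quickly stop growing."""
    return min(n, cap)

def ir_score(hit_counts):
    """Dot product of the type-weight and count-weight vectors."""
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    """One simple way to combine PageRank with the IR score;
    the paper does not specify the exact combination."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

# A document with one title hit and ten plain-text hits, PageRank 2.0.
score = final_rank({"title": 1, "plain": 10}, pagerank=2.0)
```

The cap on count-weights is what keeps any single feature, here raw hit counts, from dominating the rank, as described above.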
Noteworthy is the fact that a user feedback mechanism was built in the system that allowed a trusted user to evaluate search results and save feedback. The feedback was then used to evaluate the impact of changing the ranking function parameters one way or the other.
Conclusion
The paper describes Google’s architecture, its distinguishing features, and how it was able to produce better-quality search results than most search engines of its time. These were just the initial steps for Google, before it grew to dominate web search.
A key takeaway here is that a focus on scalability, quality, user feedback and research is something that most large-scale systems benefit from.
