Information Retrieval
Information Retrieval (IR) is the science of searching a collection of resources for material that satisfies an information need.
The most common form of an IR system is a text retrieval system that allows users to retrieve textual data in the form of a sorted list of documents using natural language queries. Text retrieval forms the fundamental basis of most search engines.
IR systems can be distinguished by the scale at which they operate.
On one end is web search where the system provides search capabilities over billions of documents stored on millions of computers.
On the other end is personal information retrieval. These systems run within a user’s operating system or an application such as an email client, providing search capabilities over files stored on the user’s computer.
In between is the space of enterprise, institutional, and domain-specific search where search capabilities may be provided over a corporation’s internal documents.
Each of these scales of operation presents its own distinctive issues.
Basic Terminology
A document is the unit that we have decided to build a retrieval system on. For example, a web search engine might treat a web page as a document.
The set of all documents over which we perform retrieval is known as the collection or the corpus.
The most standard IR task is the ad hoc retrieval task in which the IR system aims to provide documents that are relevant to an arbitrary user information need communicated to the system by means of a user-initiated query. The task is named such because the user creates an “ad hoc” query.
Thus, the information need is the topic that the user desires to know more about.
A query is the means by which the user attempts to convey the information need to the system.
A document is relevant if the user perceives it as containing information that satisfies the information need.
Why do we need more sophisticated IR systems if we have Grep?
The simplest form of IR is for a computer to read through all the documents in the collection and remember the documents that match the user’s information need (query).
In fact, this is how the Unix command-line utility grep searches for text and hence, the process is commonly referred to as grepping through text.
Simple querying of modest collections does not usually require a more complex IR system.
A more complex IR system is needed when:
- the collection size is large and queries need to be processed quickly. This is true for almost all search engines that perform retrieval over online data.
- the information need requires more flexible matching operations. For example, grep is not ideal to process queries that contain proximity constraints like “Caesar NEAR Brutus” where NEAR is defined as “within 5 words”
- we require ranked retrieval, i.e., we want to be able to say that a particular document is better than some other document in satisfying the information need, even though both documents match the query.
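The naive linear-scan (“grepping”) approach described above can be sketched in Python; the toy corpus and the helper name `grep` are illustrative, not a real implementation:

```python
# Hypothetical toy corpus: one string per document
docs = [
    "Brutus killed Caesar",
    "Caesar conquered Gaul",
    "Calpurnia dreamed of Caesar",
]

def grep(query, docs):
    # Linear scan: cost grows with the total size of the collection,
    # every document is read on every query, and there is no ranking
    return [d for d in docs if query in d]

print(grep("Caesar", docs))   # matches all three documents
print(grep("Brutus", docs))   # matches only the first document
```

For a handful of short documents this is perfectly adequate; the limitations listed above only bite as the collection and query load grow.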
Index
The first step towards building an efficient IR system that can process large collections is to avoid linear scanning of documents for each query. We achieve this by indexing the documents in advance.
An index, more commonly referred to as an inverted index, maps terms to the parts of the documents where they occur. Terms are the products of tokenization, linguistic preprocessing, and normalization of the text in the documents. For now, we assume that tokens, normalized tokens, and terms are loosely equivalent to words.
We keep a dictionary of terms, also known as the term vocabulary or lexicon. For each term, we keep a list called the postings list. Each item in the list is called a posting. A posting records which document the term (to which the postings list is mapped) appears in. A posting may also contain other information, such as the positions in the document at which the term appears.
All the postings lists taken together are referred to as postings.
The major steps to build the index are as follows:
- Collect the documents to be indexed.
- Tokenize the text. This step converts each document into a list of tokens.
- Tokens need to be normalized. This step also performs language aware text analysis or linguistic preprocessing. The result of this step is a list of normalized tokens which become the indexing terms.
- Create an inverted index consisting of the term vocabulary and corresponding postings.
During index construction, we assign a successive integer to each new document as it is first encountered; this integer becomes the document identifier (docID). The docID is a unique serial number that identifies a document in a collection. Hence, each item in a postings list needs to store only the docID to record that a term appears in a particular document.
Each postings list is sorted by the docID. This provides the basis for efficient query processing as we’ll see later.
The term dictionary can also record some statistics, such as the number of documents which contain each term (the document frequency, which is also the length of each postings list in the model above). Statistics like document frequency may be used at query time to improve the efficiency of a search and/or to improve the relevance ranking in ranked retrieval models.
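The index construction steps above can be sketched in Python; the toy corpus, the crude whitespace tokenization, and the variable names are illustrative assumptions:

```python
from collections import defaultdict

# Toy corpus keyed by docID, with docIDs assigned in order of first encounter
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(list)       # term -> postings list of docIDs
for doc_id in sorted(docs):     # processing in docID order keeps postings sorted
    terms = set(docs[doc_id].split())   # crude tokenization; dedup per document
    for term in terms:
        index[term].append(doc_id)

print(index["sales"])       # postings list for "sales": [1, 2, 3]
print(len(index["july"]))   # document frequency of "july": 2
```

Real indexers replace the whitespace split with proper tokenization and normalization, but the shape of the data structure (dictionary of sorted postings lists) is the same.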
Retrieval Models
IR systems based on many different retrieval models or retrieval strategies exist.
An IR model defines a way to represent the contents of a document and a query. It also defines how the document and query representations can be compared with each other to match documents to a query and to rank matching documents.
In this post, we will focus on one of the simplest models - the Boolean Retrieval Model. We will talk about other retrieval models in subsequent posts in this series.
Boolean Retrieval Model
The Boolean retrieval model is a model in which a query takes the form of a Boolean expression of terms. Terms are combined with the operators AND, OR, and NOT to compose queries.
This model views each document as just a set of words. In other words, a document is modeled as a logical conjunction of the terms it contains.
The Boolean retrieval model is a set-based model.
Let us look at an example. Suppose that document D1 contains terms t1 and t2 and document D2 contains the terms t2, t3 and t4.
The Boolean model represents the documents as sets according to the following conditions:
D1 = t1 AND t2
D2 = t2 AND t3 AND t4
Assume that the user issues the following queries:
Q1 = t2
Q2 = (t1 OR t3) AND (NOT t4)
According to the Boolean model:
- Q1 matches both D1 and D2
- Q2 matches only D1
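The example above can be checked with a small Python sketch that models documents as term sets; the names D1, D2, and the query functions are illustrative:

```python
# Documents as term sets, per the set-based Boolean model
D1 = {"t1", "t2"}
D2 = {"t2", "t3", "t4"}
collection = {"D1": D1, "D2": D2}

def q1(doc):
    # Q1 = t2
    return "t2" in doc

def q2(doc):
    # Q2 = (t1 OR t3) AND (NOT t4)
    return ("t1" in doc or "t3" in doc) and "t4" not in doc

print([name for name, d in collection.items() if q1(d)])  # ['D1', 'D2']
print([name for name, d in collection.items() if q2(d)])  # ['D1']
```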
Query processing
Given an inverted index and the basic Boolean retrieval model, how do we go about processing a user issued query?
Assume that the user issues a simple conjunctive query:
Q = cat AND dog
- Find cat in the term vocabulary.
- Retrieve its postings. Say the postings contain docIDs {1, 2, 3, 6, 99, 100}.
- Find dog in the term vocabulary.
- Retrieve its postings. Say the postings contain docIDs {3, 51, 54, 76, 99, 290}.
- Intersect the two postings lists. In this case, the intersection is {3, 99}.
- Return the intersection as the search result.
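The steps above can be sketched with Python sets, using the hypothetical postings from the example (real systems use the merge algorithm described in the next section rather than materializing sets):

```python
# Hypothetical index fragment: term -> docID-sorted postings list
index = {
    "cat": [1, 2, 3, 6, 99, 100],
    "dog": [3, 51, 54, 76, 99, 290],
}

# For the conjunctive query "cat AND dog", look up both postings lists
# and intersect them; sorting restores docID order in the result
result = sorted(set(index["cat"]) & set(index["dog"]))
print(result)  # [3, 99]
```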
Postings List Intersection / Postings Merge
A critical part of query processing is intersecting two or more postings lists. Postings list intersection is sometimes also called postings list merge, because the algorithm belongs to the general family of algorithms that combine (merge) multiple sorted lists by interleaved advancing of pointers through each list.
A simple method to perform the intersection is by using the following algorithm:
INTERSECT(p1, p2)
    answer ← {}
    while p1 != NIL and p2 != NIL
        do if docID(p1) == docID(p2)
            then ADD(answer, docID(p1))
                 p1 ← next(p1)
                 p2 ← next(p2)
            else if docID(p1) < docID(p2)
                then p1 ← next(p1)
                else p2 ← next(p2)
    return answer
This algorithm runs in time linear in the combined length of the postings lists. Note that the algorithm assumes that postings lists are sorted by docID. As we noted earlier, sorting the postings during index construction is key to achieving efficiency in query processing.
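The merge can be sketched in Python over docID-sorted lists; the function name and the sample postings are illustrative:

```python
def intersect(p1, p2):
    """Linear-time intersection of two docID-sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 3, 6, 99, 100], [3, 51, 54, 76, 99, 290]))  # [3, 99]
```

Because both pointers only move forward, each posting is examined at most once, which gives the linear running time noted above.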
The above algorithm can be extended to arbitrarily complex Boolean queries.
More in this series
In subsequent posts of this series focusing on information retrieval, we’ll go through more advanced concepts like using proximity operators to extend the Boolean model, query optimization techniques, efficient index construction, distributed and dynamic indexing, ranked retrieval models, tf-idf weighting, probabilistic information retrieval, and web search, to name a few.
