
Web Scale Responsive Visual Search at Bing


This post is a summary of the paper:
Web-Scale Responsive Visual Search at Bing, Hu et al., 2018

Introduction

Visual search is an interesting research area. A visual search system, when presented with a query image, returns a ranked list of visually similar images.

As in all search systems, latency and relevance of returned results are key metrics to evaluate such a system.

This paper delves into the design of Microsoft Bing’s web-scale general visual search system. The system indexes tens of billions of images with thousands of features each, and responds in less than 200 ms. It employs a multi-phase learning-to-rank framework based on deep-learned visual features.

Challenges

It is challenging to achieve high relevance, low latency and high storage scalability at the same time in a web-scale visual search system.

Since a web-scale system is not restricted to a certain vertical or a certain website, it contains tens of billions of images. Approaches that work for smaller datasets stop being useful at this scale.

Even with sharding, storing visual features for each of these images strains storage scalability. The features cannot be stored on regular hard disks due to their low random-access performance.

Modern visual search engines use a learning-to-rank architecture that combines multiple features for better relevance. These features are expensive to compute, which poses another challenge at web scale.

In a web-scale search engine, even though the retrieval of images can be parallelized, the query feature extraction, data flow control among sharded instances, and final data aggregation require specific optimizations.

Image content understanding is another capability that a general search engine needs to focus on. Unlike vertical-specific engines, which usually have a controlled image database with well organized metadata, a general visual search engine’s database is unrestricted with unorganized (or unavailable) metadata.

Solving the above challenges requires thinking about trade-offs.

Optimization trade-offs

Relevance vs. latency

A learning-to-rank framework using deep-learned features provides great relevance at the cost of latency.

Bing uses a cascaded learning-to-rank framework that operates in multiple phases to trade off relevance against latency. A sharded inverted index is used in Level-0 to filter out “hopeless” images. In subsequent phases, progressively more expensive visual features are used to re-rank candidates.

Relevance vs. storage

Product Quantization is used to minimize storage requirements. The idea is to approximate feature vectors and reduce the storage space needed to store features. Since feature vectors are approximated, relevance takes a hit.

Latency vs. storage

As mentioned before, regular hard disks have low random-access performance. Since sequential access cannot be guaranteed, regular hard disks cannot deliver acceptable latencies.

Distributed Solid State Drives are used as a trade-off between latency and storage scalability.

Query Understanding

Recognizing the intent behind a query, whether textual or visual, is an important step towards producing relevant results.

Bing visual search extracts features from the query image to describe its content and generates an image caption to capture the key concept in the query image. The system also tailors the experience if a specific “scenario” is triggered.

For example, if the system detects a shopping intent, search results are rich in shopping information like store website links and prices.

This is a noteworthy feature that I’m sure plays a large part in user engagement and retention.

Relevance, latency and storage

Deep neural networks (DNNs) have dominated the space of image recognition and classification. Hence, it is not surprising that Bing visual search uses several state-of-the-art DNNs like AlexNet, GoogLeNet, ResNet and ZFSPPNet. DNN features, non-DNN visual features, object detection and text matching features are aggregated and used in the ranker.

As is usual in search engines, training data is collected from multiple sources, including human labeling and web scraping.

The system uses a cascading ranking framework. The framework includes Level-0 matching, Level-1 ranking and Level-2 ranking.

Each subsequent level uses more expensive features but operates on a smaller candidate set of images. This has the advantage of producing relevant results with lower latencies in general.

Level-0 matching generates the initial candidate set based on an inverted index. The index is built from visual words trained from joint k-means as cheap representations of images. After quantization, each image is represented by N visual words. The Level-0 matching filters out images whose visual words do not match the query image. This step reduces the size of the candidate set from billions of images to several million images within several milliseconds.
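The matching step can be sketched as follows. This is a toy Python illustration of inverted-index filtering over visual words; the function names and the `min_overlap` parameter are my own, not from the paper.

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index

def level0_match(index, query_words, min_overlap=1):
    """Return candidate image IDs sharing at least `min_overlap`
    visual words with the query image (hypothetical threshold)."""
    counts = defaultdict(int)
    for w in query_words:
        for image_id in index.get(w, ()):
            counts[image_id] += 1
    return {i for i, c in counts.items() if c >= min_overlap}
```

In production the index is sharded across machines and each shard filters its own partition, which is what makes the billions-to-millions reduction possible within milliseconds.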

The Level-1 ranker further reduces the number of candidate images. It computes a single DNN feature and is thus relatively lightweight. This step reduces the candidate set from millions to thousands of images.
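A lightweight ranker over one embedding can be sketched as a cosine-similarity top-k pass. This is my illustration under the assumption that the single DNN feature is compared by a simple vector similarity; the paper does not spell out the exact metric here.

```python
import numpy as np

def level1_rank(query_vec, candidate_vecs, top_k):
    """Rank candidates by cosine similarity to the query's DNN
    feature and keep only the top_k for the expensive Level-2 pass."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per candidate
    order = np.argsort(-sims)[:top_k] # indices of the best candidates
    return order, sims[order]
```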

The Level-2 ranker utilizes all the DNN features and ranks the candidate set to produce the final ordering. A LambdaMART ranking model (a multivariate regression tree model with a ranking loss function) takes the feature vector and produces a final ranking score. The features are expensive to compute, but the relatively small size of the candidate set ensures that this step executes in several milliseconds.
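At scoring time a boosted-tree ranker just sums tree outputs over the aggregated feature vector. The sketch below uses single-split stumps as stand-ins for real regression trees; an actual LambdaMART model is trained with a ranking loss and has deeper trees, so this only illustrates the scoring pass.

```python
def score(feature_vec, trees):
    """LambdaMART-style scoring: sum the outputs of regression trees.
    Each toy 'tree' is a (feature_index, threshold, left, right) stump."""
    total = 0.0
    for fi, thresh, left, right in trees:
        total += left if feature_vec[fi] <= thresh else right
    return total

def level2_rank(candidates, trees, top_k):
    """Order the {image_id: feature_vec} candidates by model score."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: score(kv[1], trees), reverse=True)
    return [image_id for image_id, _ in ranked[:top_k]]
```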

All features used in the above cascaded learning-to-rank framework need to be stored in memory for efficient retrieval.

Product Quantization (PQ) reduces the storage required for DNN features. With PQ, cluster centers are used to approximate feature vectors. The time cost of Euclidean distance calculation drops because distances involving cluster centers are pre-computed, and space is saved as well because only centroid IDs need to be stored in a dictionary.
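The core PQ idea fits in a few lines: split each vector into subvectors, replace each subvector with the ID of its nearest codebook centroid, and compute approximate distances against those centroids. This is a minimal sketch assuming pre-trained codebooks; real systems precompute the per-subvector distance tables once per query instead of looping as below.

```python
import numpy as np

def pq_encode(vec, codebooks):
    """Encode a vector as one centroid ID per subvector.
    codebooks: list of M arrays, each of shape (K, d/M)."""
    subs = np.split(vec, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
            for cb, s in zip(codebooks, subs)]

def pq_distance(query, codes, codebooks):
    """Asymmetric distance: exact query subvectors vs. stored
    centroid IDs, so only the IDs need to live in memory."""
    subs = np.split(query, len(codebooks))
    return sum(np.linalg.norm(s - cb[c]) ** 2
               for s, cb, c in zip(subs, codebooks, codes))
```

Storing one small integer ID per subvector instead of a block of floats is where the order-of-magnitude space saving reported in the paper comes from.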

Visual words reduce storage requirements in the matching step.

Sharding all features, visual words and PQ IDs across multiple (hundreds of) machines helps distribute the compute and storage burdens.
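A scatter-gather pattern over hash-partitioned shards can be sketched as follows. The hash-by-ID partitioning and per-shard top-k merge are my illustration of the general pattern; the paper does not specify Bing's exact partitioning scheme.

```python
import zlib

def shard_of(image_id, num_shards):
    """Deterministically map an image ID to one of num_shards machines."""
    return zlib.crc32(image_id.encode()) % num_shards

def build_shards(scores, num_shards):
    """Partition an {image_id: score} table across shards."""
    shards = [dict() for _ in range(num_shards)]
    for image_id, s in scores.items():
        shards[shard_of(image_id, num_shards)][image_id] = s
    return shards

def scatter_gather(shards, top_k):
    """Each shard returns its local top-k; the aggregator merges
    the partial lists into the global top-k."""
    partials = []
    for shard in shards:
        partials.extend(sorted(shard.items(), key=lambda kv: -kv[1])[:top_k])
    return sorted(partials, key=lambda kv: -kv[1])[:top_k]
```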

The three-layer architecture, in which each layer has its own specific storage requirements, also helps manage the distributed index more efficiently.

Concluding thoughts

This paper was an interesting read. It lays down challenges of a web-scale visual search system focusing on three main areas: relevance, latency and storage scalability.

DNN features are used to produce good relevance. A cascading learning-to-rank framework provides a scalable way to utilize these features.

The paper also discusses GPU/CPU acceleration, which I have not covered here. Also detailed in the paper are the results of various experiments the authors conducted with the proposed system; the system's NDCG@5 score was found to be higher than that of the other participating models.

Because of the optimizations in the cascaded re-ranking framework, the latency bottleneck was actually the retrieval and calculation of visual features. From the storage perspective, PQ showed roughly a 100x to 1000x saving compared to non-PQ implementations.

All in all, a promising architecture with many interesting directions and further optimizations that one can think of.