With the universal popularity of digital devices embedded with cameras and the fast development of the Internet, billions of people are sharing and browsing photos on the Web. Ubiquitous access to both digital photos and the Internet has opened the door to many emerging applications based on image search. Image search aims to efficiently retrieve visual documents relevant to a textual or visual query from a large-scale visual corpus.
In this blog post I'll describe the process of creating a landmark classification engine described in this paper. It's an implementation of a CNN-based local feature with attention, trained with weak supervision using image-level class labels only. This feature descriptor is referred to as DELF (DEep Local Feature), and the following figure illustrates the overall procedure of feature extraction and image retrieval.
The overall architecture of the created image retrieval system, using DEep Local Features (DELF) and attention-based keypoint selection. Left: an illustrated pipeline for extraction and selection of DELF. The portion highlighted in yellow represents an attention mechanism that is trained to assign high scores to relevant features and select the features with the highest scores. Feature extraction and selection can be performed with a single forward pass. Right: illustrated large-scale feature-based retrieval pipeline. DELF for database images are indexed offline. The index supports querying by retrieving nearest neighbor (NN) features, which can be used to rank database images based on geometrically verified matches.
Key issues in image retrieval
In content-based image retrieval, there are two fundamental challenges: the intention gap and the semantic gap. The intention gap refers to the difficulty a user faces in precisely expressing the expected visual content through a query at hand (such as an example image). The semantic gap originates from the difficulty of describing a high-level semantic concept with low-level visual features.
Technically speaking, there are three key issues in content-based image retrieval:
image representation,
image organization,
image similarity measurement.
Image representation originates from the fact that the intrinsic problem in content-based visual retrieval is image comparison. For the convenience of comparison, an image is transformed into some kind of feature space (vector, matrix, etc.). There is a saying that “An image is worth a thousand words”. However, it is nontrivial to identify those “words”. Usually, images are represented as one or multiple visual features. The representation is expected to be descriptive and discriminative to distinguish similar and dissimilar images. More importantly, it is also expected to be invariant to various transformations (translation, rotation, resizing, illumination change, etc.).
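As a toy illustration of comparing images in a feature space, assume each image has already been reduced to a single descriptor vector; a common similarity measure on such vectors is cosine similarity (this snippet is illustrative, not part of DELF):

```python
import numpy as np

def cosine_similarity(a, b):
    # Higher value = more similar descriptors; 1.0 means identical direction,
    # which makes the measure invariant to the overall scale of the features.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Scale invariance is one small example of the transformation invariance the representation is expected to have.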
In multimedia retrieval, the visual database is usually very large.
It is a nontrivial issue to organize the large scale database to efficiently identify the relevant results of a given query. Ideally, the similarity between images should reflect the semantic relevance, which, however, is difficult due to the intrinsic “semantic gap” problem.
Conventionally, the image similarity in content-based retrieval is formulated based on the visual feature matching results.
The general framework of content-based image retrieval. The modules above and below the dashed line are in the offline stage and online stage, respectively.
The dataset consists of two sets of images: the full version, referred to as LF, containing 118,112 images from 586 landmarks, and the clean version (LC) obtained by a SIFT-based matching procedure (described here), with 29,535 images of 586 landmarks. The dataset can be found here (annotations_landmarks.zip). Only a subset of all images could be downloaded due to broken URLs.
Random 12 images taken from The Landmarks dataset
The Oxford Buildings Dataset consists of 5062 images collected from Flickr by searching for particular Oxford landmarks. The collection has been manually annotated to generate a comprehensive ground truth for 11 different landmarks, each represented by 5 possible queries. This gives us a set of 55 queries over which an object retrieval system can be evaluated. The following image shows all 55 queries used to evaluate performance over the ground truth.
All 55 queries used to evaluate performance over the ground truth for Oxford5k
Each image and landmark was assigned one of four possible labels:
Good - A nice, clear picture of the object/building.
OK - More than 25% of the object is clearly visible.
Junk - Less than 25% of the object is visible, or there are very high levels of occlusion or distortion.
Bad - The object is not present.
The large-scale retrieval system can be decomposed into four main blocks:
dense localized feature extraction,
keypoint selection,
dimensionality reduction,
indexing and retrieval.
The focus was on the construction of the DELF model: feature extraction and keypoint selection.
The network architectures used for training: left - Descriptor Fine-tuning (feature extraction), right - Attention-based Training (keypoint selection)
Dense localized feature extraction
Dense features are extracted from an image by applying a fully convolutional network (FCN), which is constructed from the feature extraction layers of a CNN trained with a classification loss. A ResNet50 model was employed, using the output of the conv4_x convolutional block. The obtained feature maps are regarded as a dense grid of local descriptors. Features are localized based on their receptive fields, which can be computed from the configuration of the convolution and pooling layers of the FCN. The pixel coordinates of the center of the receptive field are used as the feature location. The original ResNet50 model trained on ImageNet was used as a baseline and fine-tuned to enhance the discriminativeness of the local descriptors. Since a landmark recognition application is considered, annotated datasets of landmark images are employed, and the network was trained with a standard cross-entropy loss for image classification.
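Since the conv4_x output of ResNet50 sits on a stride-16 grid relative to the input image, each feature's location follows directly from its grid index. A minimal numpy sketch (the `offset` value is illustrative; the exact receptive-field center depends on the padding and kernel sizes of each layer):

```python
import numpy as np

def feature_locations(fmap_h, fmap_w, stride=16.0, offset=7.5):
    # Map each (row, col) of the feature map to the pixel coordinates of
    # its receptive-field center: center = offset + stride * index.
    ys = offset + stride * np.arange(fmap_h)
    xs = offset + stride * np.arange(fmap_w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1)  # shape (fmap_h, fmap_w, 2)
```

These per-location coordinates are what later allow geometric verification between matched features.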
ResNet50 architecture for input images of dimensions 224x224. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2
Instead of using the densely extracted features directly for image retrieval, a technique is designed to select an effective subset of them. Since a substantial fraction of the densely extracted features is irrelevant to the recognition task and likely to add clutter (distracting the retrieval process), keypoint selection is important for both the accuracy and the computational efficiency of the retrieval system. A score function is created to explicitly measure the relevance of each feature extracted from a query image. The score function is implemented as a 2-layer CNN with a softplus activation at the top. For simplicity, convolutional filters of size 1×1 are employed, which work well in practice.
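Because the filters are 1×1, the score function reduces to a small per-location MLP over the channel dimension. A numpy sketch with random toy weights (the hidden ReLU and the layer sizes are assumptions; the text above only specifies two layers and the softplus at the top):

```python
import numpy as np

def attention_scores(features, W1, b1, W2, b2):
    # features: (H, W, C) dense local descriptors.
    # A 1x1 convolution is a matrix multiply applied at every location.
    hidden = np.maximum(features @ W1 + b1, 0.0)   # 1x1 conv + ReLU (assumed)
    logits = hidden @ W2 + b2                      # 1x1 conv, single output
    return np.log1p(np.exp(logits))[..., 0]        # softplus -> (H, W) scores

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 32))            # toy 8x8 feature map, C=32
W1, b1 = rng.standard_normal((32, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)) * 0.1, np.zeros(1)
scores = attention_scores(feats, W1, b1, W2, b2)   # one relevance score per location
```

Keypoint selection then amounts to keeping the locations with the highest scores.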
In the proposed framework, both the descriptors and the attention model are implicitly learned with image-level labels for a classification task. Unfortunately, this poses some challenges to the learning process. While the feature representation and the score function can be trained jointly by backpropagation, experiments have shown that this setup generates weak models. Therefore, a two-step training strategy is employed. First, the descriptors are learned with fine-tuning the original ResNet50, and then the score function is learned given the fixed descriptors. LF (full version of The Landmarks dataset) is used to train the attention model, and LC (clean version of The Landmarks dataset) is employed to fine-tune the network for image retrieval.
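In the second step, the attention model is trained by classifying an attention-weighted sum of the (fixed) local features, so the classification loss drives the score function toward relevant locations. A minimal sketch of that pooling step (variable names are illustrative):

```python
import numpy as np

def attention_pool(features, scores):
    # features: (H, W, C) fixed local descriptors; scores: (H, W) attention.
    # A weighted sum over all locations yields one global descriptor,
    # which is fed to the classifier during attention training.
    return (features * scores[..., None]).sum(axis=(0, 1))

feats = np.ones((4, 4, 8))
scores = np.zeros((4, 4))
scores[0, 0] = 2.0                        # only one location attended
pooled = attention_pool(feats, scores)    # global descriptor, shape (8,)
```

Because the descriptors are frozen in this step, gradients only update the attention weights.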
Feature extraction and matching
Once the attention model is trained, it can be used to assess the relevance of the features extracted by the model. When a query is given, an approximate nearest neighbor search (using a KD-tree) is performed for each local descriptor extracted from the query image, and the top K (K=1000) nearest local descriptors are retrieved. Finally, geometric verification is performed using RANSAC, and the number of inliers is used as the score of each retrieved image.
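A brute-force stand-in for the nearest-neighbor step helps make the matching idea concrete (the real system uses a KD-tree index, and the RANSAC re-ranking is omitted here; all names are illustrative):

```python
import numpy as np

def retrieve(query_descs, db_descs, db_image_ids, k=3):
    # For each query descriptor, find its k nearest database descriptors
    # and tally one vote per owning database image; rank images by votes.
    # (In the full system, RANSAC inlier counts produce the final score.)
    votes = {}
    for q in query_descs:
        dists = np.linalg.norm(db_descs - q, axis=1)
        for idx in np.argsort(dists)[:k]:
            img = db_image_ids[idx]
            votes[img] = votes.get(img, 0) + 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

# Toy database: image 0's descriptors cluster near the query's, image 1's do not.
db = np.vstack([np.zeros((5, 4)), np.ones((5, 4)) * 10])
ids = [0] * 5 + [1] * 5
ranking = retrieve(np.zeros((2, 4)), db, ids)   # image 0 ranked first
```

A KD-tree (or another approximate index) replaces the linear scan when the database holds millions of descriptors.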
Left: query image.
Center: K=1000 localized features with the highest attention score labeled with a white pixel.
Right: K=1000 localized features with the highest attention score, labeled with a white pixel whose value is scaled with the corresponding attention value.
Keypoint matching visualization for a query image and the corresponding image with the highest score (number of inliers) from the Oxford5k image database.
Image retrieval systems have typically been evaluated based on mean average precision (mAP), which is computed by sorting images in descending order of relevance per query and averaging AP of individual queries. AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
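For a binary relevance list, the definition above reduces to the mean of precision@k taken at the ranks where a relevant image appears. A minimal sketch (not the exact Oxford5k protocol, which additionally treats "junk" images specially):

```python
import numpy as np

def average_precision(relevance):
    # relevance: ranked 0/1 labels for one query, best match first.
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # precision@k at every rank, then average only over relevant ranks
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / rel.sum())

def mean_average_precision(rankings):
    # mAP: average of the per-query APs.
    return float(np.mean([average_precision(r) for r in rankings]))
```

For example, `average_precision([1, 0, 1])` is (1/1 + 2/3) / 2 = 5/6: precision is taken at ranks 1 and 3, where the relevant images occur.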
Results on Oxford5k
The table shows the performance evaluation in mAP (%) on the Oxford5k dataset. The original ResNet50 trained on ImageNet was used as a baseline. FT denotes fine-tuning, while ATT corresponds to the attention model. LIFT, siaMAC, and DIR are current SOTA (state-of-the-art) methods. An asterisk (*) denotes methods that were not trained on the same dataset as DELF. The recreated model outperformed the original by a small margin because the authors applied techniques such as PCA and PQ while building their retrieval system, trading a small portion of accuracy for better retrieval times.
To summarize, I presented DELF, a new local feature descriptor that is designed specifically for large-scale image retrieval applications. DELF is learned with weak supervision, using image-level labels only, and is coupled with the attention mechanism for semantic feature selection. In the proposed CNN-based model, one forward pass over the network is sufficient to obtain both keypoints and descriptors.