
International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882

Volume 3, Issue 3, June 2014

Automatic Image Annotation by Classification Using SIFT Features


Kalyani B. Bobade (1), Shital V. Jagtap (2)

(1) Department of Computer Engineering, Vishwakarma Institute of Technology, Pune
Email: kalyani.2341@gmail.com

(2) Department of Computer Engineering, Vishwakarma Institute of Technology, Pune
Email: svjagtap@gmail.com

ABSTRACT
With the growth of digital technologies, ever-increasing amounts of visual data are created and stored. This explosive growth of image data has driven the development of content based image retrieval (CBIR). However, research in this area shows that there is a semantic gap between image semantics and content based image retrieval. Automatic image annotation (AIA) is therefore used to bridge this gap. In this paper, we focus on an automatic image annotation technique that extracts features using the Scale Invariant Feature Transform (SIFT). The SIFT algorithm is employed to extract features and keypoints, and visual words are then constructed using k-means clustering. Annotation accuracy depends on accurate feature detection and matching; the SIFT technique provides a robust and accurate basis for automatic image annotation.
Keywords - AIA (automatic image annotation), BoW (bag-of-words model), DoG (difference of Gaussian), SIFT (scale invariant feature transform), visterms (visual words)

1. INTRODUCTION
With the advent of digital technologies, the number of digital images has been growing rapidly, and there is a need for effective and efficient tools to find visual information. A huge amount of information is available, and every day gigabytes of visual information are generated, transmitted and stored. A large amount of research has been carried out in the image retrieval area. Systems using non-textual (image) queries have been proposed, but many users find it hard to represent their information needs using abstract image features. Most users prefer textual queries, and this has usually been achieved by manually providing keywords or captions and searching over these captions using a text query.


Manual annotation is an expensive and tedious
procedure and most images are never likely to be
captioned in this way.
Automatic image annotation (also known as automatic image tagging) is the process by which a computer system automatically assigns keywords to a given digital image; it is used in image retrieval systems to locate the images a user queries for. In this work, we focus on image annotation that extracts feature vectors and learns the correlations between image features and training annotations via machine learning techniques. It then automatically applies annotations to new images based on the "visual word" information. The advantage of automatic image annotation over content-based image retrieval is that queries can be specified more naturally by the user. The traditional methods of image retrieval, such as those used by libraries, have relied on manually annotated images, which is expensive and time-consuming, especially given the large and constantly growing image databases in existence. In this work, we develop an algorithm to perform automatic image annotation for a set of images.
In this paper, we introduce the use of the Scale Invariant Feature Transform for automatic image annotation. The rest of the paper is organized as follows. Section 2 reviews related work, section 3 gives an overview of the learning system, section 4 describes the process of computing the SIFT descriptor, section 5 gives details of visual word formation, section 6 describes the tag assignment process for automatic image annotation, and section 7 presents the experimental results. The remaining sections contain the conclusion and the references.

2. RELATED WORK
In this section, we review some of the popular
methods for automatic image annotation. Automatic



image annotation is regarded as a type of multi-class image classification. It has recently been suggested that methods which use region-based image descriptors, generated by automatic segmentation or through fixed shapes, may lead to poor performance because they do not give a strong feature set. Even block-based segmentation fails to give accurate results, since the resulting features are not robust to scale, rotation and illumination changes.
Recently, graph-based semi-supervised learning has
attracted much attention in both machine learning and
multimedia retrieval communities. Graph-based semi-supervised learning is essentially a label propagation process [Tang et al. 2007]. The most typical approaches include the
Gaussian random fields and harmonic functions method
[Zhu et al. 2003], as well as the local and global
consistency method [Zhou et al. 2003]. However, they
both have the disadvantage of the requirement to tune
certain parameters.
Another popular method is the linear neighborhood
propagation [Wang and Zhang 2008,2], in which the
sample reconstruction method is used to construct a
graph. It has been shown that in most cases, linear
neighborhood propagation is more effective and robust
than the traditional semi-supervised methods on
similarity graphs [Wang and Zhang 2008,2][Tang et al.
2008]. However, it still cannot handle the links among
semantically-unrelated samples. A more detailed survey
on semi-supervised learning can be found in [Zhu 2005].
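For reference, the local and global consistency method cited above iterates the update F <- alpha*S*F + (1 - alpha)*Y over a normalized similarity graph of the images. The following is a minimal sketch of that update only (the function name and inputs are illustrative, not taken from the paper):

import numpy as np

def propagate_labels(W, Y, alpha=0.99, iterations=100):
    # W: (n, n) affinity matrix between images; Y: (n, c) one-hot label matrix
    # with all-zero rows for unlabeled images. Returns soft label scores F.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt          # symmetrically normalized graph
    F = Y.astype(float).copy()
    for _ in range(iterations):
        F = alpha * S @ F + (1 - alpha) * Y  # propagate, then clamp to known labels
    return F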

Duygulu et al. [3] described images using a vocabulary of blobs. First, regions are created using a segmentation algorithm such as normalized cuts. For each region, features are computed, and blobs are then generated by clustering the image features for these regions across images. Each image is represented using a certain number of these blobs. Their translation model applies one of the classical statistical machine translation models to translate from the set of blobs forming an image to the set of keywords of an image.
Recently in [2,3], it was suggested that methods that use region-based image descriptors generated by automatic segmentation or through fixed shapes may lead to poor performance, as the regularly used rectangular-region image descriptors are not robust to a variety of transformations such as rotation. They suggested instead using Scale Invariant Feature Transform (SIFT) [5] features, which are scale invariant, and utilizing them as visual terms in a document. We then have a bag-of-visterms model for each image, and this can be processed in a similar fashion to bag-of-words models for text documents [6, 7].
In this paper we use the LabelMe database, which has annotation files attached to the images. For each image, the database provides the annotations assigned to that image together with their coordinates. The LabelMe database is large and can therefore be used for training; a large dataset also helps us to measure the performance of the implemented work across the experiments performed.

3. LEARNING SYSTEM OVERVIEW


Before turning to the details of automatic image annotation using the SIFT technique, this section gives an overview of our proposed system. The overall system flow is illustrated in Fig.1. Given a set of images, for each image we extract visual and textual features. The visual feature extraction is done using the SIFT descriptor, which is described in a later section of this paper. From the extracted features a visual word dictionary is formed, which supports image feature matching. Subsequently, an algorithm is applied to assign tags to the untagged images.


Fig.1 Learning system overview



Thus, to build the word and image correlation model, we make use of visual word formation from the features detected with the help of SIFT, and we train on the visual words and the objects in order to map tags to, or annotate, the set of untagged images. In our implementation we have focused on automatically assigning tags to images and on improving the performance of the automatic image annotation method.

4. SIFT IMPLEMENTATION
The Scale Invariant Feature Transform (SIFT) is used for extracting distinctive invariant features from images; the features are invariant to image scale and rotation. The method was proposed by David Lowe in 1999 and has been applied in many areas. Fig.2 shows the steps involved in detecting image features using SIFT [4].

Fig.2. Steps followed for computing the SIFT descriptor (flowchart: image -> scale space image representation -> keypoint computation by DoG -> contrast based edge filter -> keypoint orientation -> SIFT descriptor).

There are several stages of computation used to find the set of features. They are as follows:
1. Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.
2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.
3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.
4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

4.1 Scale space extrema detection
Keypoints are detected using a cascade filtering approach that uses efficient algorithms to identify candidate locations. The SIFT detector and descriptor are constructed from the Gaussian scale space of the source image. The scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

L(x, y, σ) = G(x, y, σ) * I(x, y)

where * is the convolution operation in x and y.
To efficiently detect stable keypoint locations in scale space, Lowe (1999) proposed using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y) = L(x, y, kσ) - L(x, y, σ)



Fig.3 For computing the difference-of-Gaussian, the image in each octave of scale space is repeatedly convolved with Gaussians to produce the set of scale space images, which are then subtracted to produce the difference-of-Gaussian images. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process is repeated.

The difference-of-Gaussian is a particularly efficient function to compute, as the smoothed images, L, need to be computed in any case for scale space feature description, and D can therefore be computed by simple image subtraction, as stated in [5].

4.2 Extrema detection
This stage finds the extrema points in the DoG pyramid. In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared to its eight neighbors in the current image and its nine neighbors in each of the scales above and below, i.e. 26 neighbors in total. A point is selected only if it is larger than all of these neighbors or smaller than all of them. The cost of this check is reasonably low because most sample points are eliminated after the first few comparisons. Fig.4 shows how sample points are considered and compared [5].

Fig.4 Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X) to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked with circles).
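To make these two stages concrete, the following is a minimal sketch (not the authors' code) that builds one octave of the Gaussian/DoG stack with OpenCV and flags samples that are at least as large (or as small) as all 26 neighbors:

import cv2
import numpy as np

def dog_extrema(gray, sigma=1.6, k=2 ** 0.5, levels=5):
    # Minimal sketch: one octave of the DoG pyramid and its 26-neighbor extrema.
    gray = gray.astype(np.float32) / 255.0
    # Gaussian scale space L(x, y, sigma * k^i) for one octave.
    gaussians = [cv2.GaussianBlur(gray, (0, 0), sigma * (k ** i)) for i in range(levels)]
    # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma): simple image subtraction.
    dog = np.stack([gaussians[i + 1] - gaussians[i] for i in range(levels - 1)])
    extrema = []
    for s in range(1, dog.shape[0] - 1):          # skip the first and last DoG slices
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                patch = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                center = dog[s, y, x]
                if center == patch.max() or center == patch.min():
                    extrema.append((x, y, s))
    return dog, extrema

The triple loop is written for clarity rather than speed; in practice one would also apply the low-contrast threshold on |D| and the edge-response test before accepting a keypoint, as described next.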

4.3 Keypoint Localization
Once the candidate keypoint locations are found, they need to be refined to obtain more accurate results. Keypoint localization therefore performs a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or that are poorly localized along an edge.
The Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted so that the origin is at the sample point, is

D(x) = D + (∂D/∂x)^T x + (1/2) x^T (∂²D/∂x²) x

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)^T is the offset from this point. The location of the extremum, x̂, is determined by taking the derivative of this function with respect to x and setting it to zero, giving

x̂ = -(∂²D/∂x²)^(-1) (∂D/∂x)

The Hessian and derivative of D are approximated by using differences of neighboring sample points. The resulting 3x3 linear system can be solved with minimal cost. If the offset x̂ is larger than 0.5 in any dimension, it means that the extremum


lies closer to a different sample point; in that case the sample point is changed and the interpolation is performed about that point instead. In Lowe's experiments, all extrema with a value of |D(x̂)| less than 0.03 were discarded (assuming image pixel values in the range [0, 1]) [5].
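A compact sketch of this refinement step (illustrative only, using finite differences on the dog array produced in the earlier sketch and assuming the extremum is not on a border):

import numpy as np

def refine_keypoint(dog, s, y, x):
    # Sub-pixel/sub-scale refinement of a DoG extremum via a 2nd-order Taylor fit.
    # dog: (scales, rows, cols) array; (s, y, x) is a detected extremum.
    D = dog.astype(np.float32)
    # First derivatives by central differences (offset vector order: x, y, scale).
    dD = 0.5 * np.array([D[s, y, x + 1] - D[s, y, x - 1],
                         D[s, y + 1, x] - D[s, y - 1, x],
                         D[s + 1, y, x] - D[s - 1, y, x]])
    # Hessian by finite differences of neighboring sample points.
    dxx = D[s, y, x + 1] - 2 * D[s, y, x] + D[s, y, x - 1]
    dyy = D[s, y + 1, x] - 2 * D[s, y, x] + D[s, y - 1, x]
    dss = D[s + 1, y, x] - 2 * D[s, y, x] + D[s - 1, y, x]
    dxy = 0.25 * (D[s, y + 1, x + 1] - D[s, y + 1, x - 1] - D[s, y - 1, x + 1] + D[s, y - 1, x - 1])
    dxs = 0.25 * (D[s + 1, y, x + 1] - D[s + 1, y, x - 1] - D[s - 1, y, x + 1] + D[s - 1, y, x - 1])
    dys = 0.25 * (D[s + 1, y + 1, x] - D[s + 1, y - 1, x] - D[s - 1, y + 1, x] + D[s - 1, y - 1, x])
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])
    offset = -np.linalg.solve(H, dD)           # x_hat = -(d2D/dx2)^(-1) (dD/dx)
    contrast = D[s, y, x] + 0.5 * dD @ offset  # value of D at the interpolated extremum
    # In the full algorithm an offset larger than 0.5 triggers re-interpolation at the
    # neighboring sample point; here we simply flag whether the keypoint is kept.
    keep = np.all(np.abs(offset) <= 0.5) and abs(contrast) >= 0.03
    return offset, contrast, keep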
4.4 Keypoint Orientation and descriptor
An orientation is assigned to each candidate keypoint to achieve invariance to image rotation. To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint. The contribution of each neighboring pixel is weighted by its gradient magnitude and by a Gaussian window with σ equal to 1.5 times the scale of the keypoint. Once a keypoint orientation has been selected, the feature descriptor is computed as a set of orientation histograms over 4 × 4 pixel neighborhoods.
The orientation histograms are expressed relative to the keypoint orientation, and the orientation data comes from the Gaussian image closest in scale to the keypoint's scale. Each histogram contains 8 bins, and each descriptor contains a 4 × 4 array of such histograms around the keypoint. This leads to a SIFT feature vector with 4 × 4 × 8 = 128 elements. This vector is normalized to enhance invariance to changes in illumination.

Fig.5 Finding the keypoint descriptors (16 histograms × 8 orientations = 128 features).
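For reference, the detector and descriptor pipeline described above is available in OpenCV; the following is a minimal sketch (the image file name is a placeholder):

import cv2

# Minimal sketch: detect SIFT keypoints and compute 128-dimensional descriptors
# with OpenCV (requires opencv-python >= 4.4, where SIFT is included).
image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), "keypoints, descriptor shape:", descriptors.shape)  # (n, 128)

# Each keypoint carries the location, scale and orientation assigned above.
for kp in keypoints[:3]:
    print(kp.pt, kp.size, kp.angle)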
5. VISUAL WORD REPRESENTATION
The visual word representation is also called the bag-of-words (BoW) model, also known as the bag of features or bag of visterms model, for object recognition. The bag-of-words model was first proposed in the text retrieval domain for text recognition. In image analysis, the bag-of-words model is based on vector quantization, performed by clustering low-level features of local regions or points [6].
In the BoW model an image can be treated as a document, and the features extracted from the image are considered as the visual words. Constructing the bag-of-words model from an image involves the following steps (a code sketch of steps 3 and 4 is given at the end of this section):
1. Automatically detect regions or points of interest.
2. Compute local descriptors over those regions.
3. Quantize the descriptors into words to form the visual vocabulary.
4. Count the occurrences in the image of each word in the vocabulary to construct the BoW histogram.
Fig.6 gives an overview of the steps involved in bag-of-words formation. The bag-of-words model is mainly designed for local descriptors of images, which describe the regions around keypoints detected in the images. In contrast to global features, which describe a picture in a holistic way, one image can yield a collection of salient patches around its keypoints.
Since the annotation accuracy is heavily dependent on the feature representation, we have used SIFT for extracting features, as SIFT provides a flexible representation that allows us to match images invariant of scale, illumination, rotation, etc. However, matching such local descriptors directly against a query image becomes too time-consuming. Therefore, BoW is used to solve this problem by quantizing the descriptors into visual words. SIFT produces keypoint descriptors, which are then clustered using the k-means clustering approach. We have used a large number of clusters relative to the number of local features obtained from the training data.
The k-means algorithm clusters the feature vectors into k clusters. Among the feature vectors belonging to one training image, we can then count the feature vectors falling in each cluster. Each cluster can be conceptually understood as a visual word. Therefore, for each training image, there is a visual word distribution associated with the image. In effect, the visual word representation is a representation of the contents or objects of the image. In addition to the visual word distribution, there is a collection of tags assigned to each image.
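As referenced in the step list above, the following is a minimal sketch of the vocabulary construction and histogram steps, using scikit-learn's k-means on the SIFT descriptors extracted earlier (the cluster count and names are illustrative):

import numpy as np
import cv2
from sklearn.cluster import KMeans

def build_vocabulary(images, k=1000):
    # Cluster SIFT descriptors from all training images into k visual words.
    sift = cv2.SIFT_create()
    all_descriptors = []
    for img in images:                       # images: list of grayscale arrays
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            all_descriptors.append(desc)
    all_descriptors = np.vstack(all_descriptors).astype(np.float32)
    return KMeans(n_clusters=k, n_init=3, random_state=0).fit(all_descriptors)

def bow_histogram(img, vocabulary):
    # Quantize one image's descriptors and count the occurrences of each visual word.
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(img, None)
    k = vocabulary.n_clusters
    if desc is None:
        return np.zeros(k)
    words = vocabulary.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                 # normalized visual word distribution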



Fig.6 Four steps for constructing the bag-of-words image representation: 1. keypoint detection (detect keypoints), 2. feature extraction (compute descriptors), 3. vector quantization (cluster descriptors), 4. bag of words (build histograms).
6. TAG ASSIGNMENT
This section describes the complete version of our implementation for automatic image annotation. It measures the correlation between the visual words and the labels (tags) and automatically annotates images. The effectiveness of the implementation is evaluated using a large set of training and testing images.
First, we build the visual word distributions for all training images, so that labels can be assigned to these visual word distributions. Since the labels need to be learned from the visual word distributions of all the training images, each training image is introduced to all the labels, which are stored in an annotation file for the training images.
Before training, the labels have no knowledge of the visual word distribution. Thus, from the annotated dataset we map the objects in an image to the label set and maintain a visual word and label correlation. The visual word and label information is then trained to obtain the correlation across the training set of images.
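The paper does not pin the correlation model down to a single formula; one simple way to realize the word-to-label correlation described above is to accumulate, for every tag, the average visual word distribution of the training images carrying that tag, and to annotate a new image with the tags whose profiles best match its own histogram. The sketch below illustrates this idea (an illustrative choice, not necessarily the exact formulation used in our implementation):

import numpy as np

def learn_tag_profiles(histograms, tag_lists):
    # histograms: (n_images, k) BoW histograms; tag_lists: list of tag sets per image.
    # Returns a mean visual word distribution (profile) for every tag.
    profiles = {}
    for hist, tags in zip(histograms, tag_lists):
        for tag in tags:
            profiles.setdefault(tag, []).append(hist)
    return {tag: np.mean(rows, axis=0) for tag, rows in profiles.items()}

def annotate(histogram, profiles, n_tags=5):
    # Assign the n_tags tags whose word profiles correlate best with the image.
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = {tag: cosine(histogram, profile) for tag, profile in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_tags]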
7. EXPERIMENTAL RESULTS
All of the experiments were based on the LabelMe dataset [8], in which each image is represented by high-dimensional visual and textual features. From this dataset we consider 1200 ground truth images. We extracted the visual and textual features in the following ways. We took visual words as the visual features for similarity computation. For visual word formation we adopted SIFT descriptors. The SIFT features were then quantized into clusters using k-means clustering, where each cluster defines a visual word containing the feature descriptors, i.e. the feature points in that cluster.
For the textual features we obtained the data from the LabelMe dataset, which supplies an XML file holding the set of tags. We extracted the textual features by reading the coordinate values and the annotation for those particular coordinates, which represent an object in an image. The retrieved information was then stored in a separate file containing all the textual data. Because of the large database, we have made use of the Lucene tool for indexing and searching; it is highly optimized and highly modularized. The BM25 ranking of Lucene has been used to improve the search results.
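A minimal sketch of how such an annotation file can be read (assuming the standard LabelMe XML layout with object, name and polygon/pt nodes; the file path is a placeholder):

import xml.etree.ElementTree as ET

def read_labelme_annotations(xml_path):
    # Return (tag, polygon) pairs from a LabelMe-style annotation file.
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        name = obj.findtext("name", default="").strip()
        polygon = [(float(pt.findtext("x")), float(pt.findtext("y")))
                   for pt in obj.findall("polygon/pt")]
        objects.append((name, polygon))
    return objects

# Example (placeholder path):
# for tag, polygon in read_labelme_annotations("annotations/img_0001.xml"):
#     print(tag, polygon[:3])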
To evaluate the annotation performance, we retrieve images using keywords from the vocabulary based on ranked retrieval. We can judge the retrieved results by comparing them with the actual annotation set assigned to each image. Hence, to show the performance of our proposed work we have made use of precision and recall. The precision is the number of correctly retrieved images divided by the number of retrieved images. The recall is the number of correctly retrieved images divided by the number of relevant images in the test dataset. Given a ranked set of items in response to a query, we compute precision and recall as:



Precision = (number of correctly retrieved images) / (number of retrieved images)

and

Recall = (number of correctly retrieved images) / (number of relevant images in the test dataset)

We are then able to compute the F-measure, which is the harmonic mean of precision and recall. It is computed as:

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Fig.7 Results showing the original annotations compared with the automatic annotations for a street-scene example (tags such as car side, road, wheel, building, window, sidewalk, tree, fence).

The precision and recall results for the above example are: Precision 0.60, Recall 2.0, F-measure 0.933.

From the above experimental results, the annotations obtained are more accurate than those of the other methods. Thus, our implementation can be used on a large image dataset and is an efficient method for automatic image tagging.
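A minimal sketch of these three measures for a single query (illustrative; retrieved and relevant are placeholder sets of image identifiers):

def precision_recall_f(retrieved, relevant):
    # retrieved: ids returned for a keyword query; relevant: ground-truth ids.
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)                  # correctly retrieved images
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example with placeholder ids:
print(precision_recall_f(["img1", "img2", "img3"], ["img1", "img3", "img4", "img5"]))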

8. CONCLUSION AND FUTURE WORK

We have presented a new method for automatic image annotation which unifies the visual and textual information from a large set of images. Our framework uses a preprocessing step to remove noise and unwanted data and to achieve the best performance by extracting a strong feature set with the use of SIFT. We have also presented image-to-image and image-to-label correlation models so as to address the weak labeling problem. Experimental results have shown that our algorithm can operate on a large-scale image dataset while effectively inferring the image labels. Future work includes the development of learning on large-scale audio and video datasets.



9. REFERENCES
1. Li, X., Chen, L., Zhang, L., Lin, F., and Ma, W.-Y., Image annotation by large-scale content-based image retrieval, 2006.
2. Wang, X.-J., Zhang, L., Jing, F., and Ma, W.-Y., AnnoSearch: Image auto-annotation by search. In IEEE Conference on Computer Vision and Pattern Recognition, New York, USA, 2006.
3. Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan, Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2003.
4. Wikipedia, Scale-invariant feature transform, Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Scale-invariant_feature_transform&oldid=304881559.
5. D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
6. Chih-Fong Tsai, Bag-of-words representation in image annotation: a review, ISRN Artificial Intelligence, vol. 2012.
7. Christian Hentschel, Sebastian Stober, Andreas Nürnberger and Marcin Detyniecki, Automatic image annotation using a visual dictionary based on reliable image segmentation, LNCS 4918, pp. 45-56, 2008.
8. B. Russell, A. Torralba, K. Murphy, and W. T. Freeman, LabelMe: a database and web-based tool for image annotation, International Journal of Computer Vision, 2007.

