Bag of Words
Ernest Valveny
ernest@cvc.uab.es
Module 3: Advanced Computer Vision
Master Computer Vision and Artificial Intelligence
The framework
Problem 1: Object Recognition
[Pipeline diagram: (Photometric) Image Formation → Feature Extraction → Image Representation → Learning → Decision (Classification). Image formation was covered in Module II.]
Example: two text documents and their bags of words.

Document 1: "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image."
Bag of words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, time, optical, world, nerve, image, Hubel, Wiesel

Document 2: "China is forecasting a world trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with an 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value."
Bag of words: China, world, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, trade, value, time

Slide credit: Li Fei-Fei
The framework
[Figure: toy example with images labeled A to F and their bag-of-words histograms over a small codebook]
The framework
[Diagram: Feature detection and extraction → Dictionary construction (clustering) → Image representation (visual words) → Learning / Recognition (classification)]
The framework
[Figure: an object and its bag-of-words histogram]
Slide credit: Li Fei-Fei

The framework
[Diagram: Learning: feature detection & representation → codewords dictionary → image representation → category models (and/or) classifiers. Recognition: feature detection & representation → image representation → category decision.]
Slide credit: Li Fei-Fei
The framework
1. Feature detection and extraction
2. Dictionary construction
3. Image representation
4. Classification
Representation
1. Feature detection & representation
2. Codewords dictionary
3. Image representation
Slide credit: Li Fei-Fei
1. Feature detection and representation
Regular grid
- Vogel & Schiele, 2003: 10 × 10 grid; patches cover the whole region.
- Fei-Fei & Perona, 2005: regular grid spaced at 10 × 10 pixels; patch size random between 10 and 30.
Random grid
- 500 randomly located patches; size random between 10 and 30.
Regular grid at several scales
- Perronnin, Dance, et al., 2006: around 500 patches per image.
Interest point detector
- Csurka, et al. 2004: Harris affine detector. Detection of corners (areas with intensity changes in all directions).

1. Feature detection and representation
Interest point detector
- Sivic, et al. 2005: Harris affine detector; maximally stable regions (blobs of high contrast with respect to the surrounding area). (blue: Harris, yellow: MSR)
1. Feature detection and representation
Interest point detector
- Fei-Fei & Perona, 2005:
  - Kadir & Brady saliency detector (based on entropy measures). Between 100 and 200 points. Scale from 10 to 30.
  - Lowe's DoG (SIFT) detector. Difference of Gaussians at different scales (stable points in scale space). Between 100 and 500 points. Scale from 20 to 120.
1. Feature detection and representation
Comparison of feature detection methods
[Nowak et al. 06]: SIFT (LoG), Harris-Laplace and random sampling
[Fei-Fei & Perona, 2005]
1. Feature detection and representation
[Pipeline: detect patches → normalize patch → compute descriptor]
Descriptors
- SIFT [Lowe99]: local histograms of gradient orientations
- Color histograms [Vogel & Schiele, 2003]
- Edge direction histograms [Vogel & Schiele, 2003]
- Gray-level co-occurrence matrix [Vogel & Schiele, 2003]
- SURF [Bay et al., 2008]
Slide credit: Li Fei-Fei
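To make the detection and description step concrete, here is a minimal sketch of dense SIFT extraction with OpenCV; the grid step, the patch sizes, and the helper name are illustrative assumptions rather than the exact setup of any of the papers above.

```python
import cv2
import numpy as np

def dense_sift(gray, step=10, sizes=(10, 20, 30)):
    """Compute SIFT descriptors on a regular grid at several scales.
    The step and the 10-30 px patch sizes mirror the ranges quoted above."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(s))
                 for s in sizes
                 for y in range(step, gray.shape[0] - step, step)
                 for x in range(step, gray.shape[1] - step, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (n_patches, 128)

# Interest-point alternative (Lowe's DoG detector):
# keypoints, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)
```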
2. Dictionary formation
[Figure: local descriptors of the training images plotted in descriptor space]
Slide credit: Li Fei-Fei

2. Dictionary formation
Vector quantization
[Figure: descriptor space partitioned into clusters; each cluster center is a codeword]
Slide credit: Li Fei-Fei
2. Dictionary formation
Clustering
- K-means
- Gaussian Mixture Models
Vocabulary construction
- Unsupervised: universal vocabulary taking features from all classes
  - Simpler, but computationally more expensive
  - Difficult to represent particularities of every class
  - Need to recompute if we add more classes
- Supervised: combination of specific vocabularies for every class, computed taking only features of that class
  - More complex, but faster
  - Redundant words representing common similarities between classes
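A minimal sketch of unsupervised (universal) vocabulary construction with k-means; the use of scikit-learn's MiniBatchKMeans and the vocabulary size k = 1000 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors_per_image, k=1000, seed=0):
    """Universal (unsupervised) vocabulary: stack the local descriptors of
    all training images and cluster them; the k centroids are the codewords."""
    X = np.vstack(descriptors_per_image)          # (total_patches, 128)
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    km.fit(X)
    return km.cluster_centers_                    # (k, 128) codebook
```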
2. Dictionary formation
Adapted vocabularies [Perronnin et al., 2006]
- For each class, form a class-specific vocabulary by merging universal and specific vocabularies:
  - Learn a universal vocabulary using a GMM.
  - Adapt it efficiently to obtain the class-specific vocabulary (MAP).
- Assumption:
  - If an image belongs to a class, it is best described by the adapted vocabulary of this class.
  - If not, it is best described by the universal vocabulary.
2. Dictionary formation
Adapted vocabularies
- First learn a universal vocabulary (MLE): EM algorithm to learn the GMM, alternating E step and M step.
- Then adapt it efficiently to obtain the class-specific vocabulary (MAP), again alternating E step and M step.
[Equations: MLE and MAP E-step/M-step updates]
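As a rough illustration of the MAP step, here is a mean-only adaptation of a universal GMM; the relevance factor r and the restriction to means are simplifying assumptions (the full method also updates weights and covariances).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(universal: GaussianMixture, X, r=16.0):
    """Mean-only MAP adaptation of the universal GMM to the data X of one
    class. The relevance factor r controls how far the means move; r and
    the mean-only restriction are assumptions made for brevity."""
    gamma = universal.predict_proba(X)            # (n, K) posteriors (E step)
    n_k = gamma.sum(axis=0)                       # soft counts per Gaussian
    xbar = gamma.T @ X / np.maximum(n_k, 1e-12)[:, None]
    alpha = (n_k / (n_k + r))[:, None]            # data/prior trade-off
    return alpha * xbar + (1.0 - alpha) * universal.means_
```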
2. Dictionary formation
Adapted vocabularies
For each image and for each class, compute bi-partite histograms (see the sketch below):
- Class-specific vocabulary
- Universal vocabulary
One histogram per class
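A minimal sketch of such a bi-partite histogram, assuming soft assignment with GMM posteriors (described in section 3); the function name is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bipartite_histogram(descriptors, universal: GaussianMixture,
                        adapted: GaussianMixture):
    """Concatenate soft-assignment word frequencies on the class-adapted
    vocabulary and on the universal one: one such histogram per class."""
    h_adapted = adapted.predict_proba(descriptors).mean(axis=0)
    h_universal = universal.predict_proba(descriptors).mean(axis=0)
    return np.concatenate([h_adapted, h_universal])
```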
2. Dictionary formation
[Figures: patch relevance on the Flowers, Sky and Boat categories]

2. Dictionary formation
Size of the vocabulary [Nowak et al. 06]
[Figure: classification performance as a function of vocabulary size]
3. Image representation
[Figure: bag-of-words histogram (frequency of each codeword)]
Slide credit: Li Fei-Fei
3. Image representation
Hard assignment
- Every feature vector is assigned only to the closest word (center of the cluster).
- For each feature vector $x_t$: $w(x_t) = \arg\min_i \lVert x_t - \mu_i \rVert$
- Relative frequency of word $i$ in the histogram: $\text{freq}_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[ w(x_t) = i \right]$
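A direct translation of the hard-assignment rule above into NumPy; the function name is illustrative.

```python
import numpy as np

def bow_histogram_hard(descriptors, codebook):
    """Hard assignment: each descriptor votes only for its nearest codeword;
    the histogram is then normalized to relative frequencies."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                     # index of the closest word
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```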
3. Image representation
Soft assignment
- Every feature vector can be assigned to different clusters:
  - using the distance to the cluster centers in k-means clustering
  - using posterior probabilities in a GMM
- Every word is represented by a Gaussian in the GMM.
- Probability of a feature vector $x$ for a given Gaussian: $p(x \mid i) = \mathcal{N}(x; \mu_i, \Sigma_i)$
- Probability of a feature vector $x_t$ being generated by Gaussian $i$: $\gamma_i(x_t) = \dfrac{w_i\, p(x_t \mid i)}{\sum_{j=1}^{K} w_j\, p(x_t \mid j)}$
- Relative frequency of word $i$ in the histogram: $\text{freq}_i = \frac{1}{T} \sum_{t=1}^{T} \gamma_i(x_t)$
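The soft-assignment histogram maps directly onto scikit-learn's GMM posteriors; a minimal sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bow_histogram_soft(descriptors, gmm: GaussianMixture):
    """Soft assignment: freq_i = (1/T) * sum_t gamma_i(x_t), i.e. the mean
    of the GMM posterior responsibilities over the T descriptors."""
    gamma = gmm.predict_proba(descriptors)        # (T, K) responsibilities
    return gamma.mean(axis=0)
```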
Learning and Recognition
[Diagram: codewords dictionary → category models (and/or) classifiers → category decision]
Slide credit: Li Fei-Fei
4. Learning and Recognition
1. Generative methods
- Naïve Bayes classifier: Csurka, Bray, Dance & Fan, 2004
- Hierarchical Bayesian text models (pLSA and LDA): Sivic et al. 2005, Sudderth et al. 2005, Fei-Fei et al. 2005
2. Discriminative methods
- SVM: Grauman & Darrell, 2005, 2006; Csurka, Bray, Dance & Fan, 2004; Serre & Poggio, 2005
[Figure: class densities $p(x \mid C_1)$ and $p(x \mid C_2)$, and posterior probabilities $p(C_1 \mid x)$ and $p(C_2 \mid x)$]
Slide credit: Li Fei-Fei
4. Learning and Recognition
Categorization by Naïve Bayes: Learning
For each category $C_j$: $P(C_j)$ = number of images of category $C_j$ / total number of images.
For every category $C_j$ and keypoint $V_t$:
$$P(V_t \mid C_j) = \frac{1 + \sum_{I_i \in C_j} n_{it}}{|V| + \sum_{s=1}^{|V|} \sum_{I_i \in C_j} n_{is}}$$
where $n_{it}$ is the number of times keypoint $V_t$ appears in image $I_i$.
Categorization by Naïve Bayes: Recognition
$$P(C_j \mid I) \propto P(C_j)\, P(I \mid C_j) = P(C_j) \prod_{t=1}^{|V|} P(V_t \mid C_j)^{n_t}$$
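These two formulas translate directly into a small NumPy implementation; a sketch, assuming word-count histograms as input:

```python
import numpy as np

def nb_train(hists, labels, n_classes):
    """hists: (n_images, |V|) word counts n_it; labels: category per image.
    Returns log P(C_j) and Laplace-smoothed log P(V_t | C_j)."""
    V = hists.shape[1]
    log_prior = np.zeros(n_classes)
    log_pw = np.zeros((n_classes, V))
    for j in range(n_classes):
        counts = hists[labels == j].sum(axis=0)   # sum over images of C_j
        log_pw[j] = np.log((1.0 + counts) / (V + counts.sum()))
        log_prior[j] = np.log((labels == j).mean())
    return log_prior, log_pw

def nb_classify(hist, log_prior, log_pw):
    """argmax_j [ log P(C_j) + sum_t n_t * log P(V_t | C_j) ]"""
    return int(np.argmax(log_prior + log_pw @ hist))
```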
Discussion
- Intuitive
- Analogy to documents: the text about visual perception (above) reduces to the bag {sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel}.
Slide credit: Li Fei-Fei
Discussion
Intuitive
Analogy to documents
Analogy to human vision
Slide credit: Li Fei-Fei
Discussion
- Intuitive
- Flexibility comes with ignoring geometry
- Compact description, yet rich: local features -> vector
- Usable representation
- Relatively efficient learning
- Yields good results in practice
Slide credit: Li Fei-Fei
Discussion
Weakness of the model
- No rigorous geometric information of the object components.
- It's intuitive to most of us that objects are made of parts; the model carries no such information.
- Not extensively tested yet for viewpoint invariance or scale invariance.
- Segmentation and localization unclear.
Extensions to include spatial info and localization
- Feature level: Savarese, Winn and Criminisi, CVPR 2006
- Generative models: Sudderth, Torralba, Freeman & Willsky, 2005, 2006; Niebles & Fei-Fei, CVPR 2007
- Discriminative methods: Lazebnik, Schmid & Ponce, 2006
- Sliding window search: Lampert, Blaschko, Hofmann, CVPR 2008
Slide credit: Li Fei-Fei
Adding spatial information
Feature level: correlogram features (Savarese, Winn and Criminisi, 2006)
- Based on spatial co-occurrences of words along increasing ranges of distances (kernels).
- Every pixel is labeled with the closest visual word.
- For every pixel, a set of kernels is defined at increasing distances.
- For every pixel and every kernel, a local histogram is computed, counting the number of occurrences of every word in the kernel.
[Figure: local histogram (frequency vs. the T visual words)]
Adding spatial information
Feature level: correlogram features
- Efficient computation with the use of an integral image (see the sketch below).
- Every point of the integral image contains a vector, H(·), with the number of occurrences of every visual word in the rectangle defined from the origin of the image.
- The value of a correlogram for a given pixel and kernel can then be computed from four such vectors.
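A minimal sketch of this integral image of word histograms, assuming a per-pixel word map as input; names and padding convention are illustrative.

```python
import numpy as np

def integral_word_histograms(word_map, n_words):
    """word_map[y, x]: index of the visual word at each pixel. Returns S
    with S[y, x, w] = occurrences of word w above and to the left of (y, x)."""
    h, w = word_map.shape
    S = np.zeros((h + 1, w + 1, n_words))
    ys, xs = np.mgrid[0:h, 0:w]
    S[ys + 1, xs + 1, word_map] = 1.0             # one-hot word image, padded
    return S.cumsum(axis=0).cumsum(axis=1)

def rect_histogram(S, y0, x0, y1, x1):
    """Word counts inside [y0, y1) x [x0, x1) from four H(.) lookups."""
    return S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]
```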
Adding spatial information
Feature level: correlogram features
- For every visual word i and every kernel, we can aggregate the local histograms of all pixels associated to this word.
- Then, for every pair of visual words, a correlogram is defined as the histogram of co-occurrences over all kernels.
[Figure: correlogram between visual words 1 and 2 (frequency vs. kernel radius r)]
Adding spatial information
Feature level: correlogram features
Dictionary of correlograms
- Computation of all correlograms between every pair of visual words for all training images.
- Clustering of all correlograms using k-means to obtain a fixed set of representatives called correlatons.
- For every image, correlograms are computed and assigned to the closest correlaton to obtain a histogram of correlatons.
- Final representation of an image: joint histogram of visual words and correlatons.
Adding spatial information
Feature level: correlogram features
[Results figures: 9-class dataset and 15-class dataset]
Adding spatial information
Generative models (Niebles & Fei-Fei, CVPR 2007)
Definition of a hierarchical model:
- A part layer describing relative locations of parts in the object.
- A feature layer composed of visual words describing the appearance of every part.
- Each feature is assigned to one of the parts.
Adding spatial information
Generative models
Probabilistic definition of the model:
- Y: position of parts
- w: features (relative position and appearance, i.e. visual word)
- h: set of all possible positions of parts
- m: assignment between features and parts
- Part layer: multivariate Gaussian distribution.
- Feature layer: given an assignment between parts and features, a combination of Gaussian distributions of position and appearance for every part.

Adding spatial information
Generative models
- Assignment between features and parts: every word is assigned independently to the part with highest probability according to position and appearance.
- Training of the model: EM algorithm.
Adding spatial information
Discriminative models: Spatial Pyramids (Lazebnik, Schmid, Ponce, 2006)
- Hierarchical computation of histograms at different spatial image decompositions.
- Regular grid at every decomposition level.
- Combination of histograms at all levels of decomposition.

Adding spatial information
Pyramid Match Kernel
Histogram intersection
Slide credit: Kristen Grauman
Adding spatial information
Pyramid Match Kernel
Number of newly matched pairs at level $i$ (histogram intersection $I$, matches at this level minus matches at the finer level):
$$N_i = I\big(H_i(X), H_i(Y)\big) - I\big(H_{i+1}(X), H_{i+1}(Y)\big)$$
The difference in histogram intersections across levels counts the number of new pairs matched.

Over histogram pyramids with levels $i = 0, \ldots, L$, the kernel weights each level inversely proportionally to its bin size (a measure of the difficulty of a match at level $i$):
$$K(X, Y) = \sum_{i=0}^{L} \frac{1}{2^{\,L-i}} \Big[ I\big(H_i(X), H_i(Y)\big) - I\big(H_{i+1}(X), H_{i+1}(Y)\big) \Big]$$
Normalize kernel values to avoid favoring large sets.
Slide credit: Kristen Grauman
Adding spatial information
Pyramid Match Kernel: worked example
- Level 2 (finest): $I_2 = 2$, so $N_2 = 2$ newly matched pairs, weight $1$.
- Level 1: $I_1 = 4$, so $N_1 = 4 - 2 = 2$, weight $\frac{1}{2}$.
- Level 0 (coarsest): $I_0 = 5$, so $N_0 = 5 - 4 = 1$, weight $\frac{1}{4}$.
Pyramid match score: $\frac{1}{4}(1) + \frac{1}{2}(2) + 1\,(2) = 3.25$
Final kernel value: sum over all words $m$.
Slide credit: Kristen Grauman
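The kernel and the worked example translate into a few lines; a sketch, assuming the per-level histograms are already computed:

```python
import numpy as np

def pyramid_match(HX, HY):
    """HX[i], HY[i]: histograms at level i (0 = coarsest, last = finest).
    Implements K = sum_i (1 / 2^(L-i)) * (I_i - I_{i+1}) from above."""
    L = len(HX) - 1
    inter = [np.minimum(HX[i], HY[i]).sum() for i in range(L + 1)]
    inter.append(0.0)                             # nothing finer than level L
    return sum((inter[i] - inter[i + 1]) / 2 ** (L - i) for i in range(L + 1))

# Worked example above: I0 = 5, I1 = 4, I2 = 2
# -> 1/4 * 1 + 1/2 * 2 + 1 * 2 = 3.25
```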
Adding spatial information
Spatial Pyramids: Results
- Scene category database
- Caltech-101 database
- Graz database
Object localization
Sliding windows (Lampert, Blaschko, Hofmann, 2008)
- Sliding windows over the whole image at different scales and positions.
- Definition of a quality bound for every subwindow, based on SVM classification (a bound on the highest SVM score over all possible rectangles contained in the subwindow).
- Branch and bound search:
  - Best-first search: selection of the candidate subwindow with the highest quality bound.
  - The selected candidate subwindow is split in two halves.
  - The search continues until all candidate subwindows have a quality bound lower than the best analyzed rectangle.
Object localization
Sliding windows
Example of quality bound based on an SVM with linear kernel.
The SVM decision function is linear in the histogram of the window; it can be rewritten as a sum of per-feature-point contributions:
$$f(R) = \beta + \sum_{j=1}^{n} \mathbb{1}[x_j \in R]\; w_{c_j}$$
where $n$ is the number of feature points and $c_j$ is the word associated to feature $j$.
The quality bound, for a set of rectangles lying between a smallest rectangle $R_{\min}$ and a largest rectangle $R_{\max}$, adds all positive contributions inside $R_{\max}$ and only the negative ones inside $R_{\min}$:
$$\hat f = \beta + \sum_{x_j \in R_{\max}} \max(w_{c_j}, 0) + \sum_{x_j \in R_{\min}} \min(w_{c_j}, 0)$$
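A sketch of this bound for nested rectangle sets, under the assumption that feature points and their per-word SVM weights are given; this illustrates the idea, it is not the paper's implementation:

```python
import numpy as np

def quality_bound(points, weights, outer, inner, beta=0.0):
    """Upper bound f_hat over all rectangles nested between `inner` and
    `outer` (boxes given as (x0, y0, x1, y1)); `weights` holds the
    per-point SVM weights w_{c_j}."""
    def inside(box):
        x0, y0, x1, y1 = box
        return ((points[:, 0] >= x0) & (points[:, 0] < x1) &
                (points[:, 1] >= y0) & (points[:, 1] < y1))
    pos = weights[inside(outer) & (weights > 0)].sum()  # best-case positives
    neg = weights[inside(inner) & (weights < 0)].sum()  # unavoidable negatives
    return beta + pos + neg
```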
Some examples
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
- F. Perronnin, C. Dance, G. Csurka, M. Bressan. Adapted Vocabularies for Generic Visual Categorization. ECCV 2006.
- Writer identification using Bag of Words
1. Feature detection and representation
Feature detection
Two types of regions around interest points:
- Shape Adapted (SA): ellipses maximizing intensity gradient isotropy over the elliptical region. Centered on corner-like features.
- Maximally Stable (MS): selection of areas from an intensity watershed image segmentation; the area is approximately stationary as the intensity threshold is varied. Blobs of high contrast.
Only regions stable along more than three frames are kept.
Feature representation
- SIFT features for each region
2. Dictionary formation
Clustering
- K-means with Mahalanobis distance
- Independent for SA and MS regions
- Subset of 48 shots
[Figures: clusters for SA regions and for MS regions]
3. Image representation
Weighting of words (tf-idf):
- Word frequency ($n_{id} / n_d$): weights words occurring often in a particular document.
- Inverse document frequency ($N / n_i$): downweights words that appear often in the database.
where
- $n_{id}$: number of occurrences of word $i$ in document $d$
- $n_d$: total number of words in document $d$
- $n_i$: number of occurrences of term $i$ in the whole database
- $N$: number of documents in the whole database
An image is represented by a vector $V_d = (t_1, \ldots, t_k)$ with components $t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$.
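A minimal NumPy sketch of this tf-idf weighting, assuming a document-by-word count matrix:

```python
import numpy as np

def tfidf_vectors(counts):
    """counts: (N, k) matrix of raw word counts n_id over N documents.
    Returns vectors with components t_i = (n_id / n_d) * log(N / n_i)."""
    N = counts.shape[0]
    tf = counts / counts.sum(axis=1, keepdims=True)     # n_id / n_d
    n_i = np.maximum(counts.sum(axis=0), 1.0)           # occurrences of word i
    return tf * np.log(N / n_i)
```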
4. Learning and Recognition
Retrieval system:
- Documents are ranked by the normalized scalar product between the query vector $V_q$ and all document vectors $V_d$ in the database (see the sketch below).
- Retrieved documents are re-ranked according to a measure of spatial consistency between matching words in the query and the retrieved document.
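The ranking step is a cosine similarity; a minimal sketch (the spatial re-ranking is not included):

```python
import numpy as np

def rank_documents(Vq, Vd):
    """Rank database vectors Vd (N, k) by normalized scalar product (cosine
    similarity) with the query vector Vq (k,). Returns indices, best first."""
    q = Vq / np.linalg.norm(Vq)
    D = Vd / np.linalg.norm(Vd, axis=1, keepdims=True)
    return np.argsort(-(D @ q))
```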
Results
Some examples
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
- F. Perronnin, C. Dance, G. Csurka, M. Bressan. Adapted Vocabularies for Generic Visual Categorization. ECCV 2006.
- Writer identification using Bag of Words
1. Feature detection and representation
Feature detection:
- Extract interest points using the Harris affine detector.
Feature representation:
- Attach SIFT descriptors to the interest points. A SIFT descriptor is a 128-dimensional vector.
2. Dictionary formation
- Use the k-means clustering algorithm to form a set of clusters of feature vectors.
- The feature vectors associated with the cluster centers ($V_1, \ldots, V_m$) form a vocabulary.
- Find multiple sets of clusters using different values of $k$; select the $k$ giving the lowest empirical risk in categorization (see the sketch below).
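A sketch of this model selection loop, assuming scikit-learn and cross-validated accuracy as the empirical-risk proxy; the candidate values of k are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def select_vocabulary_size(images_desc, labels, ks=(200, 500, 1000)):
    """For each candidate k: build a vocabulary, encode every image as a
    word-count histogram, and keep the k with the best validated accuracy."""
    all_desc = np.vstack(images_desc)
    best_acc, best_k = -1.0, None
    for k in ks:
        km = MiniBatchKMeans(n_clusters=k, n_init=3).fit(all_desc)
        X = np.array([np.bincount(km.predict(d), minlength=k)
                      for d in images_desc])
        acc = cross_val_score(LinearSVC(), X, labels, cv=3).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
    return best_k
```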
3. Image representation
- Extract keypoint descriptors from the image.
- Put each descriptor in the cluster (bag) with minimum distance from the cluster center.
- Count the number of keypoints in each bag: $n_{ij}$ is the total number of times a feature near $V_j$ occurs in image $I_i$, giving the histogram $(n_{i1}, n_{i2}, \ldots, n_{im})$.
4. Learning and Recognition
Categorization by Naïve Bayes: Learning
For each category $C_j$: $P(C_j)$ = number of images of category $C_j$ / total number of images.
For every category $C_j$ and keypoint $V_t$:
$$P(V_t \mid C_j) = \frac{1 + \sum_{I_i \in C_j} n_{it}}{|V| + \sum_{s=1}^{|V|} \sum_{I_i \in C_j} n_{is}}$$
where $n_{it}$ is the number of times keypoint $V_t$ appears in image $I_i$.
Categorization by Naïve Bayes: Recognition
$$P(C_j \mid I) \propto P(C_j)\, P(I \mid C_j) = P(C_j) \prod_{t=1}^{|V|} P(V_t \mid C_j)^{n_t}$$
4. Learning and Recognition
Categorization by SVM: Training
- For an image of category $C_i$, $x_i$ is a vector formed by the number of occurrences of each keypoint $V$ in the image.
- For the $m$-class problem, $m$ SVMs are trained, each to distinguish one category $C_i$ from the other $m-1$.
Categorization by SVM: Testing
- Given a query image, assign it to the category with the highest SVM output (see the sketch below).
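A minimal sketch of the one-vs-all training and testing with scikit-learn; the linear SVM and the C value are assumptions:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_one_vs_all_svm(train_hists, train_labels):
    """m linear SVMs for the m-class problem; prediction picks the category
    with the highest SVM output (handled by OneVsRestClassifier)."""
    return OneVsRestClassifier(LinearSVC(C=1.0)).fit(train_hists, train_labels)

# usage: clf = train_one_vs_all_svm(X_train, y_train)
#        predictions = clf.predict(X_test)
```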
Results
[Figures: Naïve Bayes and SVM results]
Results
Some examples
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
- F. Perronnin, C. Dance, G. Csurka, M. Bressan. Adapted Vocabularies for Generic Visual Categorization. ECCV 2006.
- Writer identification using Bag of Words
1. Feature detection and representation
Feature detection
- Regular grid to extract patches at different scales, capturing characteristic and distinctive information about the represented scene or objects.
- Images resized and normalized before feature detection.
- Per image: ~500 patches.
- Better performance with respect to both accuracy and speed.
Feature description
- Feature descriptors: SIFT, computed at every patch (cell of the grid).
Feature representation
- Per patch: one 128-dimensional descriptor, reduced to 50 dimensions using PCA (see the sketch below).
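The 128-to-50 PCA reduction in a minimal scikit-learn sketch; the function name is illustrative:

```python
from sklearn.decomposition import PCA

def reduce_descriptors(train_desc, dim=50):
    """Fit PCA on the (n, 128) training SIFT descriptors and project them
    to `dim` dimensions, as in the 128 -> 50 reduction described above."""
    pca = PCA(n_components=dim).fit(train_desc)
    return pca.transform(train_desc), pca   # reuse `pca` for test descriptors
```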
2. Dictionary formation
- Visual words represented by a GMM.
- Class-specific vocabularies:
  - Learn a universal vocabulary.
  - Adapt it efficiently to obtain the class-specific vocabulary (MAP).
- For each class, form a new vocabulary by merging the universal and adapted vocabularies.
- Assumption:
  - If an image belongs to a class, it is best described by the adapted vocabulary of this class.
  - If not, it is best described by the universal vocabulary.
2. Dictionary formation
- First learn a universal vocabulary (MLE): EM algorithm to learn the GMM.
- Then adapt it efficiently to obtain the class-specific vocabulary (MAP).
3. Image representation
For each image and for each class, compute bi-partite histograms:
- Class-specific vocabulary
- Universal vocabulary
One histogram per class
4. Learning and Recognition
Classification: SVM (one vs. all)
- One classifier per class.
- Each classifier is fed with different (class-specific) histograms.
Results
[Figures: patch relevance on the Flowers, Sky and Boat categories]
Results
Some examples
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
- F. Perronnin, C. Dance, G. Csurka, M. Bressan. Adapted Vocabularies for Generic Visual Categorization. ECCV 2006.
- Writer identification using Bag of Words
Writer identification in musical scores
1. Feature detection and representation
Feature detection
- Pre-processing
- Each connected component is an interest point
Feature representation
- Shape descriptor: Blurred Shape Model
2. Dictionary formation
Clustering
- GMM
- Supervised (128 words) vs. unsupervised (16 words/class)
- Histogram of words obtained with soft assignment
- Classification with SVM
Results
References
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray. Visual Categorization with Bags of Keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV 2004.
- F. Perronnin, C. Dance, G. Csurka, M. Bressan. Adapted Vocabularies for Generic Visual Categorization. ECCV 2006.
- J. Vogel, B. Schiele. Semantic Modeling of Natural Scenes for Content-Based Image Retrieval. International Journal of Computer Vision, 2006.
- L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. CVPR 2005.
- J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. Technical Report A.I. Memo 2005-005, Massachusetts Institute of Technology, 2005.
- S. Lazebnik, C. Schmid, J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.
- C. H. Lampert, M. B. Blaschko, T. Hofmann. Beyond Sliding Windows: Object Localization by Efficient Subwindow Search. CVPR 2008.
- E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. ICCV 2005.
- K. Grauman, T. Darrell. The Pyramid Match Kernel: Efficient Learning with Sets of Features. Journal of Machine Learning Research, 2008.
- T. Serre, L. Wolf, T. Poggio. Object Recognition with Features Inspired by Visual Cortex. CVPR 2005.
- J. Winn, A. Criminisi, T. Minka. Object Categorization by Learned Universal Visual Dictionary. ICCV 2005.
- J. C. Niebles, L. Fei-Fei. A Hierarchical Model of Shape and Appearance for Human Action Classification. CVPR 2007.
- D. Lowe. Object Recognition from Local Scale-Invariant Features. ICCV 1999.
- E. Nowak, F. Jurie, B. Triggs. Sampling Strategies for Bag-of-Features Image Classification. ECCV 2006.
- S. Savarese, J. Winn, A. Criminisi. Discriminative Object Class Models of Appearance and Shape by Correlatons. CVPR 2006.
- H. Bay, A. Ess, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), 2008.