
Spatio-temporal features

Actions, Activities, Events

Actions are short, task-oriented body movements such as waving a hand or drinking
from a bottle. Some actions are atomic, but often the actions of interest have a cyclic
nature, such as walking or running.

Activities involve multiple people or happen over longer timeframes. Activities are often the
result of a combination of actions, such as taking money out of an ATM or waiting for a bus.

We often refer to an Event as a combination of activities, usually involving more people
and happening in a given context, such as a soccer match, a car accident or a fire in a
wood.

None of these are rigorous definitions.

Computer vision grand challenge: video understanding


[Figure: video understanding requires recognising, over the same footage, scene categories (indoors, outdoors, street scene, countryside, etc.), geometry (street, wall, field, stair, etc.), objects (cars, glasses, people, candles, buildings, etc.), actions (drinking, running, exiting through a door, a car entering, a car crash, etc.) and higher-level events such as a kidnapping, together with the constraints relating them.]

Requirements for action recognition

A generic action recognition framework needs a sufficiently robust representation, so that
classifiers can concentrate on the truly discriminant spatio-temporal features and are not
distracted by clutter or other irrelevant intra-class variations. Intra-class variation is due to
many factors:

Person appearance variation due to gender, clothing, body posture and size.

Camera parameters, scene clutter and illumination.

Camera motion, which needs to be either removed via motion compensation or handled with
robust representations.

A robust representation is able to remove all the noisy features (clothing, gender,
illumination, scale, etc.) and preserve the variability of the body motion involved in
different actions.

Action representation

Actions can be described following different approaches:

Holistic representations: each action is represented by a vector of features.


Local representations: each action is represented with a set of feature vectors.
Feature fusion/context modelling: each action is represented with a fusion of multiple
diverse features also representing the context of the action.

Holistic representation: Motion History Images

Perform image differencing to detect motion, optionally with background subtraction. Let $D(x,y,t)$ be the binary image of detected motion at time $t$.

The Motion Energy Image (MEI) is a binary image defined as follows:

$$E_\tau(x,y,t) = \bigcup_{i=0}^{\tau-1} D(x,y,t-i)$$

it describes WHERE the motion happens.

The Motion History Image (MHI) is a real-valued image defined as follows:

$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t)=1 \\ \max\bigl(0,\ H_\tau(x,y,t-1)-1\bigr) & \text{otherwise} \end{cases}$$

it describes HOW the motion happens.
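A minimal NumPy/OpenCV sketch of the recursive MHI update above, using plain frame differencing (no background subtraction); the threshold and the temporal window tau are illustrative values, not the original paper's settings:

```python
import cv2
import numpy as np

def motion_history(frames, tau=30, diff_thresh=25):
    """Compute the MEI and MHI from a list of grayscale uint8 frames
    by thresholded frame differencing. tau is the temporal window in
    frames; diff_thresh binarises the difference image."""
    h, w = frames[0].shape
    mhi = np.zeros((h, w), dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Binary motion mask D(x, y, t) from thresholded differencing
        d = (cv2.absdiff(curr, prev) > diff_thresh).astype(np.float32)
        # MHI update: set to tau where motion occurs, otherwise decay by 1
        mhi = np.where(d > 0, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = (mhi > 0).astype(np.uint8)  # MEI: union of motion over the last tau frames
    return mei, mhi
```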

Motion History Image descriptors

The Motion History Image can be described synthetically through image moments of different
order.

[A.F. Bobick and J.W. Davis, IEEE TPAMI 2001]

Hu moments: recall

Given a distribution (image intensity), the moments of order p,q are defined as:

$$m_{pq} = \sum_x \sum_y x^p\, y^q\, I(x,y)$$

Central moments are translation invariant and defined in terms of moments:

$$\mu_{pq} = \sum_x \sum_y (x-\bar{x})^p\, (y-\bar{y})^q\, I(x,y), \qquad \bar{x} = m_{10}/m_{00},\ \bar{y} = m_{01}/m_{00}$$

In order to obtain scale invariance we define the normalized central moments:

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}}$$

Rotation invariance is obtained by combining them into the Hu moments. The first four Hu moments are defined as:

$$\phi_1 = \eta_{20} + \eta_{02}$$
$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$
$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$
$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$
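OpenCV computes these quantities directly; a short sketch, assuming the MHI has been saved as a grayscale image (the filename is a placeholder):

```python
import cv2
import numpy as np

# cv2.moments returns the raw, central and normalised central moments;
# cv2.HuMoments combines them into the seven rotation-invariant values.
mhi = cv2.imread("mhi.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
m = cv2.moments(mhi)                 # dict with m_pq, mu_pq, nu_pq
hu = cv2.HuMoments(m).flatten()      # seven Hu invariants
# A log transform is commonly applied because the values span many orders of magnitude.
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
print(hu_log[:4])                    # first four invariants
```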

Example
Aerobic dataset: 18 moves.
A NN classifier using the Mahalanobis distance achieves 66% accuracy.

Holistic approaches summary

Simple and fast solution: works very well in controlled settings.


Prone to errors in background subtraction: variations in light, shadows, clothing (what is the background here?).

Does not capture interior motion and shape: the silhouette alone tells little about the action.

Space-time local features

A more useful and effective approach is to extract local features at space-time interest points and
encode the temporal information directly into the local feature. This results in the definition of
spatio-temporal local features that embed space and time jointly. In this case:
Videos are considered as volumes of pixels.
Spatio-temporal features are located at spatio-temporal salient points that are extracted
with interest point operators.
Similarly to the 2D case, the detector searches for interest point structures that are stable under
rotation, viewpoint, scale and illumination changes.

Space-time interest point detectors are extensions of 2D interest point detectors that incorporate
temporal information.

Most popular solutions


Detectors:

STIP Spatio Temporal Interest Points (Harris3D) [I. Laptev, IJCV 2005]
Dollár's detector [P. Dollár et al., VS-PETS 2005]
Hessian3D [G. Willems et al., ECCV 2008]
Regular sampling [H. Wang et al. BMVC 2009]

Descriptors:
HOG/HOF [I. Laptev, et al. CVPR 2008]
Dollár [P. Dollár et al., VS-PETS 2005]
HoG3D [A. Klaeser et al., BMVC 2008]
Extended SURF [G. Willems et al., ECCV 2008]

STIP: Spatio Temporal Interest Points


Spatio-Temporal Interest Points (STIP) were proposed by I. Laptev in 2005. They are based on the
detection of spatio-temporal corners.

Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all
three directions (x, y, t). This requires that spatio-temporal corners are located at spatial corners
whose motion inverts in two consecutive frames (high temporal gradient variation).
They are identified from the local maxima of a cornerness function computed for all pixels across
spatial and temporal scales.

STIP Detector
The Harris corner operator is extended to time:
Represent the video as a function $f(x,y,t)$.
Compute Gaussian derivatives $L$ with a kernel $g$ of covariance $\Sigma$. For each single scale pair $(\sigma, \tau)$ the Gaussian derivatives $L$ are computed for each pixel $p$:

$$L(\cdot;\sigma^2,\tau^2) = g(\cdot;\sigma^2,\tau^2) * f(\cdot), \qquad \Sigma = \mathrm{diag}(\sigma^2, \sigma^2, \tau^2)$$

where $\sigma$ is the spatial scale and $\tau$ the temporal scale.

The space-time gradient is obtained as:

$$\nabla L = (L_x, L_y, L_t)^T$$

Interest points are extracted by evaluating the distribution of $\nabla L$ within a local neighbourhood.

The second-moment matrix $\mu$ measures the variation of the gradients:

$$\mu = g(\cdot;\, s\sigma^2, s\tau^2) * \bigl(\nabla L\, (\nabla L)^T\bigr)$$

A high variation of $\nabla L$ implies large eigenvalues of $\mu$.

Spatio-temporal corners are obtained from the local maxima of $H$ over $(x,y,t)$:

$$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1+\lambda_2+\lambda_3)^3$$

similar to the Harris operator, where $\lambda_i$ are the eigenvalues of $\mu$ and $k$ is a small constant.
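A compact NumPy/SciPy sketch of this cornerness function on a video volume; the scales, the integration factor s and the constant k are illustrative choices, not Laptev's reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Cornerness H = det(mu) - k * trace(mu)^3 for every voxel of a
    grayscale video volume f(x, y, t) with shape (T, H, W)."""
    v = video.astype(np.float32)
    # Gaussian smoothing at differentiation scales, then derivatives (axes: t, y, x)
    L = gaussian_filter(v, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the second-moment matrix, averaged over a Gaussian integration window
    w = (s * tau, s * sigma, s * sigma)
    m = {}
    for name, a, b in [("xx", Lx, Lx), ("yy", Ly, Ly), ("tt", Lt, Lt),
                       ("xy", Lx, Ly), ("xt", Lx, Lt), ("yt", Ly, Lt)]:
        m[name] = gaussian_filter(a * b, sigma=w)
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["yt"] * m["xt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]
    return det - k * trace ** 3   # local maxima over (x, y, t) are the STIPs
```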

Scale selection in space and time


Scale invariance is obtained by selecting space-time locations at their characteristic scale.
The normalized Laplacian is able to select this scale for Harris corners.

Scale selection algorithm (a sketch follows below):

Detect space-time corners for a sparse combination of spatial and temporal scales $(\sigma_i, \tau_j)$.
For each point detected at location $(x, y, t, \sigma_i, \tau_j)$ compute the normalized Laplacian at the given
location and at the neighbouring scales $(x, y, t, 2^\delta\sigma_i, 2^\delta\tau_j)$ with $\delta = -0.25, 0, +0.25$.
Select the location $(x, y, t, \sigma_i, \tau_j)$ that maximizes the normalized Laplacian.
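A possible sketch of the scale refinement step, assuming the scale-normalisation exponents of Laptev's space-time Laplacian; both functions are illustrative, not a reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalized_laplacian(video, sigma, tau):
    """Scale-normalised space-time Laplacian at scales (sigma, tau)."""
    L = gaussian_filter(video.astype(np.float32), sigma=(tau, sigma, sigma))
    # Second derivatives along t, y, x
    Ltt, Lyy, Lxx = [np.gradient(np.gradient(L, axis=a), axis=a) for a in (0, 1, 2)]
    return sigma**2 * tau**0.5 * (Lxx + Lyy) + sigma * tau**1.5 * Ltt

def refine_scale(video, x, y, t, sigma, tau):
    """Among the neighbouring scales (2**d * sigma, 2**d * tau) with
    d in {-0.25, 0, +0.25}, pick the pair maximising the magnitude of
    the normalised Laplacian at the detected point."""
    best = None
    for d in (-0.25, 0.0, +0.25):
        s, u = 2**d * sigma, 2**d * tau
        val = abs(normalized_laplacian(video, s, u)[t, y, x])
        if best is None or val > best[0]:
            best = (val, s, u)
    return best[1], best[2]
```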
Examples: hand waving, boxing.

STIP summary

Derived from the 2D Harris corner detector.

Maxima of H correspond to:

spatial corners whose motion inverts

joining/splitting structures
It is very robust but sparse.
Scale selection is computationally expensive.

Dollár's periodic motion detector


The spatio-temporal detector proposed by Dollár treats time and space differently. It attempts to
overcome the excessive sparseness of the interest points of Laptev's detector, due to the rarity of true
space-time corners and to the scale-selection process.
Dollár's detector obtains a denser sampling by avoiding scale selection; it uses a Gaussian
filter in space and a Gabor filter in time.
The Gaussian filter sets the spatial scale (σ) by smoothing each frame.
The Gabor bandpass filter gives high responses to periodic variations of the signal.
The interest point detector response R is computed as follows:

$$R = \bigl(I * g(\sigma) * h_{ev}\bigr)^2 + \bigl(I * g(\sigma) * h_{od}\bigr)^2$$

$$h_{ev}(t;\tau,\omega) = \cos(2\pi t\omega)\, e^{-t^2/\tau^2} \qquad h_{od}(t;\tau,\omega) = \sin(2\pi t\omega)\, e^{-t^2/\tau^2} \qquad \omega = 4/\tau$$

$$g(x,y;\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2}$$

$$\sigma = 2, 4, 8, \ldots \qquad \tau = 2, 4, 8, \ldots$$
Multiple scales in space and time can be used in order
to increase the number of interest points selected and
to represent space-time structures at different scales.

Importance of multiple scales


The spatial scale refers to the size of the moving object:
we can detect the same event observing it at different distances
we are able to select events of different spatial sizes (e.g. head, legs)
The temporal scale refers to the speed at which the object moves:
we are able to detect the same event performed at a different speed.
we are able to detect the proper scale for different events.
[Figure: detector responses for a walking person at a large and a small scale. At a certain scale only the torso motion is detected, although the leg and arm movements are undoubtedly more informative. Red denotes a high detector response at a given space and time.]

Dollár's detector summary

It separates the time and space filters.

It is a band-pass filter in time.
It is denser than Harris3D.
There is no scale selection: dense scale sampling can be used instead.

Hessian3D detector
It is conceptually derived from SURF extended to time: it uses box filters and integral videos to
speed up computation.
It is faster and denser than Harris3D, but less dense than Dollár's detector.
It performs scale selection, but this is done by scaling the filter rather than the image.

Space-time feature detectors


[Figure: comparison of detections from Harris3D, Dollár's detector, Hessian3D and dense sampling.]

Descriptors for spatio-temporal patches


At each spatio-temporal interest point, descriptors are defined taking into account the volume of the
cuboid neighbourhood. The size of the cuboid is obtained from the scales as $(k\sigma) \times (k\sigma) \times (k\tau)$, with $k$
a suitable constant, typically equal to 6.
Descriptors of the volume are computed within a common framework (a generic sketch is given below):
Preprocessing: volumes are smoothed with a 3D Gaussian kernel.
Spatio-temporal pooling: the volume is subdivided into a number of smaller cuboid cells
(e.g. 3x3x2 cells).
Feature computation (for each pixel a function or a transformation is computed in order to
obtain invariance to illumination and rotation) followed by feature quantization (histograms
of the computed features are accumulated).
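A generic sketch of this pipeline with the per-cell feature function left as a plug-in; the function name, the cell layout and the bin count are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def cuboid_descriptor(cuboid, feature_fn, cells=(2, 3, 3), bins=8):
    """Generic pipeline for a space-time cuboid of shape (T, H, W):
    3D Gaussian smoothing, subdivision into cells (2 in time, 3x3 in space),
    per-cell feature histograms, concatenation and L2 normalisation.
    feature_fn maps a sub-volume to (values in [0, 2*pi), weights)."""
    c = gaussian_filter(cuboid.astype(np.float32), sigma=1.0)
    nt, ny, nx = cells
    T, H, W = c.shape
    hists = []
    for it in range(nt):
        for iy in range(ny):
            for ix in range(nx):
                sub = c[it * T // nt:(it + 1) * T // nt,
                        iy * H // ny:(iy + 1) * H // ny,
                        ix * W // nx:(ix + 1) * W // nx]
                values, weights = feature_fn(sub)
                h, _ = np.histogram(values, bins=bins, range=(0, 2 * np.pi),
                                    weights=weights)
                hists.append(h)
    d = np.concatenate(hists).astype(np.float32)
    return d / (np.linalg.norm(d) + 1e-12)
```

Any per-pixel feature (gradient orientations, optical flow orientations, wavelet responses) can be plugged in as feature_fn.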

HOG, HOF descriptors


Typical representations widely used are:
Histograms of 3D Gradient orientations (HOG), based on the derivatives of the space-time pixel values.
They model the appearance.
Histograms of Optical Flow magnitude and orientation (HOF). They model the motion.
They obtain the best performance since they represent the dynamic content of the cuboid
volume.

3D Gradient (HoG)

The 3D gradient is computed at each pixel by differentiating the image function $I(x,y,t) \in \mathbb{R}$ (three
channels are obtained):
$G_x(x,y,t) = I(x+1, y, t) - I(x-1, y, t)$
$G_y(x,y,t) = I(x, y+1, t) - I(x, y-1, t)$
$G_t(x,y,t) = I(x, y, t+1) - I(x, y, t-1)$

The gradient is represented using its magnitude $M$ and the orientations $\phi$ and $\theta$.

3D Gradient

$$M = \sqrt{G_x^2 + G_y^2 + G_t^2}$$

$$\phi = \tan^{-1}\!\left(G_t \big/ \sqrt{G_x^2 + G_y^2}\right)$$

$$\theta = \tan^{-1}(G_y / G_x)$$
Orientations are quantized similarly to SIFT, but in 3D there is a normalization issue: solid angles near
the equator weigh more than solid angles near the poles.

Solution 1): Weight the orientation bins with the inverse of the solid angle

Solution 2): Use platonic solids located at the centers of each cuboid sub-volume to quantize the
gradient orientation (platonic solids have congruent faces, i.e. the angles corresponding to the faces are all
equal) and perform quantization by projecting the gradient vectors onto the normals to the solid faces.
This projects the gradient vector jointly characterized by $\phi$ and $\theta$.

Solution 3): Quantize the two orientations separately, computing histograms of $\phi$ and $\theta$ independently:
this avoids rescaling of the bins and keeps the histograms dense (the simplest solution).
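A sketch of Solution 3 on a single cuboid: central-difference gradients, magnitude-weighted histograms of φ and θ quantized separately; the bin counts are illustrative:

```python
import numpy as np

def hog3d_separate(cuboid, bins_phi=4, bins_theta=8):
    """Quantise the two gradient orientations phi and theta separately,
    weighting each vote by the gradient magnitude."""
    c = cuboid.astype(np.float32)                    # shape (T, H, W)
    gt, gy, gx = np.gradient(c)                      # central differences
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    phi = np.arctan2(gt, np.sqrt(gx**2 + gy**2))     # elevation, [-pi/2, pi/2]
    theta = np.arctan2(gy, gx)                       # azimuth, [-pi, pi]
    h_phi, _ = np.histogram(phi, bins=bins_phi, range=(-np.pi / 2, np.pi / 2),
                            weights=mag)
    h_theta, _ = np.histogram(theta, bins=bins_theta, range=(-np.pi, np.pi),
                              weights=mag)
    return np.concatenate([h_phi, h_theta])
```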

Optical flow (HoF)


Optical flow measures the apparent motion of a pixel between two frames. If the camera is still it
corresponds to movement of objects in the world projected onto the image plane. In case of egomotion the information carried by the optic flow may be misleading.
Several methods have been proposed (Horn and Schunck 81, Lucas and Kanade 81).
They assume that image intensity does not change significantly from one frame to another due to
illumination. Variations of intensity are therefore exploited to compute pixel velocities.
Aperture problem: a vertical edge moving vertically produces null optical flow.

Optic flow is represented by quantizing the orientation of velocity vector with components Vx ,Vy .
A bin of no-motion is usually computed.

Optical flow

$$M = \sqrt{V_x^2 + V_y^2} \qquad \theta = \tan^{-1}(V_y / V_x)$$
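A sketch of a HOF computed from dense Farneback flow between two grayscale uint8 frames, with an extra no-motion bin; the flow parameters and the motion threshold are illustrative:

```python
import cv2
import numpy as np

def hof(prev_frame, next_frame, bins=8, min_motion=1.0):
    """Magnitude-weighted histogram of flow orientations plus a bin
    counting pixels with (almost) no motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(vx**2 + vy**2)
    theta = np.arctan2(vy, vx)                       # [-pi, pi]
    moving = mag >= min_motion
    h, _ = np.histogram(theta[moving], bins=bins, range=(-np.pi, np.pi),
                        weights=mag[moving])
    no_motion = np.count_nonzero(~moving)            # the no-motion bin
    return np.append(h, no_motion).astype(np.float32)
```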

E-SURF descriptor
The 3D cuboid is divided into cells.
Bins are filled with weighted sums of the responses of the axis-aligned Haar wavelets $d_x$, $d_y$, $d_t$.
Sums of absolute values are not included (as in 2D SURF) since they don't improve performance.

PCA-based cuboid descriptor

Learn a PCA basis from the gradients of training cuboids.

Project the gradients of the pixels of a cuboid onto the first 100 principal components (the basis
given by the first 100 eigenvectors) to obtain the descriptor.
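A possible sketch of this descriptor with plain NumPy, assuming all cuboids have the same fixed size and that enough training cuboids are available to estimate 100 components:

```python
import numpy as np

def pca_basis(training_cuboids, n_components=100):
    """Learn a PCA basis from flattened cuboid gradients (one row per
    cuboid): return the mean and the first n_components eigenvectors."""
    X = np.stack([np.concatenate(np.gradient(c.astype(np.float32))).ravel()
                  for c in training_cuboids])
    mean = X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_descriptor(cuboid, mean, basis):
    """Project a cuboid's flattened gradient onto the learned basis."""
    g = np.concatenate(np.gradient(cuboid.astype(np.float32))).ravel()
    return (g - mean) @ basis.T
```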
