
Spatio-temporal features

Actions, Activities, Events

Actions are short, task-oriented body movements such as waving a hand or drinking
from a bottle. Some actions are atomic, but often the actions of interest have a cyclic
nature, such as walking or running.

Activities involve multiple people or happen over longer timeframes. Activities are often the
result of a combination of actions, such as taking money out of an ATM or waiting for a bus.

We often refer to an Event as a combination of activities, usually involving more people
and happening in a given context, such as a soccer match, a car accident or a fire in a
wood.

None of these are rigorous definitions.

Computer vision grand challenge: video understanding


[Figure: video understanding requires recognising, over the same footage, scene categories (indoors, outdoors, street scene, countryside, etc.), geometry (street, wall, field, stair, etc.), objects (cars, glasses, people, candles, buildings, etc.), actions (drinking, running, exiting through a door, a car entering, a car crash, etc.) and higher-level events such as a kidnapping, together with the constraints relating them.]

Requirements for action recognition

A generic action recognition framework needs a sufficiently robust representation, so that
classifiers can concentrate on the truly discriminant spatio-temporal features and are not
distracted by clutter or other irrelevant intra-class variations. Intra-class variation is due to
many factors:

Person appearance variation due to gender, clothing, body posture and size.

Camera parameters, scene clutter and illumination.

Camera motion, which needs to be either removed via motion compensation or handled with
robust representations.

A robust representation is able to remove all the noisy features (clothing, gender,
illumination, scale, etc.) and preserve the variability of the body motion involved in
different actions.

Action representation

Actions can be described following different approaches:

Holistic representations: each action is represented by a vector of features.


Local representations: each action is represented with a set of feature vectors.
Feature fusion/context modelling: each action is represented with a fusion of multiple
diverse features also representing the context of the action.

Holistic representation: Motion History Images

Perform image differencing to detect motion, optionally with background subtraction. Let $D(x,y,t)$ be the binary image of detected motion at time $t$.

The Motion Energy Image (MEI) is a binary image defined as follows:

$$E_\tau(x,y,t) = \bigcup_{i=0}^{\tau-1} D(x,y,t-i)$$

it describes WHERE the motion happens.

The Motion History Image (MHI) is a real-valued image defined as follows:

$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t)=1 \\ \max\bigl(0,\ H_\tau(x,y,t-1)-1\bigr) & \text{otherwise} \end{cases}$$

it describes HOW the motion happens.
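A minimal NumPy/OpenCV sketch of the recursive MHI update above, using plain frame differencing (no background subtraction); the threshold and the temporal window tau are illustrative values, not the original paper's settings:

```python
import cv2
import numpy as np

def motion_history(frames, tau=30, diff_thresh=25):
    """Compute the MEI and MHI from a list of grayscale uint8 frames
    by thresholded frame differencing. tau is the temporal window in
    frames; diff_thresh binarises the difference image."""
    h, w = frames[0].shape
    mhi = np.zeros((h, w), dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Binary motion mask D(x, y, t) from thresholded differencing
        d = (cv2.absdiff(curr, prev) > diff_thresh).astype(np.float32)
        # MHI update: set to tau where motion occurs, otherwise decay by 1
        mhi = np.where(d > 0, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = (mhi > 0).astype(np.uint8)  # MEI: union of motion over the last tau frames
    return mei, mhi
```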

Motion History Image descriptors

The Motion History Image can be described synthetically through image moments of different
order.

[A.F. Bobick and J.W. Davis, IEEE TPAMI 2001]

Hu moments: recall

Given a distribution (image intensity), the moments of order p,q are defined as:

$$m_{pq} = \sum_x \sum_y x^p\, y^q\, I(x,y)$$

Central moments are translation invariant and defined in terms of moments:

$$\mu_{pq} = \sum_x \sum_y (x-\bar{x})^p\, (y-\bar{y})^q\, I(x,y), \qquad \bar{x} = m_{10}/m_{00},\ \bar{y} = m_{01}/m_{00}$$

In order to obtain scale invariance we define the normalized central moments:

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}}$$

Rotation invariance is obtained by combining them into the Hu moments. The first four Hu moments are defined as:

$$\phi_1 = \eta_{20} + \eta_{02}$$
$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$
$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$
$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$
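OpenCV computes these quantities directly; a short sketch, assuming the MHI has been saved as a grayscale image (the filename is a placeholder):

```python
import cv2
import numpy as np

# cv2.moments returns the raw, central and normalised central moments;
# cv2.HuMoments combines them into the seven rotation-invariant values.
mhi = cv2.imread("mhi.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
m = cv2.moments(mhi)                 # dict with m_pq, mu_pq, nu_pq
hu = cv2.HuMoments(m).flatten()      # seven Hu invariants
# A log transform is commonly applied because the values span many orders of magnitude.
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
print(hu_log[:4])                    # first four invariants
```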

Example
Aerobic dataset: 18 moves.
A NN classifier using the Mahalanobis distance achieves 66% accuracy.

Holistic approaches summary

Simple and fast solution: works very well in controlled settings.


Prone to errors in background subtraction: variations in light, shadows, clothing (what is the background here?).

Does not capture interior motion and shape: the silhouette alone tells little about the action.

Space-time local features

A more useful and effective approach is to extract local features at space-time interest points and
encode the temporal information directly into the local feature. This results in the definition of
spatio-temporal local features that embed space and time jointly. In this case:
Videos are considered as volumes of pixels.
Spatio-temporal features are located at spatio-temporal salient points that are extracted
with interest point operators.
Similarly to the 2D case, the detector searches for interest point structures that are stable under
rotation, viewpoint, scale and illumination changes.

Space-time interest point detectors are extensions of 2D interest point detectors that incorporate
temporal information.

Most popular solutions


Detectors:

STIP Spatio Temporal Interest Points (Harris3D) [I. Laptev, IJCV 2005]
Dollár's detector [P. Dollár et al., VS-PETS 2005]
Hessian3D [G. Willems et al., ECCV 2008]
Regular sampling [H. Wang et al. BMVC 2009]

Descriptors:
HOG/HOF [I. Laptev, et al. CVPR 2008]
Dollár [P. Dollár et al., VS-PETS 2005]
HoG3D [A. Klaeser et al., BMVC 2008]
Extended SURF [G. Willems et al., ECCV 2008]

STIP: Spatio Temporal Interest Points


Spatio-Temporal Interest Points (STIP) were proposed by I. Laptev in 2005. They are based on the
detection of spatio-temporal corners.

Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all
three directions (x, y, t). This requires that spatio-temporal corners are located at spatial corners
whose motion inverts in two consecutive frames (high temporal gradient variation).
They are identified from the local maxima of a cornerness function computed for all pixels across
spatial and temporal scales.

STIP Detector
The Harris corner operator is extended to time:
Represent the video as a function $f(x,y,t)$.
Compute Gaussian derivatives $L$ with a kernel $g$ of covariance $\Sigma$. For each single scale pair $(\sigma, \tau)$ the Gaussian derivatives $L$ are computed for each pixel $p$:

$$L(\cdot;\sigma^2,\tau^2) = g(\cdot;\sigma^2,\tau^2) * f(\cdot), \qquad \Sigma = \mathrm{diag}(\sigma^2, \sigma^2, \tau^2)$$

where $\sigma$ is the spatial scale and $\tau$ the temporal scale.

The space-time gradient is obtained as:

$$\nabla L = (L_x, L_y, L_t)^T$$

Interest points are extracted by evaluating the distribution of $\nabla L$ within a local neighbourhood.

The second-moment matrix $\mu$ measures the variation of the gradients:

$$\mu = g(\cdot;\, s\sigma^2, s\tau^2) * \bigl(\nabla L\, (\nabla L)^T\bigr)$$

A high variation of $\nabla L$ implies large eigenvalues of $\mu$.

Spatio-temporal corners are obtained from the local maxima of $H$ over $(x,y,t)$:

$$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1+\lambda_2+\lambda_3)^3$$

similar to the Harris operator, where $\lambda_i$ are the eigenvalues of $\mu$ and $k$ is a small constant.
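A compact NumPy/SciPy sketch of this cornerness function on a video volume; the scales, the integration factor s and the constant k are illustrative choices, not Laptev's reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Cornerness H = det(mu) - k * trace(mu)^3 for every voxel of a
    grayscale video volume f(x, y, t) with shape (T, H, W)."""
    v = video.astype(np.float32)
    # Gaussian smoothing at differentiation scales, then derivatives (axes: t, y, x)
    L = gaussian_filter(v, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the second-moment matrix, averaged over a Gaussian integration window
    w = (s * tau, s * sigma, s * sigma)
    m = {}
    for name, a, b in [("xx", Lx, Lx), ("yy", Ly, Ly), ("tt", Lt, Lt),
                       ("xy", Lx, Ly), ("xt", Lx, Lt), ("yt", Ly, Lt)]:
        m[name] = gaussian_filter(a * b, sigma=w)
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["yt"] * m["xt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]
    return det - k * trace ** 3   # local maxima over (x, y, t) are the STIPs
```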

Scale selection in space and time


Scale invariance is obtained by selecting space-time locations at their characteristic scale.
The normalized Laplacian is able to select this scale for Harris corners.

Scale selection algorithm (a sketch follows below):

Detect space-time corners for a sparse combination of spatial and temporal scales $(\sigma_i, \tau_j)$.
For each point detected at location $(x, y, t, \sigma_i, \tau_j)$ compute the normalized Laplacian at the given
location and at the neighbouring scales $(x, y, t, 2^\delta\sigma_i, 2^\delta\tau_j)$ with $\delta = -0.25, 0, +0.25$.
Select the location $(x, y, t, \sigma_i, \tau_j)$ that maximizes the normalized Laplacian.
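A possible sketch of the scale refinement step, assuming the scale-normalisation exponents of Laptev's space-time Laplacian; both functions are illustrative, not a reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalized_laplacian(video, sigma, tau):
    """Scale-normalised space-time Laplacian at scales (sigma, tau)."""
    L = gaussian_filter(video.astype(np.float32), sigma=(tau, sigma, sigma))
    # Second derivatives along t, y, x
    Ltt, Lyy, Lxx = [np.gradient(np.gradient(L, axis=a), axis=a) for a in (0, 1, 2)]
    return sigma**2 * tau**0.5 * (Lxx + Lyy) + sigma * tau**1.5 * Ltt

def refine_scale(video, x, y, t, sigma, tau):
    """Among the neighbouring scales (2**d * sigma, 2**d * tau) with
    d in {-0.25, 0, +0.25}, pick the pair maximising the magnitude of
    the normalised Laplacian at the detected point."""
    best = None
    for d in (-0.25, 0.0, +0.25):
        s, u = 2**d * sigma, 2**d * tau
        val = abs(normalized_laplacian(video, s, u)[t, y, x])
        if best is None or val > best[0]:
            best = (val, s, u)
    return best[1], best[2]
```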
Examples: hand waving, boxing.

STIP summary

Derived from the 2D Harris corner detector.

Maxima of H correspond to:

spatial corners whose motion inverts

joining/splitting structures
It is very robust but sparse.
Scale selection is computationally expensive.

Dollár's periodic motion detector


The spatio-temporal detector proposed by Dollár treats time and space differently. It attempts to
overcome the excessive sparseness of the interest points of Laptev's detector, due to the rarity of true
space-time corners and to the scale-selection process.
Dollár's detector obtains a denser sampling by avoiding scale selection; it uses a Gaussian
filter in space and a Gabor filter in time.
The Gaussian filter sets the spatial scale (σ) by smoothing each frame.
The Gabor bandpass filter gives high responses to periodic variations of the signal.
The interest point detector response R is computed as follows:

$$R = \bigl(I * g(\sigma) * h_{ev}\bigr)^2 + \bigl(I * g(\sigma) * h_{od}\bigr)^2$$

$$h_{ev}(t;\tau,\omega) = \cos(2\pi t\omega)\, e^{-t^2/\tau^2} \qquad h_{od}(t;\tau,\omega) = \sin(2\pi t\omega)\, e^{-t^2/\tau^2} \qquad \omega = 4/\tau$$

$$g(x,y;\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2}$$

$$\sigma = 2, 4, 8, \ldots \qquad \tau = 2, 4, 8, \ldots$$
Multiple scales in space and time can be used in order
to increase the number of interest points selected and
to represent space-time structures at different scales.

Importance of multiple scales


The spatial scale refers to the size of the moving object:
we can detect the same event observing it at different distances
we are able to select events of different spatial sizes (e.g. head, legs)
The temporal scale refers to the speed at which the object moves:
we are able to detect the same event performed at a different speed.
we are able to detect the proper scale for different events.
[Figure: detector responses for a walking person at a large and a small scale. At a certain scale only the torso motion is detected, although the leg and arm movements are undoubtedly more informative. Red denotes a high detector response at a given space and time.]

Dollár's detector summary

It separates the time and space filters.

It is a band-pass filter in time.
It is denser than Harris3D.
There is no scale selection: dense scale sampling can be used instead.

Hessian3D detector
It is conceptually derived from SURF extended to time: it uses box filters and integral videos to
speed up computation.
It is faster and denser than Harris3D, but less dense than Dollár's detector.
It performs scale selection, but this is done by scaling the filter rather than the image.

Space-time feature detectors


[Figure: comparison of detections from Harris3D, Dollár's detector, Hessian3D and dense sampling.]

Descriptors for spatio-temporal patches


At each spatio-temporal interest point, descriptors are defined taking into account the volume of the
cuboid neighbourhood. The size of the cuboid is obtained from the scales as $(k\sigma) \times (k\sigma) \times (k\tau)$, with $k$
a suitable constant, typically equal to 6.
Descriptors of the volume are computed within a common framework (a generic sketch is given below):
Preprocessing: volumes are smoothed with a 3D Gaussian kernel.
Spatio-temporal pooling: the volume is subdivided into a number of smaller cuboid cells
(e.g. 3x3x2 cells).
Feature computation (for each pixel a function or a transformation is computed in order to
obtain invariance to illumination and rotation) followed by feature quantization (histograms
of the computed features are accumulated).
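A generic sketch of this pipeline with the per-cell feature function left as a plug-in; the function name, the cell layout and the bin count are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def cuboid_descriptor(cuboid, feature_fn, cells=(2, 3, 3), bins=8):
    """Generic pipeline for a space-time cuboid of shape (T, H, W):
    3D Gaussian smoothing, subdivision into cells (2 in time, 3x3 in space),
    per-cell feature histograms, concatenation and L2 normalisation.
    feature_fn maps a sub-volume to (values in [0, 2*pi), weights)."""
    c = gaussian_filter(cuboid.astype(np.float32), sigma=1.0)
    nt, ny, nx = cells
    T, H, W = c.shape
    hists = []
    for it in range(nt):
        for iy in range(ny):
            for ix in range(nx):
                sub = c[it * T // nt:(it + 1) * T // nt,
                        iy * H // ny:(iy + 1) * H // ny,
                        ix * W // nx:(ix + 1) * W // nx]
                values, weights = feature_fn(sub)
                h, _ = np.histogram(values, bins=bins, range=(0, 2 * np.pi),
                                    weights=weights)
                hists.append(h)
    d = np.concatenate(hists).astype(np.float32)
    return d / (np.linalg.norm(d) + 1e-12)
```

Any per-pixel feature (gradient orientations, optical flow orientations, wavelet responses) can be plugged in as feature_fn.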

HOG, HOF descriptors


Typical representations widely used are:
Histograms of 3D Gradient orientations (HOG), based on the derivatives of the space-time pixel values.
They model the appearance.
Histograms of Optical Flow magnitude and orientation (HOF). They model the motion.
They obtain the best performance since they represent the dynamic content of the cuboid
volume.

3D Gradient (HoG)

The 3D gradient is computed at each pixel by differentiating the image function $I(x,y,t) \in \mathbb{R}$ (three
channels are obtained):
$G_x(x,y,t) = I(x+1, y, t) - I(x-1, y, t)$
$G_y(x,y,t) = I(x, y+1, t) - I(x, y-1, t)$
$G_t(x,y,t) = I(x, y, t+1) - I(x, y, t-1)$

The gradient is represented using its magnitude $M$ and the orientations $\phi$ and $\theta$.

3D Gradient

$$M = \sqrt{G_x^2 + G_y^2 + G_t^2}$$

$$\phi = \tan^{-1}\!\left(G_t \big/ \sqrt{G_x^2 + G_y^2}\right)$$

$$\theta = \tan^{-1}(G_y / G_x)$$
Orientations are quantized similarly to SIFT, but in 3D there is a normalization issue: solid angles near
the equator weigh more than solid angles near the poles.

Solution 1): Weight the orientation bins with the inverse of the solid angle

Solution 2): Use platonic solids located at the centers of each cuboid sub-volume to quantize the
gradient orientation (platonic solids have congruent faces, i.e. the angles corresponding to the faces are all
equal) and perform quantization by projecting the gradient vectors onto the normals to the solid faces.
This projects the gradient vector jointly characterized by $\phi$ and $\theta$.

Solution 3): Quantize the two orientations separately, computing histograms of $\phi$ and $\theta$ independently:
this avoids rescaling of the bins and keeps the histograms dense (the simplest solution).
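A sketch of Solution 3 on a single cuboid: central-difference gradients, magnitude-weighted histograms of φ and θ quantized separately; the bin counts are illustrative:

```python
import numpy as np

def hog3d_separate(cuboid, bins_phi=4, bins_theta=8):
    """Quantise the two gradient orientations phi and theta separately,
    weighting each vote by the gradient magnitude."""
    c = cuboid.astype(np.float32)                    # shape (T, H, W)
    gt, gy, gx = np.gradient(c)                      # central differences
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    phi = np.arctan2(gt, np.sqrt(gx**2 + gy**2))     # elevation, [-pi/2, pi/2]
    theta = np.arctan2(gy, gx)                       # azimuth, [-pi, pi]
    h_phi, _ = np.histogram(phi, bins=bins_phi, range=(-np.pi / 2, np.pi / 2),
                            weights=mag)
    h_theta, _ = np.histogram(theta, bins=bins_theta, range=(-np.pi, np.pi),
                              weights=mag)
    return np.concatenate([h_phi, h_theta])
```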

Optical flow (HoF)


Optical flow measures the apparent motion of a pixel between two frames. If the camera is still it
corresponds to movement of objects in the world projected onto the image plane. In case of egomotion the information carried by the optic flow may be misleading.
Several methods have been proposed (Horn and Schunck 81, Lucas and Kanade 81).
They assume that image intensity does not change significantly from one frame to another due to
illumination. Variations of intensity are therefore exploited to compute pixel velocities.
Aperture problem: a vertical edge moving vertically produces null optical flow.

Optic flow is represented by quantizing the orientation of velocity vector with components Vx ,Vy .
A bin of no-motion is usually computed.

Optical flow

$$M = \sqrt{V_x^2 + V_y^2} \qquad \theta = \tan^{-1}(V_y / V_x)$$
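A sketch of a HOF computed from dense Farneback flow between two grayscale uint8 frames, with an extra no-motion bin; the flow parameters and the motion threshold are illustrative:

```python
import cv2
import numpy as np

def hof(prev_frame, next_frame, bins=8, min_motion=1.0):
    """Magnitude-weighted histogram of flow orientations plus a bin
    counting pixels with (almost) no motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(vx**2 + vy**2)
    theta = np.arctan2(vy, vx)                       # [-pi, pi]
    moving = mag >= min_motion
    h, _ = np.histogram(theta[moving], bins=bins, range=(-np.pi, np.pi),
                        weights=mag[moving])
    no_motion = np.count_nonzero(~moving)            # the no-motion bin
    return np.append(h, no_motion).astype(np.float32)
```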

E-SURF descriptor
The 3D cuboid is divided into cells.
Bins are filled with weighted sums of the responses of the axis-aligned Haar wavelets $d_x$, $d_y$, $d_t$.
Sums of absolute values are not included (as in 2D SURF) since they don't improve performance.

PCA-based cuboid descriptor

Learn a PCA basis from the gradients of training cuboids.

Project the gradients of the pixels of a cuboid onto the first 100 principal components (the basis
given by the first 100 eigenvectors) to obtain the descriptor.
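A possible sketch of this descriptor with plain NumPy, assuming all cuboids have the same fixed size and that enough training cuboids are available to estimate 100 components:

```python
import numpy as np

def pca_basis(training_cuboids, n_components=100):
    """Learn a PCA basis from flattened cuboid gradients (one row per
    cuboid): return the mean and the first n_components eigenvectors."""
    X = np.stack([np.concatenate(np.gradient(c.astype(np.float32))).ravel()
                  for c in training_cuboids])
    mean = X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_descriptor(cuboid, mean, basis):
    """Project a cuboid's flattened gradient onto the learned basis."""
    g = np.concatenate(np.gradient(cuboid.astype(np.float32))).ravel()
    return (g - mean) @ basis.T
```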
