Actions are short, task-oriented body movements, such as waving a hand or drinking from a bottle. Some actions are atomic, but actions of interest often have a cyclic nature, such as walking or running.
Activities involve multiple people or happen over longer timeframes. Activities are often the result of a combination of actions, like taking money out of an ATM or waiting for a bus outdoors.
[Figure: examples of visual concepts in video. Objects: cars, glasses, people, etc. Actions: drinking, running, door exit, car enter, etc. Scene categories: indoors, outdoors, street scene, etc. Geometry: street, wall, field, stair, etc. Plus geometric constraints and composite events such as a car crash or a kidnapping.]
Person appearance varies because of gender, clothing, body posture, and size.
Camera motion needs to be either removed via motion compensation or handled with robust representations.
A robust representation is able to discard the nuisance factors (clothing, gender, illumination, scale, etc.) while preserving the variability with respect to the body motion involved in different actions.
Action representation
MHI
The Motion History Image can be described compactly through image moments of different order:
Hu moments: recall
Given a distribution (here the image intensity I), the moments of order p, q are defined as:
m_pq = Σ_x Σ_y x^p y^q I(x, y)
The seven Hu moments are combinations of the normalized central moments that are invariant to translation, scale, and rotation.
Example
Aerobics dataset: 18 moves. A nearest-neighbor (NN) classifier using the Mahalanobis distance achieves 66% accuracy.
Limitation: the MHI does not capture interior motion and shape, and the silhouette alone tells little about the action.
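As a concrete illustration, here is a minimal sketch of this pipeline, assuming OpenCV and NumPy; the decay duration tau, the frame-difference threshold, and the helper names update_mhi and hu_descriptor are illustrative choices, not part of the original method.

```python
import cv2
import numpy as np

def update_mhi(mhi, prev_gray, gray, tau=30, thresh=25):
    """One MHI step. mhi: float32 array of zeros at start, same shape as gray.
    Decays old motion by one step and stamps moving pixels with tau."""
    moving = cv2.absdiff(gray, prev_gray) > thresh   # crude motion mask
    mhi = np.maximum(mhi - 1.0, 0.0)                 # linear decay
    mhi[moving] = tau                                # most recent motion
    return mhi

def hu_descriptor(mhi):
    """7 Hu moments of the MHI; log scaling tames their dynamic range."""
    hu = cv2.HuMoments(cv2.moments(mhi.astype(np.float32))).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)
```

An NN classifier with the Mahalanobis distance over these 7-D vectors then reproduces the setup of the example above.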
A more useful and effective approach is to extract local features at space-time interest points and encode the temporal information directly into the local feature. This results in the definition of spatio-temporal local features that embed space and time jointly. In this case:
Videos are considered as volumes of pixels.
Spatio-temporal features are located at spatio-temporal salient points that are extracted
with interest point operators.
As in the 2D case, the search is for interest point structures that are stable under rotation, viewpoint, scale, and illumination changes.
Space-time interest point detectors are extensions of 2D interest point detectors that incorporate temporal information.
Detectors:
STIP (Spatio-Temporal Interest Points, Harris3D) [I. Laptev, IJCV 2005]
Dollár detector [P. Dollár et al., VS-PETS 2005]
Hessian3D [G. Willems et al., ECCV 2008]
Regular sampling [H. Wang et al., BMVC 2009]
Descriptors:
HOG/HOF [I. Laptev et al., CVPR 2008]
Dollár cuboid [P. Dollár et al., VS-PETS 2005]
HOG3D [A. Klaeser et al., BMVC 2008]
Extended SURF [G. Willems et al., ECCV 2008]
Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all three directions (x, y, t). This requires that spatio-temporal corners lie at spatial corners whose motion reverses between two consecutive frames (high temporal gradient variation).
They are identified from the local maxima of a cornerness function computed for all pixels across spatial and temporal scales.
STIP Detector
The Harris corner operator is extended to time:
Represent the video as a function f(x, y, t).
For each scale pair (σ, τ), compute Gaussian derivatives L at each pixel p, using a Gaussian kernel g with covariance
Σ = diag(σ², σ², τ²)
Second-moment matrix (derivative products integrated with a Gaussian window):

        | Lx²   LxLy  LxLt |
μ = g * | LxLy  Ly²   LyLt |
        | LxLt  LyLt  Lt²  |

Cornerness: H = det(μ) − k · trace³(μ) = λ1 λ2 λ3 − k (λ1 + λ2 + λ3)³
Spatio-temporal corners are obtained from the local maxima of H over (x, y, t).
This is analogous to the 2D Harris operator, where λ1, λ2, λ3 are the eigenvalues of μ and k is a small constant (k ≈ 0.005 in [I. Laptev, IJCV 2005]).
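A minimal sketch of this cornerness map for a single (σ, τ) pair, assuming SciPy; the function name and the integration-scale factor s are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=2.0, s=2.0, k=0.005):
    """Cornerness H = det(mu) - k*trace(mu)^3; video: float (T, H, W)."""
    sc = (tau, sigma, sigma)                          # (t, y, x) scales
    Lx = gaussian_filter(video, sc, order=(0, 0, 1))  # Gaussian derivatives
    Ly = gaussian_filter(video, sc, order=(0, 1, 0))
    Lt = gaussian_filter(video, sc, order=(1, 0, 0))
    g = lambda a: gaussian_filter(a, (s * tau, s * sigma, s * sigma))
    # Entries of the 3x3 second-moment matrix mu at every pixel.
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace**3   # STIPs = local maxima over (x, y, t)
```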
STIP summary
Detected points correspond to events such as the joining/splitting of image structures.
It is very robust but sparse.
Scale selection is computationally expensive.
Dollár detector
The response function combines 2D spatial Gaussian smoothing with a quadrature pair of 1D temporal Gabor filters:
R = (I * g * h_ev)² + (I * g * h_od)²
h_ev(t; τ, ω) = cos(2π t ω) e^(−t²/τ²)
h_od(t; τ, ω) = sin(2π t ω) e^(−t²/τ²)
g(x, y; σ) = (1 / 2πσ²) e^(−(x² + y²)/2σ²), with ω = 4/τ
Scales: σ = 2, 4, 8, ...; τ = 2, 4, 8, ...
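A minimal sketch of this response map, assuming SciPy; truncating the temporal filters at about 3τ frames is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def dollar_response(video, sigma=2.0, tau=2.0):
    """Response R = (I*g*h_ev)^2 + (I*g*h_od)^2; video: float (T, H, W)."""
    omega = 4.0 / tau
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)    # truncated support
    h_ev = np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    smoothed = gaussian_filter(video, (0, sigma, sigma))  # spatial g only
    ev = convolve1d(smoothed, h_ev, axis=0)           # temporal quadrature
    od = convolve1d(smoothed, h_od, axis=0)
    return ev**2 + od**2    # interest points = local maxima of R
```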
Hessian3D detector
It is conceptually derived from SURF extended to time: it uses box filters and integral videos to speed up computation.
It is faster and denser than Harris3D, but less dense than the Dollár detector.
It performs scale selection, but by scaling the filter rather than the image.
[Figure: detection density comparison of the Dollár detector, Hessian3D, and dense sampling]
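A minimal sketch of a Hessian3D-style saliency map, where SciPy Gaussian second derivatives stand in for the box-filter and integral-video approximation of [G. Willems et al., ECCV 2008].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian3d_saliency(video, sigma=2.0, tau=2.0):
    """Saliency |det(Hessian)| at every pixel; video: float (T, H, W)."""
    sc = (tau, sigma, sigma)
    orders = {'tt': (2, 0, 0), 'yy': (0, 2, 0), 'xx': (0, 0, 2),
              'ty': (1, 1, 0), 'tx': (1, 0, 1), 'xy': (0, 1, 1)}
    D = {k: gaussian_filter(video, sc, order=o) for k, o in orders.items()}
    det = (D['xx'] * (D['yy'] * D['tt'] - D['ty']**2)
           - D['xy'] * (D['xy'] * D['tt'] - D['ty'] * D['tx'])
           + D['tx'] * (D['xy'] * D['ty'] - D['yy'] * D['tx']))
    return np.abs(det)   # interest points = local maxima over position and scale
```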
3D Gradient (HoG)
The 3D gradient is computed at each pixel by differentiating the image function I(x, y, t), obtaining three channels:
Gx(x, y, t) = I(x+1, y, t) − I(x−1, y, t)
Gy(x, y, t) = I(x, y+1, t) − I(x, y−1, t)
Gt(x, y, t) = I(x, y, t+1) − I(x, y, t−1)
The gradient magnitude is M = √(Gx² + Gy² + Gt²).
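A minimal NumPy sketch of these channels; note that np.gradient uses central differences, i.e. the formulas above divided by two.

```python
import numpy as np

def gradient3d(video):
    """video: float array (T, H, W) -> Gx, Gy, Gt and magnitude M."""
    Gt, Gy, Gx = np.gradient(video)       # central differences along t, y, x
    M = np.sqrt(Gx**2 + Gy**2 + Gt**2)    # per-pixel gradient magnitude
    return Gx, Gy, Gt, M
```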
Quantizing the 3D gradient orientation on the sphere with regular angular bins yields bins that cover unequal solid angles. Three solutions (a sketch of the third follows below):
Solution 1: weight the orientation bins with the inverse of the solid angle.
Solution 2: use platonic solids located at the centers of each cuboid subvolume to quantize the gradient orientation (platonic solids have congruent faces, i.e. the angles corresponding to the faces are all equal) and perform the quantization by projecting the gradient vectors on the normals to the solid faces.
Solution 3 (the simplest): quantize the two orientation angles separately, which avoids rescaling of the bins and keeps the histograms dense; it computes histograms of θ = tan⁻¹(Gy/Gx) and φ = tan⁻¹(Gt/√(Gx² + Gy²)) separately.
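A minimal sketch of Solution 3, assuming NumPy; 8 bins per angle and magnitude weighting are illustrative choices.

```python
import numpy as np

def orientation_histograms(Gx, Gy, Gt, n_bins=8):
    """Separate magnitude-weighted histograms of the two gradient angles."""
    theta = np.arctan2(Gy, Gx)                        # spatial orientation
    phi = np.arctan2(Gt, np.sqrt(Gx**2 + Gy**2))      # temporal elevation
    M = np.sqrt(Gx**2 + Gy**2 + Gt**2)
    h_theta, _ = np.histogram(theta, bins=n_bins,
                              range=(-np.pi, np.pi), weights=M)
    h_phi, _ = np.histogram(phi, bins=n_bins,
                            range=(-np.pi / 2, np.pi / 2), weights=M)
    return np.concatenate([h_theta, h_phi])           # dense, no bin rescaling
```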
Optical flow
Optic flow is represented by quantizing the orientation of the velocity vector with components Vx, Vy; a no-motion bin is usually added.
M = √(Vx² + Vy²)
θ = tan⁻¹(Vy / Vx)
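A minimal sketch of such a histogram with a no-motion bin, assuming OpenCV's Farnebäck flow; the magnitude threshold and bin count are illustrative choices.

```python
import cv2
import numpy as np

def flow_histogram(prev_gray, gray, n_bins=8, min_mag=1.0):
    """HOF-style histogram: n_bins orientation bins plus one no-motion bin."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    Vx, Vy = flow[..., 0], flow[..., 1]
    M = np.sqrt(Vx**2 + Vy**2)
    theta = np.arctan2(Vy, Vx)
    moving = M >= min_mag                             # below: "no motion"
    hist, _ = np.histogram(theta[moving], bins=n_bins,
                           range=(-np.pi, np.pi), weights=M[moving])
    return np.append(hist, np.count_nonzero(~moving)) # extra no-motion bin
```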
E-SURF descriptor
The 3D cuboid is divided into cells.
The bins are filled with weighted sums of the responses of the axis-aligned Haar wavelets dx, dy, dt.
The sums of absolute values used in 2D SURF are not included, since they don't improve performance.
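A minimal sketch of the per-cell sums, where simple pixel differences stand in for the Haar box filters computed over integral videos.

```python
import numpy as np

def esurf_cell(cell):
    """cell: float array (T, H, W) for one cuboid cell -> (sum dx, dy, dt)."""
    dx = cell[:, :, 1:] - cell[:, :, :-1]   # Haar-like response along x
    dy = cell[:, 1:, :] - cell[:, :-1, :]   # along y
    dt = cell[1:, :, :] - cell[:-1, :, :]   # along t
    return np.array([dx.sum(), dy.sum(), dt.sum()])  # no |.| sums kept
```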
PCA descriptor