Spiros Ioannou¹, Manolis Wallace², Kostas Karpouzis¹, Amaryllis Raouzaiou¹ and Stefanos Kollias¹
¹ National Technical University of Athens,
9, Iroon Polytechniou Str., 157 80 Zografou, Athens, Greece
² University of Indianapolis, Athens Campus,
9, Ipitou Str., 105 57 Syntagma, Athens, Greece
Abstract: Since facial expressions are a key modality in human communication, the automated analysis of facial images for the estimation of the displayed expression is essential in the design of intuitive and accessible human computer interaction systems. In most existing rule-based expression recognition approaches, analysis is semi-automatic or requires high quality video. In this paper we propose a feature extraction system which combines analysis from multiple channels, based on their confidence, to result in better facial feature boundary detection. The facial features are then used for expression estimation. The proposed approach has been implemented as an extension to an existing expression analysis system in the framework of the IST ERMIS project.

Index Terms: Facial feature extraction, confidence, multiple cue fusion, human computer interaction
I. INTRODUCTION
In recent years there has been a growing interest in improving all aspects of the interaction between humans and computers, providing a realization of the term "affective computing" [15]. Humans interact with each other in a multimodal manner to convey general messages; emphasis on certain parts of a message is given via speech and display of emotions by visual, vocal, and other physiological means, even instinctively (e.g. sweating) [16].
Interpersonal communication is for the most part completed via the face. Despite common belief, social psychology research has shown that conversations are usually dominated by facial expressions, and not spoken words, indicating the speaker's predisposition towards the listener. Mehrabian indicated that the linguistic part of a message, that is the actual wording, contributes only seven percent to the effect of the message as a whole; the paralinguistic part, that is how the specific passage is vocalized, contributes thirty-eight percent, while the facial expression of the speaker contributes fifty-five percent to the effect of the spoken message [2]. This implies that facial expressions form the major modality in human communication, and need to be considered by HCI/MMI systems.
In most real-life applications nearly all video media have reduced vertical and horizontal color resolutions; moreover, the face occupies only a small percentage of the whole frame and illumination is far from perfect. When dealing with such input we have to accept that color quality and video resolution will be very poor. While it is feasible to detect the face and all facial features, it is very difficult to find the exact boundary of each one (eye, eyebrow, mouth) in order to estimate its deformation from the neutral-expression frame. Moreover, it is very difficult to fit a precise model to each feature or to employ tracking, since high-order frequency information is missing in such situations. A way to overcome this limitation is to combine the results of multiple feature extractors into a final result, based on the evaluation of their performance on each frame; the fusion method is based on the observation that having multiple masks for each feature lowers the probability that all of them are invalid, since each of them produces different error patterns.
II. EXPRESSION REPRESENTATION
An automated emotion recognition through facial expression analysis system must deal mainly with two major research areas: automatic facial feature extraction and facial expression recognition. Thus, it needs to combine low-level image processing with the results of psychological studies about facial expression and emotion perception.

Most of the existing expression recognition systems can be classified in two major categories: the former includes techniques which examine the face in its entirety (holistic approaches) and take into account properties such as intensity [9] or optical flow distributions, while the latter includes methods which operate locally, either by analyzing the motion of local features, or by separately recognizing, measuring, and combining the various facial element properties (analytic approaches). A good overview of the current state of the art is presented in [4], [10].

In this work we estimate facial expression through the estimation of the MPEG FAPs. FAPs are measured through detection of movement and deformation of local intransient facial features, such as mouth, eyes and eyebrows, in single frames. Feature deformations are estimated by comparing their states to those in a frame in which the person's expression is known to be neutral. Although FAPs [1] provide all the necessary elements for MPEG-4 compatible animation, we cannot use them directly for the analysis of expressions from video scenes, due to the absence of a clear quantitative definition framework. In order to measure FAPs in real image sequences, we have to define a mapping between them and the movement of specific FDP feature points (FPs), which correspond to salient points on the human face.
III. FEATURE EXTRACTION
An overview of the system is given in Figure 1. Precise facial feature extraction is performed, resulting in a set of masks, i.e. binary maps indicating the position and extent of each facial feature. The left-, right-, top- and bottom-most coordinates of the eye and mouth masks, the left, right and top coordinates of the eyebrow masks, as well as the nose coordinates, define the considered feature points. For the nose and each of the eyebrows, a single mask is created. On the other hand, since the detection of eyes and mouth can be problematic in low-quality images, a variety of methods is used, each resulting in a different mask. In total, we have four masks for each eye and three for the mouth. These masks have to be calculated in near-real time; the methodologies applied in the extraction of these masks include:
- A feed-forward back-propagation neural network trained to identify eye and non-eye facial areas. The network has thirteen inputs; for each pixel on the facial region the NN inputs are the luminance Y, the chrominance values Cr and Cb, and the ten most important DCT coefficients (with zigzag selection) of the neighboring 8x8 pixel area.
- A second neural network, with an architecture similar to the first one, trained to identify mouth regions.
- Luminance-based masks, which identify eyelid and sclera regions.
- Edge-based masks.
- A region growing approach to detect regions of high texture, based on standard deviation.
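The thirteen NN inputs of the first item above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper does not specify the DCT normalization or the exact zigzag direction, so the orthonormal 2-D DCT-II, the per-diagonal traversal order and the function names are our assumptions.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block, computed as C @ block @ C.T."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T

def zigzag_order(n):
    """Anti-diagonal (zigzag-style) traversal of an n x n block; the exact
    direction taken along each diagonal is our assumption."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def nn_input_vector(y, cr, cb, row, col, n=8, n_coeffs=10):
    """13-dim NN input for one pixel: luminance Y, chrominance Cr and Cb,
    plus the first 10 zigzag DCT coefficients of the surrounding
    n x n luminance neighborhood."""
    h = n // 2
    patch = y[row - h:row + h, col - h:col + h].astype(float)
    coeffs = dct2(patch)
    zz = [coeffs[r, c] for r, c in zigzag_order(n)[:n_coeffs]]
    return np.array([float(y[row, col]), float(cr[row, col]),
                     float(cb[row, col])] + zz)
```

With this ordering the first DCT coefficient kept is the DC term, so low-frequency structure of the neighborhood dominates the feature vector, which is consistent with keeping only the "most important" coefficients.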
Figure 1: System Overview. (Feature extraction: face detection, face pose correction, face segmentation into feature-candidate areas; mouth boundary extraction (3 masks) and eye boundary extraction (4 masks), each followed by validation/weight assignment; nose detection; eyebrow detection; anthropometric evaluation; confidence-based mask fusion into the final eye, mouth, nose and eyebrow masks; Feature Point (FP) generation. Expression recognition: distance vector construction, distances of neutral face, FAP estimation, facial expression decision system, recognised expression/emotional state. Neutral frame operations: face detection, eye template extraction, mouth shape detection.)
Since, as already mentioned, the detection of a mask using any of these methods can be problematic, all detected masks have to be validated against a set of criteria; of course, different criteria are applied to masks of different facial features. Each criterion examines the masks in order to decide whether they have acceptable size and position for the feature they represent. This set of criteria consists of relative anthropometric measurements, such as the relation of the eye and eyebrow vertical positions, which, when applied to the corresponding masks, produce a value in the range [0,1], with zero denoting a totally invalid mask; in this manner, a validity confidence degree is generated for each one of the initial feature masks. A subset of the distances used to form the acceptance criteria of the eyes is shown in the following example:
$d_1$: eye width
$d_2$: distance between the eye's and the eyebrow's middle vertical coordinates
$d_3$: eyebrow width
$d_4$: $D_{bp}$, bipupil breadth

$$M^{c_1}_{eye_1} = 1 - \left| 1 - \frac{d_1}{0.49\,d_4} \right|, \quad \text{and} \quad M^{c_2}_{eye_1} = 1 - \frac{d_2}{d_3}$$
where $M^{c_1}_{eye_1}$ and $M^{c_2}_{eye_1}$ are the confidence degrees acquired through the application of each validation criterion on eye mask $M_{eye_1}$. The former of the two criteria is based on [7], where the mean ratio of eye width over bipupil breadth is reported as equal to 0.49. In almost all cases these validation criteria, as well as the other criteria utilized in mask validation, produce confidence values in the [0,1] range. In the rare cases where the estimated value exceeds the limits, it is set to the closest extreme value: zero for negative values and one for values exceeding one.
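The eye-width criterion and the clipping rule can be sketched in a few lines. The penalty form below (one minus the relative deviation from the 0.49 mean ratio) is our reading of the partially garbled formula, not a verbatim reproduction; the clipping to [0,1] is as stated in the text.

```python
def clip01(v):
    """Rare out-of-range estimates are set to the closest extreme value."""
    return min(1.0, max(0.0, v))

def eye_width_confidence(d1, d4, mean_ratio=0.49):
    """Validity confidence for an eye mask from the eye-width / bipupil-breadth
    criterion: 1 when d1 matches the anthropometric mean ratio of 0.49 exactly,
    decreasing (and finally clipped to 0) as the mask deviates from it."""
    return clip01(1.0 - abs(1.0 - d1 / (mean_ratio * d4)))
```

For example, a candidate eye mask whose width equals 0.49 times the bipupil breadth receives confidence 1, while an absurdly wide mask is clipped to confidence 0.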
For the features for which more than one mask has been detected using different methodologies, the multiple masks then have to be fused together to produce a final mask. The choice of mask fusion, rather than simple selection of the mask with the greatest validity confidence, is based on the observation that the methodologies applied in the initial masks' generation produce different error patterns from each other, since they rely on different image information or exploit the same information in fundamentally different ways. Thus, combining information from independent sources has the property of alleviating a portion of the uncertainty present in the individual information components. In other words, the final masks that are acquired via mask fusion carry less uncertainty than each one of the initial masks.
The fusion algorithm is based on a Dynamic Committee Machine structure that combines the masks based on their validity confidence, producing a final mask together with the corresponding estimated confidence [18] for each facial feature. Each of those masks represents the best-effort result of the corresponding mask-extraction method used. The most common problems, especially encountered in low quality input images, are connection with other feature boundaries or mask dislocation due to noise. If $y_{comb}$ is the combined machine output and $t$ the desired output, it has been proven in committee machine (CM) theory that the combination error $(y_{comb} - t)^2$ of the different machines $f_i$ is guaranteed to be lower than the average error:
$$(y_{comb} - t)^2 = \frac{1}{M} \sum_i (y_i - t)^2 - \frac{1}{M} \sum_i (y_i - y_{comb})^2$$
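This is the standard ambiguity decomposition for an equal-weight (1/M) committee, and it can be verified numerically; the snippet below is only a demonstration of the identity, not part of the paper's system.

```python
import numpy as np

# Numeric check of the committee-machine identity: the averaged output's
# squared error equals the members' mean squared error minus their mean
# ambiguity, so it can never exceed the mean error.
rng = np.random.default_rng(0)
t = 0.7                               # desired output
y = rng.uniform(0.0, 1.0, size=5)     # outputs y_i of M = 5 member machines
y_comb = y.mean()                     # equal-weight committee output

comb_err = (y_comb - t) ** 2
mean_err = np.mean((y - t) ** 2)
ambiguity = np.mean((y - y_comb) ** 2)

assert np.isclose(comb_err, mean_err - ambiguity)
assert comb_err <= mean_err
```

Since the ambiguity term is non-negative, the committee output is guaranteed to do at least as well as the average member, which is exactly the guarantee the text appeals to.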
,
( )
( )
!d
@
!d
@
1, t
(, t
k k
k k
c,x c,x
f q
k
c,x c,x
f q
M M
h
M M
'
<
where $m^x_i$ is the element of mask $M^x_i$, $M^{c,x}_{f_i}$ is the final validation value of mask $i$, and $h_i$ is used to prevent masks with $M^{c,x}_{f_i}\, M^{c,x}_{q_i} < t_{vd}$ from contributing to the final mask. A sufficient value for $t_{vd}$ is 0.8. The role of the gating variable $g_i$ is to favor the color-based feature extraction methods ($M^e_1$, $M^m_1$) in images of high color quality and resolution. In this stage, two variables are taken into account: image resolution and color quality; since non-synthetic training data for the latter is difficult to acquire, in our first implementation the gating output of variable $g_i$ is not trained but is defined manually as follows:
$$g_i = \begin{cases} 1, & i = 1,\; D_{bp} \ge 128,\; \sigma_{cr} < t_{cr},\; \sigma_{cb} < t_{cb} \\ 1/n, & \text{otherwise} \end{cases}$$
where $D_{bp}$ is the bipupil width in pixels and $\sigma_{cr}$, $\sigma_{cb}$ are the standard deviations of the Cr and Cb channels, respectively, inside the facial area. It has been found that $\sigma_{cr}$, $\sigma_{cb}$ in the same image are less than $5 \times 10^{-3}$ for good color quality and much larger for poor quality images.
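A hedged sketch of how the gated, confidence-weighted fusion could be realized. The exact Dynamic Committee Machine combination rule is not fully recoverable from the text, so the per-pixel voting rule, the 0.5 majority threshold, the returned confidence and the function name below are our assumptions; only the $t_{vd} = 0.8$ cut-off for excluding low-confidence masks comes from the text.

```python
import numpy as np

def fuse_masks(masks, confidences, gates=None, t_vd=0.8):
    """Fuse binary feature masks by gated, confidence-weighted voting.
    Masks whose gated confidence is below t_vd are excluded (the role of
    the h_k indicator); the survivors vote per pixel with weights
    proportional to their gated confidence."""
    masks = np.asarray(masks, dtype=float)          # shape (n, H, W)
    conf = np.asarray(confidences, dtype=float)     # shape (n,)
    gates = np.ones_like(conf) if gates is None else np.asarray(gates, float)
    w = np.where(conf * gates >= t_vd, conf * gates, 0.0)
    if w.sum() == 0.0:                              # no mask is trustworthy
        return np.zeros(masks.shape[1:], dtype=bool), 0.0
    votes = np.tensordot(w, masks, axes=1) / w.sum()
    return votes >= 0.5, float(w.max())             # final mask + confidence
```

For example, given three eye masks with confidences 0.9, 0.85 and 0.3, the third falls below $t_{vd}$ and is dropped, and the final mask is the weighted pixel-majority of the other two.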
[Diagram: Dynamic Committee Machine structure — member machines $f_1, f_2, \dots, f_n$ with outputs $y_1, y_2, \dots, y_n$, gate outputs $g_1, g_2, \dots, g_n$, voting, and combined output.]
Figure 3. Original frame (a) and the four detected masks for the eyes (b)-(e) in frame 3528 of the "Alyssa" sequence [7]
Figure 4. Final mask for the eyes
Figure 5. All detected feature points from the final masks
IV. EXPRESSION ANALYSIS
The feature masks are used to extract the Feature Points (FPs) considered in the definition of the FAPs used in this work. Each FP inherits the confidence level of the final mask from which it derives; for example, the four FPs (top, bottom, left and right) of the left eye share the same confidence as the left eye final mask. Continuing, FAPs can be estimated via the comparison of the FPs of the examined frame to the FPs of a frame that is known to be neutral, i.e. a frame which is accepted by default as one displaying no facial deformations. For example, FAP $F_{37}$ (squeeze_l_eyebrow) is estimated as:
$$F_{37} = \overline{FP^n_{4.5}\,FP^n_{3.11}} - \overline{FP_{4.5}\,FP_{3.11}}$$

where $FP^n_i$, $FP_i$ are the locations of feature point $i$ on the neutral and the observed face, respectively, and $\overline{FP_i\,FP_j}$ is the measured distance between feature points $i$ and $j$.
Figure 6. MPEG-4 Feature Points (FPs)
Obviously, the uncertainty in the detection of the feature points propagates to the estimation of the value of the FAP as well. Thus, the confidence in the value of the FAP in the above example is estimated as:

$$c_{F_{37}} = \min\left(c_{FP_{4.5}},\, c_{FP_{3.11}}\right)$$
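The two formulas above translate directly into code; the snippet below is an illustrative sketch (point coordinates, 2-D Euclidean distance and function names are our assumptions, since the paper does not state how FP distances are measured).

```python
import math

def fp_dist(p, q):
    """Euclidean distance between two feature points given as (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def fap_37(fp45_n, fp311_n, fp45, fp311):
    """F37 (squeeze_l_eyebrow): change of the FP4.5-FP3.11 distance
    with respect to the neutral frame."""
    return fp_dist(fp45_n, fp311_n) - fp_dist(fp45, fp311)

def fap_37_confidence(c_fp45, c_fp311):
    """A FAP is only as reliable as its least reliable feature point."""
    return min(c_fp45, c_fp311)
```

With this sign convention the FAP is positive when the two points have moved closer together than in the neutral frame.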
On the other hand, some FAPs may be estimated in different ways. For example, FAP $F_{31}$ is estimated as:

$$F^1_{31} = \overline{FP^n_{3.1}\,FP^n_{3.3}} - \overline{FP_{3.1}\,FP_{3.3}}$$

or as

$$F^2_{31} = \overline{FP^n_{3.1}\,FP^n_{9.1}} - \overline{FP_{3.1}\,FP_{9.1}}$$
As argued above, considering both sources of information for the estimation of the value of the FAP alleviates some of the initial uncertainty in the output. Thus, for cases in which two distinct definitions exist for a FAP, the final value and confidence for the FAP are as follows:

$$F_i = \frac{F^1_i + F^2_i}{2}$$
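The averaging step for dual-definition FAPs is then straightforward. Note that the confidence rule below (mean of the two estimates' confidences) is purely our assumption: the paper's confidence formula for this case does not survive in the source, so only the value averaging reflects the text.

```python
def fuse_fap_estimates(f1, f2, c1, c2):
    """Combine two independent estimates of the same FAP:
    the value is their average, F_i = (F_i^1 + F_i^2) / 2 (from the text);
    combining the confidences by their mean is an illustrative assumption."""
    return (f1 + f2) / 2.0, (c1 + c2) / 2.0
```

For example, two estimates of 2.0 and 4.0 with confidences 0.8 and 0.6 yield a final FAP value of 3.0.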
and in our case we define the average disagreement between two observers $j$, $j'$ as:

$$D_{j,j'} = \frac{1}{D_{bp}}\, \overline{\left| M^x_j - M^x_{j'} \right|}$$

where