
Principles for Lipsync Animation

keith lango, 2001

Introduction...
I recently have been asked by a few people (OK, more than a few) to try and
touch upon the area of facial animation and lipsync. Most of these requests
have come from folks reading my Pose-to-Pose Organized Keyframing tutorial
who then want some ideas on breaking down lipsync and facial animation.

Originally I had replied that facial and lipsync animation was the one area of
my work that was still undefined for me. By that I mean I hadn't taken the
time to sit down and really think about how I logically approach lipsync and
facial animation. I've always just kinda "done it", letting it flow from within me.
I enjoyed being able to pretty much do a single straight ahead pass at face and
lipsync animation with another single 'tweak' pass and to be able to call it
done. I readily admit that I don't pre-plan my lipsync at all. And I don't spend a
whole lot of time breaking down my facial animation as a whole. I do mark a
few seminal emotions I want to capture, but I don't do anything near as
organized or mechanical as the pop-thru for my body work. Basically, face and
lipsync animation was the last bastion of real heartfelt art for me, and I admit I
was reluctant to quantify that little bit of remaining magic in my art. :o) But
recently I have taken some steps towards actually quantifying this stuff.

As such, I have some thoughts on lip sync that I feel folks might be willing to
check into. Let me preface my words by stating clearly that I do not consider
myself an authority on the topic. My thoughts are pretty much just mine, and
folks may disagree with my assessment of how to approach lipsync. But the
purpose of my efforts is to try and give some concrete "hooks" for animators to
use. I want to avoid having these thoughts coming off as rules or suchlike.
They're merely ideas and theories that may help some folks get their brains
around lipsync in a different way. So with the caveats offered, I expound upon
my particular approach to lipsync.

This paper is not exhaustive, but it does begin to address how I tend to THINK
about lipsync animation conceptually. I am limiting my comments in this paper
to specifically lipsync animation. I am currently developing my thoughts for
another paper on facial animation as a whole, the sum of which will enfold this
paper's topics into itself for a holistic approach to animating characters' faces
with convincing speech and emotional acting.

In the Beginning...
Lipsync is a tricky thing to get the hang of at first. Many an animation shows
the classic example of how just about everybody approaches it at first. The
tendency is this:

1) make 'sound' targets for 'sounds' like M and E and S and Th and F and such.
(some folks even go so far as to make targets for such 'sounds' as H and G and
J and Z).
2) listen to the sound track
3) for every 'sound' you hear, hit the 'sound' target at or near 100%
4) Make a preview render of the lipsync animation
5) watch the mouth flap out of control
6) wonder what went wrong.

At least that's how it went for me at first. The problem is being too literal about
animating a character talking, trying to animate the letters in the words instead
of only emphasizing the major sounds needed to communicate the *idea* of
speech.

There's No Such Thing As Letters in Speech...

Notice how I kept putting the word 'sound' in quotes above? That's because a
common mistake for beginners is to associate LETTERS with SOUNDS.

Principle #1: Letters are not sounds. Sounds are not letters. There are NO
letters in lipsync animation.

They serve similar roles, but in wildly divergent forms. LETTERS are
representative symbols on a page (with a corresponding, arbitrarily assigned
sound) that, when strung together to form words, communicate a thought. But
letters aren't made for speech. They're for writing. And we're not animating
writing, but speech. SOUNDS are utterances (with a corresponding arbitrarily
assigned letter value used to transcribe the sound) that, when interpreted as
understood words, communicate a thought. Sounds are for speech, but serve
no use in writing. See the similarities and differences? So when you animate
speech, don't animate letters. There are no letters in speech, only sounds, and
the shape our faces take to make those sounds.
I know this sounds like an argument in semantics, but trust me, the distinction
is very real. And when you learn to approach lipsync animation from the
perspective of animating sound shapes instead of letters, your world will be a
much brighter place.

So What Does that Mean For Animation?


Let's take a look at an example: the line "you hafta get" from the 10-second
Club's November 2001 soundtrack takes about 25 frames to say. At first look, it
seems like there ought to be the following keys for the phrase:
Y (a pucker shape)
Ooo
H
Aa
V
T
Uh
G
Eh
T

That is a very literal interpretation of what it takes to show a person saying
"you hafta get". But if you go ahead and keyframe the lipsync that way, you'll
soon realize that this will result in a very poppy mouth when animated. Some of
those poses will be onscreen for only a single frame, which is too much
information and not enough time for the viewer to interpret it. A quick analysis
will show that you go from one mouth shape that is quite open (Ah in hafta) to
a pretty closed one (the F in hafta) and then back open again (for the end of
hafta). The result is the mouth popping from open to closed back to open in just
3 frames. That's not fun to watch, folks.
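
To put rough numbers on that, here is a small Python sketch. It assumes the ten literal shapes are spread evenly across the 25 frames, and the jaw-openness values are made-up illustrative guesses (0 is closed, 1 is wide open), not measurements from the soundtrack. The names are hypothetical and not tied to any animation package.

```python
# Hypothetical jaw-openness values (0 = closed, 1 = wide open) for the
# literal breakdown of "you hafta get". These numbers are illustrative guesses.
LITERAL_SHAPES = [
    ("Y", 0.2), ("Ooo", 0.3), ("H", 0.5), ("Aa", 0.9), ("V", 0.1),
    ("T", 0.2), ("Uh", 0.6), ("G", 0.4), ("Eh", 0.7), ("T", 0.2),
]
PHRASE_FRAMES = 25  # from the text: the line takes about 25 frames


def frames_per_shape(shape_count, total_frames):
    """Average screen time each shape gets if every shape is keyed."""
    return total_frames / shape_count


def count_reversals(shapes):
    """Count how often the jaw changes direction (opening <-> closing)."""
    values = [openness for _, openness in shapes]
    reversals = 0
    for prev, cur, nxt in zip(values, values[1:], values[2:]):
        # A sign change between consecutive moves means the jaw flipped direction.
        if (cur - prev) * (nxt - cur) < 0:
            reversals += 1
    return reversals


print(frames_per_shape(len(LITERAL_SHAPES), PHRASE_FRAMES))  # 2.5
print(count_reversals(LITERAL_SHAPES))  # 5
```

With these guessed values, each shape averages two and a half frames and the jaw reverses direction five times in about one second. That is the flapping mouth in number form.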

But What About My Letter..um... I Mean "Sound" Shapes?


Oftentimes beginners will make a 'phoneme' that is an exact replication of
one's face saying that single 'letter' in isolation. So we make E phonemes
saying E by itself. And we model "K" phonemes based off our own face in a
mirror saying "kuh". At first that seems more than logical enough. The problem
with that is that when you say the "t" sound by itself ('tuh'), your face doesn't
look at all like it would if you say something like "skate". And that "t" in 'skate'
gives a face shape that is completely different than the "t" sound shapes in
"petstore". And THAT "t" is very different from the "t" shape you make when
you say "goatee".

Principle #2: Mouth Shapes for Sounds Must Be Animated In Context

By context I mean this:

The preceding sound shape affects the current sound shape. Likewise, the
following sound shape is anticipated in the current sound shape.
So the shapes shown must all be in context with the shape/sound that precedes it
and follows it. When you get stuck on the idea of making all the "t" sounds in a
soundtrack the same shape, regardless of the prior or following sound/shape
context in the dialogue, then you're setting yourself up for a very poppy mouth
when animated. Remember Principle #1: animating speech is not animating letters.
It's animating the *flow* of shapes that are needed to make the present sounds
within what's being communicated.
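
One way to picture Principle #2 is as a blend. The sketch below is only an illustration under loud assumptions: the target values, the two-number description of a mouth (jaw openness and lip width), and the 25/50/25 weighting are all invented for this example. A real rig will have many more controls and you would set the blend by eye.

```python
# Hypothetical canonical targets as (jaw_open, lip_width), each in 0..1.
# All values are invented for illustration.
TARGETS = {
    "t": (0.15, 0.50),
    "oh": (0.50, 0.20),
    "ee": (0.25, 0.90),
    "eh": (0.55, 0.70),
    "s": (0.10, 0.60),
}


def shape_in_context(prev_sound, sound, next_sound, neighbor_weight=0.25):
    """Pull a sound's target toward the shapes before and after it.

    neighbor_weight is an assumed value; the current sound keeps the rest.
    """
    own_weight = 1.0 - 2.0 * neighbor_weight
    prev_shape = TARGETS[prev_sound]
    shape = TARGETS[sound]
    next_shape = TARGETS[next_sound]
    return tuple(
        round(neighbor_weight * p + own_weight * c + neighbor_weight * n, 3)
        for p, c, n in zip(prev_shape, shape, next_shape)
    )


# The same "t" target lands on a different shape in different words.
t_in_goatee = shape_in_context("oh", "t", "ee")    # g-OH-T-EE
t_in_petstore = shape_in_context("eh", "t", "s")   # p-EH-T-S-tore
print(t_in_goatee, t_in_petstore)
```

The point is not the math. It is that one stored "t" target, keyed at 100% everywhere, cannot be right for both words.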

OK, Mr. Fancypants. So Just How Should I Animate Lipsync?


The better approach is to interpret speech, to grasp the essential elements of
the communication as recorded in the sound track. To "squint your ears" and
try and pick up the overall feel of the speech.
Let's take a look at art history.
For many years up until the late 19th century, the effort in Western painting since the Renaissance
was the meticulous and accurate recreation of reality. Realism was the goal,
and literalism in interpreting a painting was the norm. Then a bunch of artists
got an idea about capturing just the overall sense of an image. They became
less interested in capturing every leaf on a tree, but began to focus on how the
light and shadow and color hues projected that tree into another realm. This
new realm of seeing was an interpretive realm where leaves didn't matter as
much as form, color, tone and contrast. At first these guys were derided as lazy
artists, too shiftless to bother with the details. But soon the world got hold of
these new paintings and was amazed to see such life and beauty where before
there were just leaves. The age of Impressionism was born, and we're all the
better off for it.

So how does that apply to us and lipsync?

Here's how: Just as the impressionist painters got away from a literal realism in
capturing a picture, we too need to get impressionistic when it comes to lipsync
animation.

Principle #3: Interpret the Lipsync Animation Like an Impressionist

If in your animation you can just get the major impressions across you can let
the little stuff slide if you want. Just like the impressionist would hint at a
cluster of leaves with a single daub of his brush, you too should let words and
sound shapes slur into the next word or sound shape. Mix the target facial
weights together to show a flow. Get away from showing leaves and start
showing contrast and form. Talking is more of a flowing thought than an
alliterative function of letters.

Impressionism Applied To Real Live LipSync...


Let's look again at our example phrase- "you hafta get". A more impressionistic
interpretation would be to emphasize the following major accents:

Ooo
aaFF
Eh

Go ahead and say that out loud. "Ooo" as in "scoop", "aaFF" as in "after" and
"Eh" as in "pet".

Ooo--aaFF--Eh.

Sounds a lot like "you hafta get", doesn't it?


Now go one further.
Grab a handheld mirror.

Now, comfortably (i.e., don't play act or over emphasize it), just say "you hafta
get".
Watch how your mouth looks as you say it again.
Now, say "oo-aaFF-eh" a few times.

See how very close the two are in how they look? You want another example of
this same principle?

Say to your mirror "I love you".


Then say to it "Elephant Shoes".

You never knew that the connection between l'amour and pachydermal
podiatry was this close!

The Devil is in the Details...


Let's take an even closer look at this from a lipsync animation point of view. For
the phrase "you hafta get" there is one special pose along with two major open
poses and two major closed poses.
The special pose is the pucker/ooo at the beginning of You.
The first major open is the "aa" at the beginning of Hafta.
The second major open pose is the "Eh" of Get.
Likewise, the first major closed pose is the FF of Hafta.
The second closed pose is the T in Get. (It's not a true closed pose, but it's
close enough for us to define it as such because it is more closed than open.)
Anyhow, by choosing to do nothing more than hit these opens and closes you
can get nearly all you need. (heck, the Muppets have gotten by on that for 30+
years!) These main target points are like the broad brushes in an impressionist
painting. They define shape, contrast, form, direction. The details of texture
come later with the specific choices you make on top of the broad brushed
open and closed pose shapes and timings. The opens and closes are the
foundation of your more specific choices.

Principle #4: Get the Opens and Closes Done Right and Build On Those

Even if all you ever do is properly hit the opens and closes and wide shapes of
the mouth at the right time you are already more than 75% of the way to great
lipsync. You can get a lot out of very little lipsync animation. And if you doubt it,
animated properties with projected texture map mouths like "Veggietales" have
proven that this is indeed true.

Getting Specific...
Here's a breakdown of some specific choices...
You'll want to start by letting the "Yuh" of You flow into the more open "aa" at
the beginning of Hafta. Skip the specific "ooo" at the end of You because it is
not very strong. It's there, but it gets said while the mouth is transitioning into
the beginning of hafta. Basically it slurs into the next word.

The H of Hafta is buried in the back of the throat, so the lips don't really need
to show it. So skip showing a specific H target for it.

Picking up from the moderately strong "aa" of hafta, hit the F for two frames to
let it read. It's the major closed point of the phrase, so that needs to line up
and read clearly.

Then skip the ending "ah" of hafta altogether, as well as the G of Get. Both
happen under the breath, they're slurred under the transition from FF to the Eh
accent of Get.

Hit that last open pose of Eh.

Then end with an appropriately shaped nearly closed mouth to catch the idea
of a T.

You've basically now animated Ooo-aaFF-Eht. And you know what? It's enough.
And the best part is it flows, it feels natural, and it doesn't pop.
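
Written out as keys, the breakdown above might look something like the following sketch. The frame numbers and jaw values are assumptions picked to fit the roughly 25-frame phrase; they are not taken from the actual soundtrack. The interpolation is plain linear only to keep the example short, and in practice you would shape the curves by hand.

```python
# (frame, label, jaw_open) keys for "Ooo-aaFF-Eht".
# Frame numbers and jaw values are assumed for illustration only.
KEYS = [
    (0, "Ooo", 0.25),   # special pucker pose at the start of "you"
    (8, "aa", 0.85),    # first major open
    (13, "FF", 0.05),   # major closed pose...
    (14, "FF", 0.05),   # ...held for two frames so it reads
    (20, "Eh", 0.70),   # second major open
    (24, "T", 0.20),    # nearly closed mouth to catch the idea of a T
]


def jaw_at(frame, keys=KEYS):
    """Linearly interpolate jaw openness between the surrounding keys."""
    if frame <= keys[0][0]:
        return keys[0][2]
    for (f0, _, v0), (f1, _, v1) in zip(keys, keys[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return v0 + t * (v1 - v0)
    return keys[-1][2]


# Largest per-frame jaw change across the phrase: a rough "pop" indicator.
biggest_step = max(abs(jaw_at(f + 1) - jaw_at(f)) for f in range(24))
print(round(biggest_step, 3))
```

Six keys instead of ten or more, and no single frame has to carry a big jump from open to closed.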

There's Gotta be More. What about those T's and Stuff?


The short answer to this question is: don't sweat it unless you really need to. I
haven't at all addressed the tongue in any of this. But if your character has a
tongue, then you can get all the inner mouth sound shapes you need with that.
The inner mouth sound shapes are:

L
Th
T
K
G (hard)

So add your tongue work in here, keeping it as impressionistic as everything
else, and you can handle the 'little stuff' quite easily. A good tip is to keep
tongue movements very quick. Don't have the tongue take longer than 2
frames to get from a position back to another, unless you have a specific
reason. Otherwise it will look for all the world like your character is saying the
"LL" sound. The word "bad" turns into "bald". "Good" becomes "gold". Keep the
tongue light and quick, just like your wits.
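
If you like to sanity check your keys, the two-frame tongue tip is easy to test for. This is a minimal sketch that assumes tongue keys are stored as simple (frame, position) pairs; the position names and the function are hypothetical.

```python
MAX_TONGUE_TRAVEL_FRAMES = 2  # from the tip above


def slow_tongue_moves(tongue_keys, max_frames=MAX_TONGUE_TRAVEL_FRAMES):
    """Return (start_frame, end_frame) for tongue moves that take too long.

    tongue_keys is a list of (frame, position) pairs sorted by frame.
    Two keys with the same position are treated as a hold, not a move.
    """
    slow = []
    for (f0, p0), (f1, p1) in zip(tongue_keys, tongue_keys[1:]):
        if p0 != p1 and (f1 - f0) > max_frames:
            slow.append((f0, f1))
    return slow


# Hypothetical keys: the tongue flicks up in 2 frames, then drifts back over 5.
keys = [(10, "rest"), (12, "up"), (13, "up"), (18, "rest")]
print(slow_tongue_moves(keys))  # [(13, 18)]
```

The slow drift back down from frame 13 to 18 is the kind of move that can turn "bad" into "bald".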

Miscellaneous Tips & Tricks & Principles...


1) Don't go from wide open to closed in one frame and vice versa. Definitely
don't go from open to closed to open in 3 frames.
2) Don't hold a mouth shape static. An "Ah" shape should shift into a slightly
different "Ah" as it's being held.
3) Keep M's and F's for 2 frames. If it's tight, steal from the previous sound.
4) Keep an eye on your targets and make sure they're not too linear in going
from one sound shape to the next.
5) Hit the sound shape at least 2 frames before the sound is heard. Even if
you're right on the nose, it will feel late when played at full speed. Humans see
things faster than they hear them, so we pick up our cues from the shape
before the sound.
6) Break up the mouth angles. Shift the mouth up and down, tilt it left or right,
get some snarls in there. Show emotion as the character speaks. We can speak
and smile, speak and frown, speak and yawn at the same time. Build rigs that
allow you to keep that kind of life in your lipsync animation.
7) Upper teeth do not move. They're nailed to your skull.
8) Jaws rotate, not slide, in characters with clearly defined head/neck areas.
9) When building your sound shapes and facial controls, don't forget the cheeks
and the nose! The cheeks move when we speak, as does our nose. The cheeks
and nose are the great connectors in facial animation, crossing the bridge from
mouth animation to eye and brow animation. By keeping your nose and cheeks
in the action you tie together the entire face of the character, creating a far
more believable character who can act.
10) Don't be afraid to go extreme. Avoid the Princess Fiona Final Fantasy
Syndrome(tm). Keep the energy of the sound track in mind when you're doing
the mouth shapes. Louder sounds with more energy should be shown with the
mouth open wider, sound shapes more extreme. Watch TV announcers talk.
Those faces are movin' baby!
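
A couple of these tips (1 and 5) are mechanical enough to sketch in code. The following is a rough illustration only: the data layout, the function names, and the thresholds for "wide open" and "closed" are assumptions, and the two-frame lead is the minimum from tip 5 rather than a fixed rule.

```python
LEAD_FRAMES = 2  # tip 5: hit the shape at least 2 frames before the sound


def lead_the_audio(cues, lead=LEAD_FRAMES):
    """Shift (audio_frame, shape) cues earlier so shapes land ahead of the sound."""
    return [(max(0, frame - lead), shape) for frame, shape in cues]


def one_frame_pops(jaw_per_frame, open_level=0.7, closed_level=0.2):
    """Frames where the jaw jumps between wide open and closed in one frame (tip 1).

    open_level and closed_level are assumed thresholds on a 0..1 jaw value.
    """
    bad = []
    for frame in range(1, len(jaw_per_frame)):
        before, after = jaw_per_frame[frame - 1], jaw_per_frame[frame]
        slammed_shut = before >= open_level and after <= closed_level
        snapped_open = before <= closed_level and after >= open_level
        if slammed_shut or snapped_open:
            bad.append(frame)
    return bad


print(lead_the_audio([(1, "M"), (10, "Ah")]))          # [(0, 'M'), (8, 'Ah')]
print(one_frame_pops([0.1, 0.8, 0.1, 0.3, 0.5, 0.8]))  # [1, 2]
```

The second example flags frames 1 and 2: closed to open to closed in three frames, which is exactly the pattern tip 1 warns against.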

Before You Go...

I hope this has helped some. We've broken down one phrase for this paper and
I'm sure it all makes perfect sense now- for that one phrase. :o)
Now the trick for you is to learn how to adapt this impressionist kind of thinking
into other phrases, other animations, other characters. Just try to keep in mind
my four "Principles" that I've stated. If you can keep those in mind then you're
well on your way to animating lipsync in a convincing, flowing manner that will
feel natural and have life. Last of all, the best thing I can suggest is that you
keep practicing. My breakdown can get you going in the right direction, but
experience is the best teacher.

-keith
