
Online Latent Dirichlet Allocation with Infinite Vocabulary

Ke Zhai zhaike@cs.umd.edu
Department of Computer Science, University of Maryland, College Park, MD USA
Jordan Boyd-Graber jbg@umiacs.umd.edu
iSchool and UMIACS, University of Maryland, College Park, MD USA
Abstract

Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This is reasonable in batch settings but not reasonable for streaming and online settings. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings rather than from a finite Dirichlet. We develop inference using online variational inference and—to only consider a finite number of words for each topic—propose heuristics to dynamically order, expand, and contract the set of words we consider in our vocabulary. We show our model can successfully incorporate new words and that it performs better than topic models with finite vocabularies in evaluations of topic quality and classification performance.

1. Introduction

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections (Blei et al., 2003). Topic models offer a formalism for exposing a collection's themes and have been used to aid information retrieval (Wei & Croft, 2006), understand academic literature (Dietz et al., 2007), and discover political perspectives (Paul & Girju, 2010).

As hackneyed as the term "big data" has become, researchers and industry alike require algorithms that are scalable and efficient. Topic modeling is no different. A common scalability strategy is converting batch algorithms into streaming algorithms that only make one pass over the data. In topic modeling, Hoffman et al. (2010) extended LDA to online settings.

However, this and later online topic models (Wang et al., 2011; Mimno et al., 2012) make the same limiting assumption. The namesake topics, distributions over words that evince thematic coherence, are always modeled as a multinomial drawn from a finite Dirichlet distribution. This assumption precludes additional words being added over time.

Particularly for streaming algorithms, this is neither reasonable nor appealing. There are many reasons immutable vocabularies do not make sense: words are invented ("crowdsourcing"), words cross languages ("Gangnam"), or words common in one context become prominent elsewhere ("vuvuzelas" moving from music to sports in the 2010 World Cup). To be flexible, topic models must be able to capture the addition, invention, and increased prominence of new terms.

Allowing models to expand topics to include additional words requires changing the underlying statistical formalism. Instead of assuming that topics come from a finite Dirichlet distribution, we assume that they come from a Dirichlet process (Ferguson, 1973) with a base distribution over all possible words, of which there are an infinite number. Bayesian nonparametric tools like the Dirichlet process allow us to reason about distributions over infinite supports. We review both topic models and Bayesian nonparametrics in Section 2. In Section 3, we present the infinite vocabulary topic model, which uses Bayesian nonparametrics to go beyond fixed vocabularies.

In Section 4, we derive approximate inference for our model. Since emerging vocabulary is most important in non-batch settings, in Section 5, we extend inference to streaming settings. We compare the coherence and effectiveness of our infinite vocabulary topic model against models with fixed vocabulary in Section 6.

Figure 1 shows a topic evolving during inference. The algorithm processes documents in sets we call minibatches; after each minibatch, online variational inference updates our model's parameters. This shows that out-of-vocabulary words can enter topics and eventually become high probability words.

[Figure 1: ranked word lists for one topic after minibatches 2, 3, 5, 8, 10, 16, 17, 39, 83, and 120, with the new words introduced at each minibatch listed below the columns. Settings: number of topics K = 50, truncation level T = 20K, minibatch size S = 155, DP scale parameter αβ = 5000, reordering delay U = 20, learning inertia τ0 = 256, learning rate κ = 0.6.]

Figure 1. The evolution of a single “comic book” topic from the 20 newsgroups corpus. Each column is a ranked list of
word probabilities after processing a minibatch (numbers preceding words are the exact rank). The box below the topics
contains words introduced in a minibatch. For example, “hulk” first appeared in minibatch 10, was ranked at 9659 after
minibatch 17, and became the second most important word by the final minibatch. Colors help show words’ trajectories.

2. Background

Latent Dirichlet allocation (Blei et al., 2003) assumes a simple generative process. The K topics, drawn from a symmetric Dirichlet distribution, βk ∼ Dir(η), k = {1, . . . , K}, generate a corpus of observed words:
1: for each document d in a corpus D do
2:   Choose a distribution θd over topics from a Dirichlet distribution θd ∼ Dir(αθ).
3:   for each of the n = 1, . . . , Nd word indexes do
4:     Choose a topic zn from the document's distribution over topics zn ∼ Mult(θd).
5:     Choose a word wn from the appropriate topic's distribution over words p(wn | βzn).

Implicit in this model is a finite number of words in the vocabulary because the support of the Dirichlet distribution Dir(η) is fixed. Moreover, it fixes a priori which words we can observe, a patently false assumption (Algeo, 1980).

2.1. Bayesian Nonparametrics

Bayesian nonparametrics is an appealing solution; it models arbitrary distributions with an unbounded and possibly countably infinite support. While Bayesian nonparametrics is a broad field, we focus on the Dirichlet process (DP, Ferguson 1973).

The Dirichlet process is a two-parameter distribution with scale parameter αβ and base distribution G0. A draw G from DP(αβ, G0) is modeled as

    b1, . . . , bi, . . . ∼ Beta(1, αβ),    ρ1, . . . , ρi, . . . ∼ G0.

Individual draws from a Beta distribution are the foundation for the stick-breaking construction of the DP (Sethuraman, 1994). Each break point bi models how much probability mass remains. These break points combine to form an infinite multinomial,

    βi ≡ bi ∏_{j=1}^{i−1} (1 − bj),    G ≡ Σ_i βi δρi,    (1)

where the weights βi give the probability of selecting any particular atom ρi from the base distribution.
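To make the stick-breaking construction in Eqn. (1) concrete, here is a minimal sketch that draws a truncated approximation of G ∼ DP(αβ, G0). The truncation level and the toy base distribution over strings are illustrative assumptions of ours, not part of the model definition.

```python
import numpy as np

def draw_dp_stick_breaking(alpha_beta, draw_from_G0, truncation=1000, rng=None):
    """Truncated stick-breaking draw from DP(alpha_beta, G0).

    Returns atoms (draws from the base distribution G0) and their stick
    weights beta_i = b_i * prod_{j<i} (1 - b_j), as in Eqn. (1).
    """
    rng = rng or np.random.default_rng()
    b = rng.beta(1.0, alpha_beta, size=truncation)    # break points b_i ~ Beta(1, alpha^beta)
    b[-1] = 1.0                                       # give the remaining stick to the last atom
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    weights = b * remaining                           # the beta_i of Eqn. (1)
    atoms = [draw_from_G0(rng) for _ in range(truncation)]  # rho_i ~ G0
    return atoms, weights

# Toy base distribution over strings (a stand-in for the word model of Section 3.1).
def toy_G0(rng):
    length = rng.integers(1, 8)
    return "".join(rng.choice(list("abcdefghijklmnopqrstuvwxyz"), size=length))

atoms, weights = draw_dp_stick_breaking(alpha_beta=5000.0, draw_from_G0=toy_G0)
print(atoms[:3], weights[:3], weights.sum())  # weights sum to 1 under the truncation
```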
The model we develop in Section 3 uses a base distribution over all possible words, and each topic is a draw from the Dirichlet process. This approach is inspired by unsupervised models that induce parts-of-speech.

2.2. N-gram Models in Latent Variable Models

A strength of the probabilistic formalism is the ability to embed specialized models inside more general models. The problem of part-of-speech (POS) induction (Goldwater & Griffiths, 2007) uses morphological regularity within part of speech classes (e.g., verbs in English often end with "ed") to learn a character n-gram model for parts of speech (Clark, 2003). This has been combined within the latent variable HMM via a Chinese restaurant process (Blunsom & Cohn, 2011).

We also view latent clusters of words (topics) as a nonparametric distribution with a character n-gram base distribution, but to better support streaming data sets, we use online variational inference; previous approaches used Monte Carlo methods (Neal, 1993). Variational inference is easier to distribute (Zhai et al., 2012) and amenable to online updates (Hoffman et al., 2010).

Within the topic modeling community, there are different approaches to deal with changing word use. Dynamic topic models (Blei & Lafferty, 2006) discover evolving topics by viewing word distributions as n-dimensional points undergoing Brownian motion. These models reveal compelling topical evolution; e.g., physics moving from studies of the æther to relativity to quantum mechanics. However, the models assume fixed vocabularies; we show that our infinite vocabulary model discovers more coherent topics (Section 6.2).

An elegant solution for large vocabularies is the "hashing trick" (Weinberger et al., 2009), which maps strings into a restricted set of integers via a hash function. These integers become the topic model's vocabulary. While elegant, words are no longer identifiable. However, our infinite vocabulary topic model retains identifiability and better models datasets (Section 6.3).

3. Infinite Vocabulary Topic Model

Our generative process is identical to LDA's (Section 2) except that topics are not drawn from a finite Dirichlet. Instead, topics are drawn from a DP with base distribution G0 over all possible words:
1: for each topic k do
2:   Draw words ρkt, (t = {1, 2, . . . }) from G0.
3:   Draw bkt ∼ Beta(1, αβ), (t = {1, 2, . . . }).
4:   Set stick weights βkt = bkt ∏_{s<t} (1 − bks).
The rest is identical to LDA.

3.1. A Distribution over Words

An intuitive choice for G0 is a conventional character language model. However, such a naïve approach is unrealistic and is biased to shorter words; preliminary experiments yielded poor results. Instead, we define G0 as the following distribution over strings:
1: Choose a length l ∼ Mult(λ).
2: Generate character ci ∼ p(ci | ci−n,...,i−1).
This is similar to the classic n-gram language model, except that the length is first chosen from a multinomial distribution over all lengths. Estimating conditional n-gram probabilities is well-studied in natural language processing (Jelinek & Mercer, 1985).

The full expression for the probability of a word ρ consisting of the characters c1, c2, . . . under G0 is

    G0(ρ) ≡ pWM(l = |ρ| | λ) ∏_{i=1}^{|ρ|} p(ci | ci−n,...,i−1),

where |ρ| is the length of the word. To avoid length bias, we chose the multinomial λ that minimizes the average discrepancy between word corpus probabilities pC and the probability in our word model

    λ ≡ arg min_λ Σ_ρ | pC(ρ) − pWM(ρ | λ) |²,    s.t. Σ_l λl = 1.

The n-gram statistics are estimated from an English dictionary, which need not be very large, since it is a language model over characters, not words.
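As an illustration of the word model G0, the sketch below scores a string with a length multinomial and a character bigram (n = 2) estimated from a small dictionary. For simplicity it sets λ to the dictionary's empirical length frequencies instead of solving the constrained least-squares fit above, and the add-k smoothing and start-of-word padding symbol are our own choices.

```python
import math
from collections import defaultdict

class CharNgramWordModel:
    """G0(rho) = p_WM(|rho| | lambda) * prod_i p(c_i | c_{i-1}) with a character bigram."""

    def __init__(self, dictionary_words, max_length=20, smoothing=0.1):
        self.max_length = max_length
        self.smoothing = smoothing
        self.length_prob = defaultdict(float)            # the multinomial lambda over lengths
        self.bigram = defaultdict(lambda: defaultdict(float))
        self.alphabet = set("^")                         # "^" pads the start of a word
        for w in dictionary_words:
            self.length_prob[min(len(w), max_length)] += 1.0
            prev = "^"
            for c in w:
                self.bigram[prev][c] += 1.0
                self.alphabet.add(c)
                prev = c
        total = sum(self.length_prob.values())
        for l in self.length_prob:
            self.length_prob[l] /= total

    def log_prob(self, word):
        """Log G0(word): length term plus add-k smoothed character bigram terms."""
        lp = math.log(self.length_prob.get(min(len(word), self.max_length), 1e-10))
        prev = "^"
        V = len(self.alphabet)
        for c in word:
            counts = self.bigram[prev]
            denom = sum(counts.values()) + self.smoothing * V
            lp += math.log((counts.get(c, 0.0) + self.smoothing) / denom)
            prev = c
        return lp

g0 = CharNgramWordModel(["comic", "captain", "mutant", "series", "issue", "cover"])
print(g0.log_prob("comicstrip"), g0.log_prob("zzzzzz"))
```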
4. Variational Approximation

Inference in probabilistic models uncovers the latent variables that best reconstruct observed data. The quality of this reconstruction is measured by log likelihood. For a corpus of D documents where the d-th document contains Nd words, the joint distribution is

    p(W, ρ, β, θ, z) = ∏_{k=1}^{K} ∏_{t=1}^{∞} [ p(ρkt | G0) · p(βkt | αβ) ] · ∏_{d=1}^{D} [ p(θd | αθ) ∏_{n=1}^{Nd} p(zdn | θd) p(ωdn | zdn, βzdn) ].

Directly optimizing the latent variables Z ≡ {corpus-level stick proportions β, document topic distributions θ, and word topic assignments z} is intractable, so we use variational inference (Blei et al., 2003).

To use variational inference, we select a simpler family of distributions over the latent variables Z. We call these distributions q. This family of distributions allows us to optimize a lower bound of the likelihood called the evidence lower bound (ELBO) L,

    log p(W) ≥ E_{q(Z)}[log p(W, Z)] − E_{q(Z)}[log q(Z)] = L.    (2)

Maximizing L is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true distribution and the variational distribution.

Unlike mean-field approaches (Blei et al., 2003), which assume q is a fully factorized distribution, we integrate out the word-level topic distribution vector θ: q(zd | η) is a single distribution over K^{Nd} possible topic configurations rather than a product of Nd multinomial distributions over K topics. Combined with a beta distribution q(bkt | νkt¹, νkt²) for stick break points, the variational distribution q is

    q(Z) ≡ q(β, z) = ∏_{d=1}^{D} q(zd | η) ∏_{k=1}^{K} q(bk | νk¹, νk²).    (3)

However, we cannot explicitly represent a distribution over all possible strings, so we truncate our variational stick-breaking distribution q(b | ν) to a finite set.

4.1. Truncation Ordered Set

Variational methods typically cope with infinite dimensionality of nonparametric models by truncating the distribution to a finite subset of all possible atoms that nonparametric distributions consider (Blei & Jordan, 2005; Kurihara et al., 2006; Boyd-Graber & Blei, 2009). This is done by selecting a relatively large truncation index Tk, and then stipulating that the variational distribution uses the rest of the available stick at that index, i.e., q(b_{Tk} = 1) ≡ 1. As a consequence, β is zero in expectation under q beyond that index.

However, directly applying such a technique is not feasible here, as truncation is not just a search over dimensionality but also over atom strings and their ordering. This is often a problem for nonparametric models, and the truncation that solves the problem matches the underlying probabilistic model: for mixture models, it is the number of components (Blei & Jordan, 2005); for hierarchical topic models, it is a tree (Wang & Blei, 2009); for natural language grammars, it is grammatons (Cohen et al., 2010). Similarly, our truncation is not just a fixed vocabulary size; it is a truncation ordered set (TOS). The ordering is important because the Dirichlet process is a size-biased distribution; words with lower indices are likely to have a higher probability than words with higher indices.

Each topic has a unique TOS Tk of limited size that maps every word type w to an integer t; thus t = Tk(w) is the index of the atom ρkt that corresponds to w. We defer how we choose this mapping until Section 4.3. More pressing is how we compute the two variational distributions of interest. For q(z | η), we use local collapsed MCMC sampling (Mimno et al., 2012), and for q(b | ν) we use stochastic variational inference (Hoffman et al., 2010). We describe both in turn.

4.2. Stochastic Inference

Recall that the variational distribution q(zd | η) is a single distribution over the Nd vectors of length K. While this removes the tight coupling between θ and z that often complicates mean-field variational inference, it is no longer as simple to determine the variational distribution q(zd | η) that optimizes Eqn. (2). However, Mimno et al. (2012) showed that Gibbs sampling instantiations of z*dn from the distribution conditioned on other topic assignments results in a sparse, efficient empirical estimate of the variational distribution. In our model, the conditional distribution of a topic assignment of a word with TOS index t = Tk(wdn) is

    q(zdn = k | z−dn, t = Tk(wdn)) ∝ ( Σ_{m=1, m≠n}^{Nd} I[zdm = k] + αkθ ) exp{ E_{q(ν)}[log βkt] }.    (4)

We iteratively sample from this conditional distribution to obtain the empirical distribution φdn ≡ q̂(zdn) for latent variable zdn, which is fundamentally different from the mean-field approach (Blei et al., 2003).

There are two cases to consider for computing Eqn. (4)—whether a word wdn is in the TOS for topic k or not. First, we look up the word's index t = Tk(wdn). If this word is in the TOS, i.e., t ≤ Tk, the expectations are straightforward (Mimno et al., 2012):

    q(zdn = k) ∝ ( Σ_{m=1, m≠n}^{Nd} φdmk + αkθ ) · exp{ Ψ(νkt¹) + Σ_{s<t} Ψ(νks²) − Σ_{s≤t} Ψ(νks¹ + νks²) }.    (5)

It is more complicated when a word is not in the TOS. Wang & Blei (2012) proposed a truncation-free stochastic variational approach for DPs. It provides more flexible truncation schemes than split-merge techniques (Wang & Blei, 2009). The algorithm resembles a collapsed Gibbs sampler; it does not represent all components explicitly. For our infinite vocabulary topic model, we do not ignore out of vocabulary (OOV) words; we assign these unseen words probability 1 − Σ_{t≤Tk} exp{ E_{q(ν)}[log βkt] }. The conditional distribution of an unseen word (t > Tk) is then

    q(zdn = k) ∝ ( Σ_{m=1, m≠n}^{Nd} φdmk + αkθ ) · exp{ Σ_{s≤t} [ Ψ(νks²) − Ψ(νks¹ + νks²) ] }.    (6)

This is different from finite vocabulary topic models that set vocabulary a priori and ignore OOV words.
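To illustrate the sampler, the sketch below performs a few sweeps of the conditionals in Eqns. (5) and (6) for one document, precomputing E_q[log βkt] from the stick parameters. The dense arrays, the fixed number of sweeps, and the absence of burn-in handling are simplifications of our own, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma

def sample_topic_assignments(word_tos_index, nu1, nu2, alpha_theta, num_sweeps=10, rng=None):
    """Collapsed sampling of z_dn for one document (cf. Eqns. 4-6).

    word_tos_index[n, k] is t = T_k(w_dn), 1-based and at most the truncation;
    0 means the word is outside topic k's TOS. nu1, nu2 hold the Beta parameters
    of the stick breaks (shape [K, T]). Returns the empirical distribution phi.
    """
    rng = rng or np.random.default_rng()
    K, T = nu1.shape
    N = word_tos_index.shape[0]

    # E_q[log beta_kt] for in-TOS indices (Eqn. 5) and the leftover
    # log-stick mass used for out-of-TOS words (Eqn. 6).
    log_stick = digamma(nu1) - digamma(nu1 + nu2)
    log_left = np.cumsum(digamma(nu2) - digamma(nu1 + nu2), axis=1)
    e_log_beta = log_stick + np.hstack([np.zeros((K, 1)), log_left[:, :-1]])
    e_log_oov = log_left[:, -1]

    z = rng.integers(0, K, size=N)                    # random initialization
    counts = np.bincount(z, minlength=K).astype(float)
    phi = np.zeros((N, K))

    for _ in range(num_sweeps):
        for n in range(N):
            counts[z[n]] -= 1.0                       # exclude the current token
            t = word_tos_index[n]                     # per-topic TOS indices
            in_tos = t > 0
            log_word = np.where(in_tos, e_log_beta[np.arange(K), np.maximum(t - 1, 0)], e_log_oov)
            logp = np.log(counts + alpha_theta) + log_word
            p = np.exp(logp - logp.max())
            z[n] = rng.choice(K, p=p / p.sum())
            counts[z[n]] += 1.0
            phi[n, z[n]] += 1.0
    return phi / num_sweeps

# Example: 3 tokens, 2 topics, truncation of 4 atoms per topic.
rng = np.random.default_rng(0)
nu1 = np.ones((2, 4))
nu2 = np.full((2, 4), 3000.0)
tos_idx = np.array([[1, 2], [3, 0], [0, 4]])          # 0 = word outside that topic's TOS
print(sample_topic_assignments(tos_idx, nu1, nu2, alpha_theta=0.02, rng=rng))
```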
4.3. Refining the Truncation Ordered Set

In this section, we describe heuristics to update the TOS inspired by MCMC conditional equations, a common practice for updating truncations. One component of a good TOS is that more frequent words should come first in the ordering. This is reasonable because the stick-breaking prior induces a size-biased ordering of the clusters. This has previously been used for truncation optimization for Dirichlet process mixtures and admixtures (Kurihara et al., 2007).

Another component of a good TOS is that words consistent with the underlying base distribution should be ranked higher than those not consistent with the base distribution. This intuition is also consistent with the conditional sampling equations for MCMC inference (Müller & Quintana, 2004); the probability of creating a new table with dish ρ is proportional to αβ G0(ρ) in the Chinese restaurant process.

Thus, to update the TOS, we define the ranking score of word t in topic k as

    R(ρkt) = p(ρkt | G0) Σ_{d=1}^{D} Σ_{n=1}^{Nd} φdnk δ[ωdn = ρkt],    (7)

sort all words by the scores within that topic, and then use those positions as the new TOS. In Section 5.1, we present online updates for the TOS.

5. Online Inference

Online variational inference seeks to optimize the ELBO L according to Eqn. (2) by stochastic gradient optimization. Because gradients estimated from a single observation are noisy, stochastic inference for topic models typically uses "minibatches" of S documents out of D total documents (Hoffman et al., 2010).

An approximation of the natural gradient of L with respect to ν is the product of the inverse Fisher information and its first derivative (Sato, 2001),

    Δνkt¹ = 1 + (D/|S|) Σ_{d∈S} Σ_{n=1}^{Nd} φdnk δ[ωdn = ρkt] − νkt¹,    (8)
    Δνkt² = αβ + (D/|S|) Σ_{d∈S} Σ_{n=1}^{Nd} φdnk δ[ωdn > ρkt] − νkt²,

which leads to an update of ν,

    νkt¹ = νkt¹ + εi · Δνkt¹,    νkt² = νkt² + εi · Δνkt²,    (9)

where εi = (τ0 + i)^{−κ} defines the step size of the algorithm in minibatch i. The learning rate κ controls how quickly new parameter estimates replace the old; κ ∈ (0.5, 1] is required for convergence. The learning inertia τ0 prevents premature convergence. We recover the batch setting if S = D and κ = 0.
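A minimal sketch of the per-minibatch stochastic update in Eqns. (8) and (9), assuming the sufficient statistics from the minibatch have already been accumulated; the array shapes and variable names are our own.

```python
import numpy as np

def update_sticks(nu1, nu2, phi_sum_eq, phi_sum_gt, D, S, alpha_beta, tau0, kappa, i):
    """One stochastic natural-gradient step for the stick parameters nu (Eqns. 8-9).

    phi_sum_eq[k, t] = sum over minibatch tokens of phi_dnk * [w_dn == rho_kt]
    phi_sum_gt[k, t] = sum over minibatch tokens of phi_dnk * [T_k(w_dn) > t]
    D is the corpus size, S the minibatch size, i the minibatch counter.
    """
    scale = float(D) / S
    delta_nu1 = 1.0 + scale * phi_sum_eq - nu1          # Eqn. (8), first line
    delta_nu2 = alpha_beta + scale * phi_sum_gt - nu2    # Eqn. (8), second line
    epsilon = (tau0 + i) ** (-kappa)                     # step size of Eqn. (9)
    return nu1 + epsilon * delta_nu1, nu2 + epsilon * delta_nu2

# Example with K = 2 topics and a truncation of T = 5 words per topic.
K, T = 2, 5
nu1 = np.ones((K, T))
nu2 = np.full((K, T), 5000.0)
phi_eq = np.random.rand(K, T)
phi_gt = np.random.rand(K, T)
nu1, nu2 = update_sticks(nu1, nu2, phi_eq, phi_gt, D=10000, S=155,
                         alpha_beta=5000.0, tau0=256, kappa=0.6, i=1)
```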
5.1. Updating the Truncation Ordered Set

A nonparametric streaming model should allow the vocabulary to dynamically expand as new words appear (e.g., introducing "vuvuzelas" for the 2010 World Cup), and contract as needed to best model the data (e.g., removing "vuvuzelas" after the craze passes). We describe three components of this process: expanding the truncation, refining the ordering of the TOS, and contracting the vocabulary.

Determining the TOS Ordering  This process depends on the ranking score of a word in topic k at minibatch i, R_{i,k}(ρ). Ideally, we would compute R from all data. However, only a single minibatch is accessible. We have a per-minibatch rank estimate

    r_{i,k}(ρ) = p(ρ | G0) · (D/|Si|) Σ_{d∈Si} Σ_{n=1}^{Nd} φdnk δ[ωdn = ρ],

which we interpolate with our previous ranking

    R_{i,k}(ρ) = (1 − εi) · R_{i−1,k}(ρ) + εi · r_{i,k}(ρ).    (10)

We introduce an additional algorithm parameter, the reordering delay U. We found that reordering after every minibatch (U = 1) was not effective; we explore the role of reordering delay in Section 6. After U minibatches have been observed, we reorder the TOS for each topic according to the words' ranking score R in Eqn. (10); Tk(w) becomes the rank position of w according to the latest R_{i,k}.

Expanding the Vocabulary  Each minibatch contains words we have not seen before. When we see them, we must determine their relative rank position in the TOS, their rank scores, and their associated variational parameters. The latter two issues are relevant for online inference because both are computed via interpolations from previous values in Eqns. (10) and (9). For an unseen word ω, previous values are undefined. Thus, we set R_{i−1,k} for unobserved words to be 0, ν to be 1, and Tk(ω) to Tk + 1 (i.e., increase the truncation and append to the TOS).

Contracting the Vocabulary  To ensure tractability we must periodically prune the words in the TOS. When we reorder the TOS (after every U minibatches), we only keep the top T terms, where T is a user-defined integer. A word type ρ will be removed from Tk if its index Tk(ρ) > T, and its previous information (e.g., rank and variational parameters) is discarded. In a later minibatch, if a previously discarded word reappears, it is treated as a new word.
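The sketch below ties the three maintenance steps together for a single topic—appending unseen words, interpolating the ranking score as in Eqn. (10), and reordering and pruning every U minibatches. The data structures are our own simplification, not the authors' code.

```python
def update_topic_vocabulary(state, minibatch_counts, g0_prob, D, S_i,
                            epsilon, minibatch_i, U=20, T=20000):
    """Maintain one topic's truncation ordered set across minibatches.

    state:            dict with keys 'order' (list of words, best first),
                      'rank' (word -> R_{i,k}) and 'nu' (word -> [nu1, nu2]).
    minibatch_counts: word -> sum_{d,n} phi_dnk * [w_dn == word] for this minibatch.
    g0_prob:          word -> p(word | G0).
    """
    # Expand: unseen words get rank 0, nu = 1, and are appended to the ordering.
    for word in minibatch_counts:
        if word not in state['rank']:
            state['rank'][word] = 0.0
            state['nu'][word] = [1.0, 1.0]
            state['order'].append(word)

    # Interpolate the ranking score (Eqn. 10); words absent from the minibatch
    # contribute a per-minibatch estimate r of zero.
    for word in state['rank']:
        r = g0_prob(word) * (float(D) / S_i) * minibatch_counts.get(word, 0.0)
        state['rank'][word] = (1.0 - epsilon) * state['rank'][word] + epsilon * r

    # Reorder and contract only every U minibatches.
    if minibatch_i % U == 0:
        state['order'].sort(key=lambda w: state['rank'][w], reverse=True)
        for dropped in state['order'][T:]:            # prune below the truncation level
            del state['rank'][dropped]
            del state['nu'][dropped]
        state['order'] = state['order'][:T]
    return state
```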

[Figure 2 plots: PMI vs. minibatch under varying truncation level T and reordering delay U, with curves for different infvoc DP scale parameters αβ. Panels: (a) de-news, S = 140; (b) de-news, S = 245; (c) 20 newsgroups, S = 155; (d) 20 newsgroups, S = 310.]

Figure 2. PMI score on de-news (Figure 2(a) and 2(b), K = 10) and 20 newsgroups (Figure 2(c) and 2(d), K = 50) against
different settings of DP scale parameter αβ , truncation level T and reordering delay U , under learning rate κ = 0.8 and
learning inertia τ0 = 64. Our model is more sensitive to αβ and less sensitive to T .

[Figure 3 plots: PMI vs. minibatch for learning rates κ ∈ {0.6, 0.7, 0.8, 0.9, 1.0} and learning inertia τ0 ∈ {64, 256}. Panels: (a) de-news, S = 245 and K = 10; (b) 20 newsgroups, S = 155 and K = 50.]

Figure 3. PMI score on two datasets with reordering delay U = 20 against different settings of decay factor κ and τ0. A suitable choice of DP scale parameter αβ increases the performance significantly. Learning parameters κ and τ0 jointly define the step decay. Larger step sizes promote better topic evolution.

6. Experimental Evaluation

In this section, we evaluate the performance of our infinite vocabulary topic model (infvoc) on two corpora: de-news¹ and 20 newsgroups². Both corpora were parsed by the same tokenizer and stemmer with a common English stopword list (Bird et al., 2009). First, we examine its sensitivity to both model parameters and online learning rates. Having chosen those parameters, we then compare our model with other topic models with fixed vocabularies.

¹ A collection of daily news items between 1996 and 2000 in English. It contains 9,756 documents, 1,175,526 word tokens, and 20,000 distinct word types. Available at homepages.inf.ed.ac.uk/pkoehn/publications/de-news.
² A collection of discussions in 20 different newsgroups. It contains 18,846 documents and 100,000 distinct word types. It is sorted by date into roughly 60% training and 40% testing data. Available at qwone.com/~jason/20Newsgroups.

Evaluation Metric  Typical evaluation of topic models is based on held-out likelihood or perplexity. However, creating a strictly fair comparison for our model against existing topic model algorithms is difficult, as traditional topic model algorithms must discard words that have not previously been observed. Moreover, held-out likelihood is a flawed proxy for how topic models are used in the real world (Chang et al., 2009). Instead, we use two evaluation metrics: topic coherence and classification accuracy.

Pointwise mutual information (PMI), which correlates with human perceptions of topic coherence, measures how words fit together within a topic. Following Newman et al. (2009), we extract document co-occurrence statistics from Wikipedia and score a topic's coherence by averaging the pairwise PMI score (w.r.t. Wikipedia co-occurrence) of the topic's ten highest ranked words. Higher average PMI implies a more coherent topic. Classification accuracy is the accuracy of a classifier learned from the topic distribution of training documents applied to test documents (the topic model sees both sets). A higher accuracy means the unsupervised topic model better captures the underlying structure of the corpus. To better simulate real-world situations, 20 newsgroups' test/train split is by date (test documents appeared after training documents).
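As an illustration of the coherence metric, the following sketch averages pairwise PMI over a topic's ten highest-ranked words given document co-occurrence counts from a reference corpus; the smoothing of zero counts is our own choice, since the text does not specify one.

```python
import math
from itertools import combinations

def topic_pmi(top_words, doc_freq, co_doc_freq, num_docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    doc_freq[w]            = number of reference documents (e.g., Wikipedia) containing w
    co_doc_freq[(w1, w2)]  = number of documents containing both w1 and w2 (keys sorted)
    """
    scores = []
    for w1, w2 in combinations(top_words[:10], 2):
        key = tuple(sorted((w1, w2)))
        p_joint = co_doc_freq.get(key, 0) / num_docs
        p1 = doc_freq.get(w1, 0) / num_docs
        p2 = doc_freq.get(w2, 0) / num_docs
        scores.append(math.log((p_joint + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)

top = ["comic", "issue", "series", "cover", "hulk"]
df = {"comic": 900, "issue": 1200, "series": 1500, "cover": 800, "hulk": 150}
cdf = {("comic", "issue"): 400, ("comic", "hulk"): 120, ("issue", "series"): 500}
print(topic_pmi(top, df, cdf, num_docs=100000))
```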
Comparisons  We evaluate the performance of our model (infvoc) against three other models with fixed vocabularies: online variational Bayes LDA (fixvoc-vb, Hoffman et al. 2010), online hybrid LDA (fixvoc-hybrid, Mimno et al. 2012), and dynamic topic models (dtm, Blei & Lafferty 2006). Including dynamic topic models is not a fair comparison, as its inference requires access to all of the documents in the dataset; unlike the other algorithms, it is not online.

Vocabulary  For fixed vocabulary models, we must decide on a vocabulary a priori. We consider two different vocabulary methods: use the first minibatch to define a vocabulary (null) or use a comprehensive dictionary³ (dict). We use the same dictionary to train infvoc's base distribution.

³ http://sil.org/linguistics/wordlists/english/

Experiment Configuration  For all models, we use the same symmetric document Dirichlet prior with αθ = 1/K, where K is the number of topics. Online models see exactly the same minibatches. For dtm, which is not an online algorithm but instead partitions its input into "epochs", we combine documents in ten consecutive minibatches into an epoch (longer epochs tended to have worse performance; this was the shortest epoch that had reasonable runtime).

For online hybrid approaches (infvoc and fixvoc-hybrid), we collect 10 samples empirically from the variational distribution in the E-step with 5 burn-in sweeps. For fixvoc-vb, we run 50 iterations for local parameter updates.

6.1. Sensitivity to Parameters

Figure 2 shows how the PMI score is affected by the DP scale parameter αβ, the truncation level T, and the reordering delay U. The relatively high values of αβ may be surprising to readers used to seeing a DP that instantiates dozens of atoms, but when vocabularies are in the tens of thousands, such scale parameters are necessary to support the long tail. Although we did not investigate such approaches, this suggests that more advanced nonparametric distributions (Teh, 2006) or explicitly optimizing αβ may be useful. Relatively large values of U suggest that accurate estimates of the rank order are important for maintaining coherent topics.

While infvoc is sensitive to parameters related to the vocabulary, once suitable values of those parameters are chosen, it is no more sensitive to learning-specific parameters than other online LDA algorithms (Figure 3), and values used for other online topic models also work well here.

6.2. Comparing Algorithms: Coherence

Now that we have some idea of how we should set parameters for infvoc, we compare it against other topic modeling techniques. We used grid search to select parameters for each of the models⁴ and plotted the topic coherence averaged over all topics in Figure 4. While infvoc initially holds its own against other models, it does better and better in later minibatches, since it has managed to gain a good estimate of the vocabulary and the topic distributions have stabilized. Most of the gains in topic coherence come from highly specific proper nouns which are missing from the vocabularies of the fixed-vocabulary topic models. This advantage holds even against dtm, which uses batch inference.

⁴ For the de-news dataset, we select (20 newsgroups parameters in parentheses) minibatch size S ∈ {140, 245} (S ∈ {155, 310}), DP scale parameter αβ ∈ {1k, 2k} (αβ ∈ {3k, 4k, 5k}), truncation size T ∈ {3k, 4k} (T ∈ {20k, 30k, 40k}), reordering delay U ∈ {10, 20} for infvoc; and topic chain variable tcv ∈ {0.001, 0.005, 0.01, 0.05} for dtm.

[Figure 4 plots: PMI vs. minibatch comparing dtm-dict, fixvoc-vb (dict/null), fixvoc-hybrid (dict/null), and infvoc. Panels: (a) de-news, S = 245, K = 10, κ = 0.6 and τ0 = 64 (infvoc: αβ = 2k, T = 4k, U = 10; dtm-dict: tcv = 0.01); (b) 20 newsgroups, S = 155, K = 50, κ = 0.8 and τ0 = 64 (infvoc: αβ = 5k, T = 20k, U = 20; dtm-dict: tcv = 0.05).]

Figure 4. PMI score on two datasets against different models. Our model infvoc yields a better PMI score against fixvoc and dtm; gains are more marked in later minibatches as more and more proper names have been added to the topics. Because dtm is not an online algorithm, we do not have detailed per-minibatch coherence statistics and thus show topic coherence as a box plot per epoch.

6.3. Comparing Algorithms: Classification

For the classification comparison, we consider additional topic models. While we need the most probable topic strings for PMI calculations, classification experiments only need a document's topic vector. Thus, we consider hashed vocabulary schemes. The first, which we call dict-hashing, uses a dictionary for the known words and hashes any other words into the same set
of integers. The second, full-hash, used in Vowpal Wabbit,⁵ hashes all words into a set of T integers.

⁵ hunch.net/~vw/

We train 50 topics for all models on the entire dataset and collect the document-level topic distribution for every article. We treat such statistics as features and train an SVM classifier on all training data using Weka (Hall et al., 2009) with default parameters. We then use the classifier to label testing documents with one of the 20 newsgroup labels. A higher accuracy means the model is better capturing the underlying content.
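A rough sketch of this evaluation pipeline, using scikit-learn's LinearSVC in place of the Weka SVM actually used in the paper and assuming the per-document topic proportions have already been extracted:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def classify_with_topic_features(train_theta, train_labels, test_theta, test_labels):
    """Train a linear SVM on document-topic proportions and report test accuracy.

    train_theta / test_theta: arrays of shape [num_docs, num_topics] holding each
    document's topic distribution (the features extracted from the topic model).
    """
    clf = LinearSVC()
    clf.fit(train_theta, train_labels)
    predictions = clf.predict(test_theta)
    return accuracy_score(test_labels, predictions)

# Toy example: 50-dimensional topic vectors for a handful of documents.
rng = np.random.default_rng(0)
train_theta = rng.dirichlet(np.ones(50), size=100)
test_theta = rng.dirichlet(np.ones(50), size=40)
train_labels = rng.integers(0, 20, size=100)
test_labels = rng.integers(0, 20, size=40)
print(classify_with_topic_features(train_theta, train_labels, test_theta, test_labels))
```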
Our model infvoc captures better topic features than online LDA fixvoc (Table 1) under all settings.⁶ This suggests that in a streaming setting, infvoc can better categorize documents. However, the batch algorithm dtm, which has access to the entire dataset, performs better because it can use later documents to retrospectively improve its understanding of earlier ones. Unlike dtm, infvoc only sees early minibatches once and cannot revise its model when it is tested on later minibatches.

⁶ Parameters were chosen via cross-validation on a 30%/70% dev-test split from the following parameter settings: DP scale parameter αβ ∈ {2k, 3k, 4k}, reordering delay U ∈ {10, 20} (for infvoc only); truncation level T ∈ {20k, 30k, 40k} (for infvoc and fixvoc full-hash models); step decay factors τ0 ∈ {64, 256} and κ ∈ {0.6, 0.7, 0.8, 0.9, 1.0} (for all online models); and topic chain variable tcv ∈ {0.01, 0.05, 0.1, 0.5} (for dtm only).

model                      settings                      accuracy %
(S = 155, τ0 = 64, κ = 0.6)
infvoc                     αβ = 3k, T = 40k, U = 10      52.683
fixvoc vb-dict                                           45.514
fixvoc vb-null                                           49.390
fixvoc hybrid-dict                                       46.720
fixvoc hybrid-null                                       50.474
fixvoc vb dict-hash                                      52.525
fixvoc vb full-hash        T = 30k                       51.653
fixvoc hybrid dict-hash                                  50.948
fixvoc hybrid full-hash    T = 30k                       50.948
dtm-dict                   tcv = 0.001                   62.845
(S = 310, τ0 = 64, κ = 0.6)
infvoc                     αβ = 3k, T = 40k, U = 20      52.317
fixvoc vb-dict                                           44.701
fixvoc vb-null                                           51.815
fixvoc hybrid-dict                                       46.368
fixvoc hybrid-null                                       50.569
fixvoc vb dict-hash                                      48.130
fixvoc vb full-hash        T = 30k                       47.276
fixvoc hybrid dict-hash                                  51.558
fixvoc hybrid full-hash    T = 30k                       43.008
dtm-dict                   tcv = 0.001                   64.186

Table 1. Classification accuracy based on 50 topic features extracted from 20 newsgroups data. Our model (infvoc) outperforms algorithms with a fixed or hashed vocabulary but not dtm, a batch algorithm that has access to all documents.

6.4. Qualitative Example

Figure 1 shows the evolution of a topic in 20 newsgroups about comics as new vocabulary words enter from new minibatches. While topics improve over time (e.g., relevant words like "seri(es)", "issu(e)", "forc(e)" are ranked higher), interesting words are being added throughout training and become prominent after later minibatches are processed (e.g., "captain", "comicstrip", "mutant"). This is not the case for standard online LDA—these words are ignored and the model does not capture such information. In addition, only about 60% of the word types appeared in the SIL English dictionary. Even with a comprehensive English dictionary, online LDA could not capture all the word types in the corpus, especially named entities.

7. Conclusion and Future Work

We proposed an online topic model that, instead of assuming the vocabulary is known a priori, adds and sheds words over time. While our model is better able to create coherent topics, it does not outperform dynamic topic models (Blei & Lafferty, 2006; Wang et al., 2008) that explicitly model how topics change. It would be interesting to allow such models to—in addition to modeling the change of topics—also change the underlying dimensionality of the vocabulary.

In addition to explicitly modeling the change of topics over time, it is also possible to model additional structure within topics. Rather than a fixed, immutable base distribution, modeling each topic with a hierarchical character n-gram model would capture regularities in the corpus that would, for example, allow certain topics to favor different orthographies (e.g., a technology topic might prefer words that start with "i"). While some topic models have attempted to capture orthography for multilingual applications (Boyd-Graber & Blei, 2009), our approach is more robust, and incorporating our approach with models of transliteration (Knight & Graehl, 1997) might allow concepts expressed in one language to better capture concepts in another, further improving the ability of algorithms to capture the evolving themes and topics in large, streaming datasets.

Acknowledgments

The authors thank Chong Wang, Dave Blei, and Matt Hoffman for answering questions and sharing code. We thank Jimmy Lin and the anonymous reviewers for helpful suggestions. Research supported by NSF grant #1018625. Any opinions, conclusions, or recommendations are the authors' and not those of the sponsors.

References

Algeo, John. Where do all the new words come from? American Speech, 55(4):264–277, 1980.

Bird, Steven, Klein, Ewan, and Loper, Edward. Natural Language Processing with Python. O'Reilly Media, 2009.

Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.

Blei, David M. and Lafferty, John D. Dynamic topic models. In ICML, 2006.

Blei, David M., Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

Blunsom, Phil and Cohn, Trevor. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In ACL, 2011.

Boyd-Graber, Jordan and Blei, David M. Multilingual topic models for unaligned text. In UAI, 2009.

Chang, Jonathan, Boyd-Graber, Jordan, and Blei, David M. Connections between the lines: Augmenting social networks with text. In KDD, 2009.

Clark, Alexander. Combining distributional and morphological information for part of speech induction. 2003.

Cohen, Shay B., Blei, David M., and Smith, Noah A. Variational inference for adaptor grammars. In NAACL, 2010.

Dietz, Laura, Bickel, Steffen, and Scheffer, Tobias. Unsupervised prediction of citation influences. In ICML, 2007.

Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In ACL, 2007.

Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. The WEKA data mining software: An update. SIGKDD Explorations, 11, 2009.

Hoffman, Matthew, Blei, David M., and Bach, Francis. Online learning for latent Dirichlet allocation. In NIPS, 2010.

Jelinek, F. and Mercer, R. Probability distribution estimation from sparse data. IBM Technical Disclosure Bulletin, 28:2591–2594, 1985.

Knight, Kevin and Graehl, Jonathan. Machine transliteration. In ACL, 1997.

Kurihara, Kenichi, Welling, Max, and Vlassis, Nikos. Accelerated variational Dirichlet process mixtures. In NIPS, 2006.

Kurihara, Kenichi, Welling, Max, and Teh, Yee Whye. Collapsed variational Dirichlet process mixture models. In IJCAI, 2007.

Mimno, David, Hoffman, Matthew, and Blei, David. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.

Müller, Peter and Quintana, Fernando A. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110, 2004.

Neal, Radford M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.

Newman, David, Karimi, Sarvnaz, and Cavedon, Lawrence. External evaluation of topic models. In ADCS, 2009.

Paul, Michael and Girju, Roxana. A two-dimensional topic-aspect model for discovering multi-faceted topics. 2010.

Sato, Masa-Aki. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, July 2001.

Sethuraman, Jayaram. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

Teh, Yee Whye. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, 2006.

Wang, Chong and Blei, David. Variational inference for the nested Chinese restaurant process. In NIPS, 2009.

Wang, Chong and Blei, David M. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, 2012.

Wang, Chong, Blei, David M., and Heckerman, David. Continuous time dynamic topic models. In UAI, 2008.

Wang, Chong, Paisley, John, and Blei, David. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.

Wei, Xing and Croft, Bruce. LDA-based document models for ad-hoc retrieval. In SIGIR, 2006.

Weinberger, K. Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113–1120. ACM, 2009.

Zhai, Ke, Boyd-Graber, Jordan, Asadi, Nima, and Alkhouja, Mohamad. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.
