
Frontiers of Computational Journalism

Columbia Journalism School
Week 2: Clustering
September 17, 2012

Classification and Clustering


Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general.
- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice


Vector representation of objects


Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Each x_i is a numerical or categorical feature; N = number of features, or dimension.

Examples of features
- number of claws
- latitude
- color {red, yellow, blue}
- number of break-ins
- 1 for bought X, 0 for did not buy X
- time, duration, etc.
- number of times word Y appears in document
- votes cast
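To make this concrete, here is a minimal sketch (the record and field names are invented) of turning one object into a numeric feature vector, with a one-hot encoding for the categorical color field:

```python
import numpy as np

# Hypothetical record mixing numeric and categorical features.
record = {"claws": 4, "latitude": 40.7, "color": "yellow", "break_ins": 2}

# One-hot encode the categorical color field over its known values.
COLORS = ["red", "yellow", "blue"]
color_onehot = [1.0 if record["color"] == c else 0.0 for c in COLORS]

# Feature vector: [claws, latitude, red, yellow, blue, break_ins]
x = np.array([record["claws"], record["latitude"]] + color_onehot + [record["break_ins"]])
print(x)  # [ 4.  40.7  0.   1.   0.   2. ]
```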

Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?

Choosing Features

Journalism: How do we represent the world numerically?

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Machine learning: Which variables carry the most information?

$$\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix} \quad \text{where } k \ll N$$

Different types of quantitative data

Numeric
- continuous
- countable
- bounded?
- units of measurement?

Categorical
- finite, e.g. {on, off}
- infinite, e.g. {red, yellow, blue, ... chartreuse}
- ordered?
- equivalence classes or other structure?

Different types of scales

Temperature: continuous scale, fixed zero point, physical units, comparative, uniform

Likert scale: discrete scale, no fixed origin, abstract units, comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale

It's not linear, so is 2X1 twice as good? Is the difference (X1+c) - (X2+c) the same as X1 - X2?
Lots of things don't make much sense, such as sum(X1 ... XN) / N = ?
The average is not well defined! (Nor std dev, etc.)
But rank-order statistics are robust.
And all of this might not be a problem in practice.
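A small illustration of those last two points, with made-up Likert responses coded 1-5: the mean implicitly assumes uniform spacing between steps, while a rank-order statistic like the median does not.

```python
import numpy as np

# Hypothetical Likert responses coded 1..5 (strongly disagree .. strongly agree).
responses = np.array([1, 2, 2, 4, 5, 5, 5])

# The mean assumes the step from 1 to 2 means the same as the step from 4 to 5,
# which a non-uniform scale does not promise.
print(responses.mean())      # 3.43...

# Rank-order statistics depend only on the ordering, so they stay meaningful.
print(np.median(responses))  # 4.0
```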

Other issues with quantitative data

Where did the data come from?
- physical measurement
- computer logging
- human recording

What are the sources of error?
- measurement error
- missing data
- ambiguity in human classification
- process errors
- intentional bias / deception

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Even with all these caveats, the vector representation is incredibly flexible and powerful.

Examples of vector representations

Obvious
- movies watched / items purchased
- legislative voting history for a politician
- crime locations

Less obvious, but standard
- document vector space model
- psychological survey results
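As a sketch of the document vector space model, here is a bag-of-words encoding using scikit-learn's CountVectorizer (the documents are invented): each document becomes a vector of word counts over a shared vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the mayor voted to approve the budget",
    "the council rejected the budget proposal",
    "crime fell in the downtown district",
]

# One row per document, one column per vocabulary word, entries are counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```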

Tricky research problem: disparate field types
- corporate filing document
- Wikileaks SIGACT

What can we do with vectors?

Predict one variable based on others
- this is called regression
- supervised machine learning

Group similar items together
- this is classification or clustering
- we may or may not know pre-existing classes

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

Distance metric

Intuitively: how (dis)similar are two items?

Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Distance metric

d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- reflexivity: zero distance to self

d(x, y) = d(y, x)
- symmetry: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- triangle inequality: going direct is shorter
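As a quick sanity check, a sketch verifying these four properties numerically for the Euclidean distance on three made-up points:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

assert euclidean(x, y) >= 0                                  # never negative
assert euclidean(x, x) == 0                                  # zero distance to self
assert euclidean(x, y) == euclidean(y, x)                    # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
```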

Distance matrix
Data matrix for M objects of N dimensions

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \vdots \\ \vdots & & \ddots & \\ x_{M,1} & \cdots & & x_{M,N} \end{bmatrix}$$

Distance matrix
$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \vdots \\ \vdots & & \ddots & \\ d_{M,1} & \cdots & & d_{M,M} \end{bmatrix}$$
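A sketch of computing D from the data matrix with SciPy (the numbers are made up): pdist returns the condensed upper triangle, and squareform expands it to the full symmetric M × M matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: M = 4 objects, N = 3 features each.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [5.0, 5.0, 5.0],
              [4.0, 6.0, 5.0]])

# D[i, j] = d(x_i, x_j); zeros on the diagonal, symmetric.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```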

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

We think of a cluster like this

Real data isn't so simple

Many possible definitions of a cluster

Many possible definitions of a cluster

- every point inside is closer to the center of this cluster than to the center of any other
- no point outside this cluster is closer to the center than any point inside
- every point in this cluster is closer to all points inside than to any point outside

Different clustering algorithms

Partitioning
- keep adjusting clusters until convergence
- e.g. K-means

Agglomerative hierarchical
- start with leaves, repeatedly merge clusters
- e.g. MIN and MAX approaches

Divisive hierarchical
- start with root, repeatedly split clusters
- e.g. binary split

K-means demo

http://www.paused21.net/o/kmeans/bin/
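The demo above is interactive; a minimal scripted equivalent (synthetic two-blob data, scikit-learn's KMeans) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: two well-separated blobs in 2D.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])

# K-means alternates assigning points to the nearest center and
# recomputing centers until the assignments stop changing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # roughly (0, 0) and (3, 3)
print(km.labels_[:5])        # cluster index for the first few points
```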

Agglomerative: combining clusters

put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them

Cluster-to-cluster distance can be measured by:
- single link, or "min"
- complete link, or "max"
- average
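This loop is what SciPy's hierarchical clustering implements; a sketch on synthetic data, where the method argument selects single ("min"), complete ("max"), or average linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])

# Repeatedly merge the two closest clusters; "single"=min, "complete"=max, "average".
Z = linkage(X, method="average", metric="euclidean")

# Cut the merge tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z)  # draws the merge tree (needs matplotlib)
```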

House of Lords Clustering Demo

[Dendrogram: House of Lords members clustered by voting record, with leaves labeled by party (Con, Lab, LDem, XB, Bp, DUP, UUP, UKIP, PC, Ind, Other).]

Another approach: visualization

Let the humans find the clusters.
Problem: data is in N dimensions.

Dimensionality reduction

Given {x} ⊂ R^N, project to {y} ⊂ R^K with K ≪ N. Probably K=2 or K=3.
Want a good projection that preserves separation between clusters.

Dimensionality reduction

Principal components analysis: find a linear projection that preserves the greatest variance.

(take the first K eigenvectors of the covariance matrix corresponding to the largest eigenvalues, blah blah blah)
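A sketch of a PCA projection down to K=2 with scikit-learn (random data standing in for real features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))   # hypothetical data: 100 objects, N = 10 features

# Keep the two directions of greatest variance
# (the top eigenvectors of the covariance matrix).
pca = PCA(n_components=2)
Y = pca.fit_transform(X)

print(Y.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```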

Linear projections: unavoidable overlap

Multidimensional scaling

Idea: try to preserve distances between clusters.
Given {x_i} ⊂ R^N and a distance matrix D_ij = |x_i - x_j| for all i, j, we can recover the {x_i} coordinates exactly (up to rigid transformations).

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M by M, where M is the number of points).
The MDS formula (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.
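A sketch of Torgerson's classical MDS in NumPy: double-center the squared distance matrix and take the top-k eigenvectors. It assumes D really is a Euclidean distance matrix, and clips any negative eigenvalues to zero.

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson's classical MDS: k-dimensional coordinates from an M x M distance matrix."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[top], 0.0))
    return eigvecs[:, top] * scale            # coordinates, up to rigid transformations
```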

MDS: Stress minimization

The classic formulation minimizes "stress":

$$\text{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2$$

Think of springs between every pair of points. The spring between x_i and x_j has rest length d_ij. Stress is zero if all high-dimensional distances are matched exactly in the low dimension.
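scikit-learn's MDS minimizes a stress of this form with the SMACOF algorithm; here is a sketch starting from a precomputed distance matrix (the input points are random stand-ins):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 8))      # hypothetical high-dimensional points
D = squareform(pdist(X))          # the distances we want to preserve in 2D

# Iteratively move 2-D points to reduce the mismatch ("stress") between
# their pairwise distances and the entries of D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)

print(Y.shape)      # (30, 2)
print(mds.stress_)  # residual stress after optimization
```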

House of Lords MDS Demo

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

There is no right categorization

Should we sort incident reports by location, time, actor, event type, author, cost, casualties?
There is only context-specific categorization. And the computer doesn't understand your context.

Different libraries, different categories

Visualize all possible clusterings!

Grimmer and King, 2009

We can look for "interesting" clusterings

A good plan, but:
1) Not a neutral or objective act. "Interesting" defines a journalistic frame and has associated politics. (Why is this interesting, and to whom?)
2) Danger of tweaking the algorithm until the result matches our preconceptions.

Robustness of results

If we see the same pattern using many different techniques, it's probably real
- but there can still be major interpretive errors
- what does the pattern mean, and how do we know we're even looking at the right objects?

Example: U.S. Congressional Voting

DW-NOMINATE algorithm by Poole and Rosenthal. Essentially a variant of MDS.

Example: U.S. Congressional Voting

[Scatter plot of the 112th U.S. House of Representatives: estimated ideal point (Bayesian estimator by Simon Jackman) against Obama 2008 vote in district (%), with separate LOESS curves for Republicans and Democrats. Based on 1458 non-unanimous roll calls; roll call data from the Clerk of the House. Labeled outliers include Jones (R NC3), Costa (D CA20), McGovern (D MA3), and Filner (D CA51).]

Robustness of results

Regarding these analyses of congressional voting, we could still ask:
- Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
- Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
- What are we trying to argue?
- What will be the effect of pointing out this result?

Editorial choice in:
- object selection
- object encoding
- distance function design
- clustering algorithm
- visualization algorithm
- variables to correlate against
- visualization design
- story lede
- presentation
