
Frontiers of Computational Journalism

Columbia Journalism School
Week 2: Clustering
September 17, 2012

Classification and Clustering


Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general.
- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice


Vector representation of objects


Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Each x_i is a numerical or categorical feature; N = number of features, or dimension.

Examples of features
- number of claws
- latitude
- color {red, yellow, blue}
- number of break-ins
- 1 for bought X, 0 for did not buy X
- time, duration, etc.
- number of times word Y appears in document
- votes cast
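To make this concrete, here is a minimal sketch (the record and field names are invented) of turning one object into a numeric feature vector, with a one-hot encoding for the categorical color field:

```python
import numpy as np

# Hypothetical record mixing numeric and categorical features.
record = {"claws": 4, "latitude": 40.7, "color": "yellow", "break_ins": 2}

# One-hot encode the categorical color field over its known values.
COLORS = ["red", "yellow", "blue"]
color_onehot = [1.0 if record["color"] == c else 0.0 for c in COLORS]

# Feature vector: [claws, latitude, red, yellow, blue, break_ins]
x = np.array([record["claws"], record["latitude"]] + color_onehot + [record["break_ins"]])
print(x)  # [ 4.  40.7  0.   1.   0.   2. ]
```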

Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?

Choosing Features

Journalism: How do we represent the world numerically?

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Machine learning: Which variables carry the most information?

$$\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix} \quad \text{where } k \ll N$$

Different types of quantitative data

Numeric
- continuous
- countable
- bounded?
- units of measurement?

Categorical
- finite, e.g. {on, off}
- infinite, e.g. {red, yellow, blue, ... chartreuse}
- ordered?
- equivalence classes or other structure?

Different types of scales

Temperature: continuous scale, fixed zero point, physical units, comparative, uniform

Likert scale: discrete scale, no fixed origin, abstract units, comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale

It's not linear, so is 2X1 twice as good? Is the difference (X1+c) - (X2+c) the same as X1 - X2?
Lots of things don't make much sense, such as sum(X1 ... XN) / N = ?
The average is not well defined! (Nor std dev, etc.)
But rank-order statistics are robust.
And all of this might not be a problem in practice.
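A small illustration of those last two points, with made-up Likert responses coded 1-5: the mean implicitly assumes uniform spacing between steps, while a rank-order statistic like the median does not.

```python
import numpy as np

# Hypothetical Likert responses coded 1..5 (strongly disagree .. strongly agree).
responses = np.array([1, 2, 2, 4, 5, 5, 5])

# The mean assumes the step from 1 to 2 means the same as the step from 4 to 5,
# which a non-uniform scale does not promise.
print(responses.mean())      # 3.43...

# Rank-order statistics depend only on the ordering, so they stay meaningful.
print(np.median(responses))  # 4.0
```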

Other issues with quantitative data

Where did the data come from?
- physical measurement
- computer logging
- human recording

What are the sources of error?
- measurement error
- missing data
- ambiguity in human classification
- process errors
- intentional bias / deception

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Even with all these caveats, the vector representation is incredibly flexible and powerful.

Examples of vector representations

Obvious
- movies watched / items purchased
- legislative voting history for a politician
- crime locations

Less obvious, but standard
- document vector space model
- psychological survey results
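As a sketch of the document vector space model, here is a bag-of-words encoding using scikit-learn's CountVectorizer (the documents are invented): each document becomes a vector of word counts over a shared vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the mayor voted to approve the budget",
    "the council rejected the budget proposal",
    "crime fell in the downtown district",
]

# One row per document, one column per vocabulary word, entries are counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```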

Tricky research problem: disparate field types
- corporate filing document
- Wikileaks SIGACT

What can we do with vectors?

Predict one variable based on others
- this is called regression
- supervised machine learning

Group similar items together
- this is classification or clustering
- we may or may not know pre-existing classes

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

Distance metric

Intuitively: how (dis)similar are two items?

Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Distance metric

d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- reflexivity: zero distance to self

d(x, y) = d(y, x)
- symmetry: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- triangle inequality: going direct is shorter
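As a quick sanity check, a sketch verifying these four properties numerically for the Euclidean distance on three made-up points:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

assert euclidean(x, y) >= 0                                  # never negative
assert euclidean(x, x) == 0                                  # zero distance to self
assert euclidean(x, y) == euclidean(y, x)                    # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
```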

Distance matrix
Data matrix for M objects of N dimensions

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \vdots \\ \vdots & & \ddots & \\ x_{M,1} & \cdots & & x_{M,N} \end{bmatrix}$$

Distance matrix
$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \vdots \\ \vdots & & \ddots & \\ d_{M,1} & \cdots & & d_{M,M} \end{bmatrix}$$
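A sketch of computing D from the data matrix with SciPy (the numbers are made up): pdist returns the condensed upper triangle, and squareform expands it to the full symmetric M × M matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: M = 4 objects, N = 3 features each.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [5.0, 5.0, 5.0],
              [4.0, 6.0, 5.0]])

# D[i, j] = d(x_i, x_j); zeros on the diagonal, symmetric.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```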

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

We think of a cluster like this

Real data isn't so simple

Many possible definitions of a cluster

Many possible definitions of a cluster

- every point inside is closer to the center of this cluster than to the center of any other
- no point outside this cluster is closer to the center than any point inside
- every point in this cluster is closer to all points inside than to any point outside

Different clustering algorithms

Partitioning
- keep adjusting clusters until convergence
- e.g. K-means

Agglomerative hierarchical
- start with leaves, repeatedly merge clusters
- e.g. MIN and MAX approaches

Divisive hierarchical
- start with root, repeatedly split clusters
- e.g. binary split

K-means demo

http://www.paused21.net/o/kmeans/bin/
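The demo above is interactive; a minimal scripted equivalent (synthetic two-blob data, scikit-learn's KMeans) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: two well-separated blobs in 2D.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])

# K-means alternates assigning points to the nearest center and
# recomputing centers until the assignments stop changing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # roughly (0, 0) and (3, 3)
print(km.labels_[:5])        # cluster index for the first few points
```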

Agglomerative: combining clusters

put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them

Cluster-to-cluster distance can be measured by:
- single link, or "min"
- complete link, or "max"
- average
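This loop is what SciPy's hierarchical clustering implements; a sketch on synthetic data, where the method argument selects single ("min"), complete ("max"), or average linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])

# Repeatedly merge the two closest clusters; "single"=min, "complete"=max, "average".
Z = linkage(X, method="average", metric="euclidean")

# Cut the merge tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z)  # draws the merge tree (needs matplotlib)
```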

House of Lords Clustering Demo

[Dendrogram: House of Lords members clustered by voting record, with leaves labeled by party (Con, Lab, LDem, XB, Bp, DUP, UUP, UKIP, PC, Ind, Other).]

Another approach: visualization

Let the humans find the clusters.
Problem: data is in N dimensions.

Dimensionality reduction

Given {x} ⊂ R^N, project to {y} ⊂ R^K with K ≪ N. Probably K=2 or K=3.
Want a good projection that preserves separation between clusters.

Dimensionality reduction

Principal components analysis: find a linear projection that preserves the greatest variance.

(take the first K eigenvectors of the covariance matrix corresponding to the largest eigenvalues, blah blah blah)
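A sketch of a PCA projection down to K=2 with scikit-learn (random data standing in for real features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))   # hypothetical data: 100 objects, N = 10 features

# Keep the two directions of greatest variance
# (the top eigenvectors of the covariance matrix).
pca = PCA(n_components=2)
Y = pca.fit_transform(X)

print(Y.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```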

Linear projections: unavoidable overlap

Multidimensional scaling

Idea: try to preserve distances between clusters.
Given {x_i} ⊂ R^N and a distance matrix D_ij = |x_i - x_j| for all i, j, we can recover the {x_i} coordinates exactly (up to rigid transformations).

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M by M, where M is the number of points).
The MDS formula (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.
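A sketch of Torgerson's classical MDS in NumPy: double-center the squared distance matrix and take the top-k eigenvectors. It assumes D really is a Euclidean distance matrix, and clips any negative eigenvalues to zero.

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson's classical MDS: k-dimensional coordinates from an M x M distance matrix."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[top], 0.0))
    return eigvecs[:, top] * scale            # coordinates, up to rigid transformations
```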

MDS: Stress minimization

The classic formulation minimizes "stress":

$$\text{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2$$

Think of springs between every pair of points. The spring between x_i and x_j has rest length d_ij. Stress is zero if all high-dimensional distances are matched exactly in the low dimension.
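scikit-learn's MDS minimizes a stress of this form with the SMACOF algorithm; here is a sketch starting from a precomputed distance matrix (the input points are random stand-ins):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 8))      # hypothetical high-dimensional points
D = squareform(pdist(X))          # the distances we want to preserve in 2D

# Iteratively move 2-D points to reduce the mismatch ("stress") between
# their pairwise distances and the entries of D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)

print(Y.shape)      # (30, 2)
print(mds.stress_)  # residual stress after optimization
```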

House of Lords MDS Demo

Week 2: Clustering
- Vector representation of objects
- Distance Metrics
- Clustering Algorithms
- Editorial Choice

There is no right categorization

Should we sort incident reports by location, time, actor, event type, author, cost, casualties?
There is only context-specific categorization. And the computer doesn't understand your context.

Different libraries, different categories

Visualize all possible clusterings!

Grimmer and King, 2009

We can look for "interesting" clusterings

A good plan, but:
1) Not a neutral or objective act. "Interesting" defines a journalistic frame and has associated politics. (Why is this interesting, and to whom?)
2) Danger of tweaking the algorithm until the result matches our preconceptions.

Robustness of results

If we see the same pattern using many different techniques, it's probably real
- but there can still be major interpretive errors
- what does the pattern mean, and how do we know we're even looking at the right objects?

Example: U.S. Congressional Voting

DW-NOMINATE algorithm by Poole and Rosenthal. Essentially a variant of MDS.

Example: U.S. Congressional Voting

[Scatter plot of the 112th U.S. House of Representatives: estimated ideal point (Bayesian estimator by Simon Jackman) against Obama 2008 vote in district (%), with separate LOESS curves for Republicans and Democrats. Based on 1458 non-unanimous roll calls; roll call data from the Clerk of the House. Labeled outliers include Jones (R NC3), Costa (D CA20), McGovern (D MA3), and Filner (D CA51).]

Robustness of results

Regarding these analyses of congressional voting, we could still ask:
- Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
- Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
- What are we trying to argue?
- What will be the effect of pointing out this result?

Editorial choice in:
- object selection
- object encoding
- distance function design
- clustering algorithm
- visualization algorithm
- variables to correlate against
- visualization design
- story lede
- presentation
