Professional Documents
Culture Documents
– data cleaning
– removing duplicates
slide 3
Identification of Duplicates Given Name, Address, Age
Matching Information
Name Address Age
slide 5
Problem of record linkage
problem - quickly and accurately determining if pairs of
records describe the same entity, but unique IDs to bring
together the matching records are lacking
slide 6
Files A & B, record a in A & record b in B
File A File B
matching matching
variables variables
v1 ... v K X Y w1 ... w K
a
slide 7
Methodology of record linkage
slide 8
Deterministic linkage
problems
– often no unique, known and accurate ID
– missing values & partial agreements common
slide 11
Data problems
typos/mispelling
incomplete words
extraneous information
abbreviations
– data cleaning
– processing data fields to recognize similarity
– duplicate
– match (link)
post-linkage
Input Match
DB A
Possible
Search space Search Space Decision Model - match
Reduction C!A x B Application
AxB
Non-match
Input
DB B
slide 15
Hypothetical example
NA = 10 ambulance cases
http://injuryprevention.bmj.com/content/10/3/186.full.pdf
slide 16
Prior probability of a match
NX 1 9 1
Pr(match) = × = × = 0.045
NA NB 10 20
slide 24
U-probability (Unmatch probability)
U-probability: probability that a field agrees given
that the pair of records is NOT a true match
slide 26
U-probability (Unmatch probability) . . .
slide 28
U-probability (Unmatch probability) . . .
600 Andersons
30 Rumplestilskins
estimated U-probability
slide 29
Estimating match (probabilistic) weights
for a given field with match probability M and
unmatch probability U
slide 31
Fellegi-Sunter model
no-decision region
(hold for human review)
designate as designate as
definite non-match definite match
* true matches
! true non-matches
sim(a, b)
Posterior odds and posterior probabilities
slide 33
Posterior odds and posterior probabilities . . .
posterior odds
posterior probability =
1 + posterior odds
for pair A10-E19, the posterior probability is .9999
slide 36
Probabilistic linkage . . .
slide 37
Probabilistic linkage . . .
designate as designate as
definite non-match definite match
* true matches
! true non-matches
sim(a, b)
Deterministic versus probabilistic methods
studies using human review or artificially withheld
identifying information as ‘gold standard’
Gomatam S et al (2002) An empirical comparison of record
linkage procedures, Statistics in Medicine, 21, 1485-1496
http://nisla05.niss.org/dgii/presentations/gomatam-sic-
200305.pdf
Roos LL, Walld R, Wajda A, Bond R, Hartford K (1996)
Record linkage strategies, outpatient procedures, and
administrative data. Medical Care 34, 570–582
Jamieson E, Roberts J, Browne G (1995) The feasibility and
accuracy of anonymized record linkage to estimate shared
clientele among three health and social service agencies.
Methods Inf Med. 34, 371–377 slide 41
RELAIS
JAVA based
http://www.istat.it/strumenti/metodi/software/
analisi dati/relais/
slide 42
References
Herzog TN, Scheuren FJ & Winkler WE (2007).
Data quality and record linkage techniques. New
York: Springer. 234 pp. Part IV discusses software.
Howe GR (1998) Use of computerized record linkage in
cohort studies. Epidemiologic Reviews 20, 112–121
Krewski D, Dewanji A, Wang Y, Bartlett S, Zielinski JM &
Mallick R (2005) The effect of record linkage errors on risk
estimates in cohort mortality studies. Survey Methodology
31,13–21
Brenner H, Schmidtmann I & Stegmaier C (1997) Effect of
record linkage errors on registry-based follow-up studies.
Statistics in Medicine 16, 2633–2643
slide 43