Information Retrieval
Jun.-Prof. Alexander Markowetz. Slides modified from Christopher Manning and Prabhakar Raghavan.
Sec. 3.1
Hashes
Each vocabulary term is hashed to an integer
(We assume you've seen hash tables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants: judgment/judgement
No prefix search [tolerant retrieval]
If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
[Figure: binary tree over the dictionary; internal nodes split the term range (a-hu, hy-m, n-sh, si-z), leaves hold terms such as aardvark, huygens, sickle, zygot]
Tree: B-tree
[Figure: B-tree whose root splits the dictionary into the ranges a-hu, hy-m, n-z]
Definition: Every internal node has a number of children in the interval [a,b], where a, b are appropriate natural numbers, e.g., [2,4].
Trees
Simplest: binary tree. More usual: B-trees
Trees require a standard ordering of characters and hence strings, but we standardly have one
Pros: Solves the prefix problem (terms starting with hyp)
Cons: Slower: O(log M) [and this requires a balanced tree]
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
WILD-CARD QUERIES
Sec. 3.2
Wild-card queries: *
mon*: find all docs containing any word beginning with mon.
Easy with binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo
*mon: find words ending in mon: harder
Maintain an additional B-tree for terms backwards.
Query processing
At this point, we have an enumeration of all terms in the dictionary that match the wildcard query.
We still have to look up the postings for each enumerated term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND queries.
Handling * in the middle, e.g., co*tion: We could look up co* AND *tion in a B-tree and intersect the two term sets
Expensive
The solution: transform wild-card queries so that the *s occur at the end.
This gives rise to the Permuterm Index.
Sec. 3.2.1
Permuterm index
For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell
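A minimal sketch (ours, not the slides') of building and querying a permuterm index in Python; a query's single * is rotated to the end, after which a prefix scan over the sorted rotations suffices:

```python
import bisect

def permuterm_rotations(term):
    """All rotations of term + '$' (the end-of-word marker)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    """Sorted list of (rotation, term) pairs: the permuterm index."""
    return sorted((rot, term) for term in vocabulary
                  for rot in permuterm_rotations(term))

def wildcard_lookup(index, query):
    """Handle a query with one '*': rotate it so '*' lands at the end,
    drop the '*', and collect all rotations with that prefix."""
    q = query + "$"
    i = q.index("*")
    key = q[i + 1:] + q[:i]          # e.g., 'm*n' -> 'n$m', 'mon*' -> '$mon'
    matches = set()
    pos = bisect.bisect_left(index, (key,))
    while pos < len(index) and index[pos][0].startswith(key):
        matches.add(index[pos][1])
        pos += 1
    return matches

# wildcard_lookup(build_permuterm(["money", "month", "lemon"]), "mon*")
# -> {'money', 'month'}
```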
Sec. 3.2.2
Bigram (k-gram) indexes
Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
Processing wild-cards
Query mon* can now be run as
$m AND mo AND on
Gets terms that match the AND version of our wildcard query.
But we'd enumerate moon. Must post-filter these terms against the query.
Surviving enumerated terms are then looked up in the term-document inverted index.
Fast, space efficient (compared to permuterm).
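A sketch of the bigram index with the post-filter step; the standard-library fnmatch is our stand-in for the final wildcard check:

```python
import fnmatch
from collections import defaultdict

def bigrams(s):
    """Character bigrams of s with boundary markers: 'mon' -> $m, mo, on, n$."""
    s = "$" + s + "$"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_bigram_index(vocabulary):
    index = defaultdict(set)                 # bigram -> set of matching terms
    for term in vocabulary:
        for bg in bigrams(term):
            index[bg].add(term)
    return index

def wildcard_terms(index, query):
    """AND the postings of all bigrams in the fixed parts of the query,
    then post-filter to remove false positives such as 'moon' for 'mon*'."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= {piece[i:i + 2] for i in range(len(piece) - 1)}
    candidates = set.intersection(*(index[g] for g in grams))
    return {t for t in candidates if fnmatch.fnmatchcase(t, query)}

# idx = build_bigram_index(["moon", "month", "money"])
# wildcard_terms(idx, "mon*") -> {'month', 'money'}   ('moon' is filtered out)
```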
Search
Type your search terms, use * if you need to. E.g., Alex* will match Alexander.
SPELLING CORRECTION
Sec. 3.3
Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting user queries to retrieve the right answers
Context-sensitive
Look at surrounding words, e.g., I flew form Heathrow to Narita.
Document correction
Especially needed for OCRed documents
Correction algorithms are tuned for this: e.g., rn is often misread as m
Can use domain-specific knowledge
E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
But also: web pages and even printed material have typos
Goal: the dictionary contains fewer misspellings
But often we don't change the documents but aim to fix the query-document mapping
Query mis-spellings
Our principal focus here
E.g., the query Alanis Morisett
We can either
Retrieve documents indexed by the correct spelling, OR
Return several suggested alternative queries with the correct spelling
Did you mean …?
Sec. 3.3.3
Edit distance
Given two strings S1 and S2, the minimum number of operations to convert one to the other.
Operations are typically character-level
Insert, Delete, Replace, (Transposition)
Generally found by dynamic programming. See http://www.merriampark.com/ld.htm for a nice example plus an applet.
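A standard dynamic-programming sketch (transposition omitted):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming: O(|s1| * |s2|)."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                          # i deletions
    for j in range(n + 1):
        dist[0][j] = j                          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # delete
                             dist[i][j - 1] + 1,          # insert
                             dist[i - 1][j - 1] + cost)   # replace / match
    return dist[m][n]

# edit_distance("judgment", "judgement") == 1
```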
Sec. 3.3.4
The alternatives disempower the user, but save a round of interaction with the user
How do we cut the set of candidate dictionary terms?
One possibility is to use n-gram overlap for this
This can also be used by itself for spelling correction.
n-gram overlap
Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
Threshold by number of matching n-grams
Variants: weight by keyboard layout, etc.
Example: november and december share the trigrams emb, mbe, and ber.
So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?
One option: the Jaccard coefficient |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and zero when they are disjoint
X and Y don't have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match
Matching bigrams
Consider the query lord we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: postings lists of the bigrams lo, or, rd in the k-gram index; lord appears in all three]
Standard postings merge will enumerate …
Adapt this to using Jaccard (or another) measure.
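A sketch of Jaccard-based matching over character n-grams; the 0.5 threshold is illustrative:

```python
def ngrams(s, n=2):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y| for two n-gram sets."""
    return len(x & y) / len(x | y)

def close_terms(query, lexicon, n=2, threshold=0.5):
    q = ngrams(query, n)
    # a real system would enumerate candidates via the n-gram index
    # rather than scanning the whole lexicon
    return {t for t in lexicon if jaccard(q, ngrams(t, n)) >= threshold}

# jaccard(ngrams("november", 3), ngrams("december", 3)) == 3/9
```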
Sec. 3.3.5
Context-sensitive correction
Need surrounding context to catch this.
First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
Now try all possible resulting phrases with one word fixed at a time
flew from heathrow
fled form heathrow
flea form heathrow
Hit-based spelling correction: Suggest the alternative that has lots of hits.
Exercise
Suppose that for flew form Heathrow we have 7 alternatives for flew, 19 for form and 3 for heathrow. How many corrected phrases will we enumerate in this scheme?
Another approach
Break the phrase query into a conjunction of biwords (Lecture 2).
Look for biwords that need only one term corrected.
Enumerate phrase matches and rank them!
SOUNDEX
Sec. 3.4
Soundex
Class of heuristics to expand a query into phonetic equivalents
Language-specific, mainly for names
E.g., chebyshev → tchebycheff
Soundex: typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of A, E, I, O, U, H, W, Y to 0 (zero).
3. Change letters to digits as follows: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6.
Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.
Will hermann generate the same code?
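Yes: a sketch that follows steps 1-6 above (vowels and h, w, y map to 0) yields H655 for both Herman and Hermann:

```python
DIGIT = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}   # everything else -> "0"

def soundex(name):
    name = name.lower()
    digits = [DIGIT.get(c, "0") for c in name[1:]]
    # step 4: collapse runs of identical digits; step 5: drop the zeros
    kept = [d for i, d in enumerate(digits)
            if (i == 0 or d != digits[i - 1]) and d != "0"]
    # step 6: pad with zeros and keep four positions
    return (name[0].upper() + "".join(kept) + "000")[:4]

# soundex("Herman") == soundex("Hermann") == "H655"
```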
Soundex
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
How useful is Soundex? Not very for information retrieval
Okay for high-recall tasks (e.g., Interpol), though biased to names of certain nationalities
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
INDEX GENERATION
Ch. 4
Index construction
How do we construct an index? What strategies can we use with limited main memory?
Sec. 4.1
Hardware basics
Many design decisions in information retrieval are based on the characteristics of hardware We begin by reviewing hardware basics
Hardware basics
Access to data in memory is much faster than access to data on disk.
Disk seeks: No data is transferred from disk while the disk head is being positioned.
Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
Hardware basics
Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
Available disk space is several (2-3) orders of magnitude larger.
Fault tolerance is very expensive: It's much cheaper to use many regular machines than one fault-tolerant machine.
Google is particularly famous for combining standard hardware in shipping containers.
Hardware assumptions
symbol  statistic                       value
s       average seek time               5 ms = 5 × 10⁻³ s
b       transfer time per byte          0.02 μs = 2 × 10⁻⁸ s
        processor's clock rate          10⁹ s⁻¹
p       low-level operation             0.01 μs = 10⁻⁸ s
        (e.g., compare & swap a word)
        size of main memory             several GB
        size of disk space              1 TB or more
Sec. 4.2
Doc 1 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
The resulting (term, docID) pairs, in order of appearance:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
Key step
After all documents have been parsed, the inverted file is sorted by terms.
Unsorted: the (term, docID) sequence above.
Sorted:
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,1) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)
Bottleneck
Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc within each term)
Doing this with random disk seeks would be too slow: must sort T = 100M records
If every comparison took 2 disk seeks, and N items could be sorted with N log2N comparisons, how long would this take?
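A back-of-the-envelope answer (using the 5 ms seek time from the hardware slide): N log₂ N ≈ 10⁸ × 26.6 ≈ 2.7 × 10⁹ comparisons; at 2 seeks per comparison and 5 ms per seek, that is 2.7 × 10⁹ × 10 ms ≈ 2.7 × 10⁷ s, i.e., roughly ten months. Hence seeking per comparison is hopeless; we need an approach that reads and writes sequentially.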
Exercise: estimate the total time to read each block from disk and quicksort it.
10 times this estimate gives us 10 sorted runs of 10M records each.
Done straightforwardly, we need 2 copies of the data on disk
But we can optimize this
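One way to realize the optimized merge: open all runs at once and keep only one head element per run in memory. A sketch using heapq.merge, assuming runs are stored as sorted "term docID" lines (our file layout, not the slides'):

```python
import heapq

def read_run(path):
    """Yield (term, doc_id) pairs from a sorted run stored one per line."""
    with open(path) as f:
        for line in f:
            term, doc_id = line.split()
            yield term, int(doc_id)

def merge_runs(run_paths, out_path):
    # heapq.merge keeps only one head element per run in memory,
    # so each run is read sequentially -- no random seeks
    with open(out_path, "w") as out:
        for term, doc_id in heapq.merge(*(read_run(p) for p in run_paths)):
            out.write(f"{term} {doc_id}\n")
```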
[Figure: merging sorted runs 1-4 from disk into a single merged run]
Sec. 4.3
SPIMI-Invert
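The SPIMI-Invert pseudocode shown here was lost in extraction; the following Python sketch follows the same idea (IIR Figure 4.4), with max_postings standing in for a real available-memory test:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """One-pass SPIMI: accumulate an in-memory index; when 'memory'
    is full, emit the block with its terms sorted."""
    index, count = defaultdict(list), 0
    for term, doc_id in token_stream:
        index[term].append(doc_id)   # postings arrive in docID order,
        count += 1                   # so no sorting of postings is needed
        if count >= max_postings:    # memory full: flush one sorted block
            yield sorted(index.items())
            index, count = defaultdict(list), 0
    if index:
        yield sorted(index.items())

# The emitted blocks are written to disk and merged afterwards, as in BSBI,
# but within a block no (term, docID) sort is ever performed.
```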
SPIMI: Compression
Compression makes SPIMI even more efficient.
Compression of terms Compression of postings
Sec. 4.4
Distributed indexing
For web-scale indexing (don't try this at home!): must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
Sec. 4.4
Distributed indexing
Maintain a master machine directing the indexing job; it is considered safe.
Break up indexing into sets of (parallel) tasks.
The master machine assigns each task to an idle machine from a pool.
Parallel tasks
We will use two sets of parallel tasks
Parsers Inverters
Break the input document collection into splits
Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)
Parsers
The master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms' first letters
(e.g., a-f, g-p, q-z); here j = 3.
Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists
Data flow
[Figure: data flow: the input is broken into splits; the master assigns splits to Parsers (map phase), which write a-f, g-p, q-z segment files; Inverters (reduce phase) produce the postings for each term partition]
MapReduce
The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing without having to write code for the distribution part. They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
MapReduce
Index construction was just one phase. Another phase: transforming a term-partitioned index into a document-partitioned index.
Term-partitioned: one machine handles a subrange of terms Document-partitioned: one machine handles a subrange of documents
As we discuss later in the course, most search engines use a document-partitioned index (better load balancing, etc.)
Sec. 4.5
Dynamic indexing
Up to now, we have assumed that collections are static. They rarely are:
Documents come in over time and need to be inserted. Documents are deleted and modified.
This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in dictionary New terms added to dictionary
Simplest approach
Maintain a big main index
New docs go into a small auxiliary index
Search across both, merge results
Deletions:
Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation bit-vector
Assumption for the rest of the lecture: The index is one big file. In reality: Use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file etc.)
Logarithmic merge
Maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory
Larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1
Either write merged Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2, etc.
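A sketch of the cascade bookkeeping, with each index reduced to a sorted Python list (merging real index files is of course more involved):

```python
def logarithmic_merge(generations, Z0, n):
    """Fold the in-memory index Z0 into the on-disk generations once it
    exceeds n postings.  generations[i] holds I_i (capacity ~ n * 2**i)
    or None; an 'index' here is just a sorted list of postings."""
    if len(Z0) < n:
        return                           # still fits in memory
    Z, i = sorted(Z0), 0
    while i < len(generations) and generations[i] is not None:
        Z = sorted(generations[i] + Z)   # merge with existing I_i
        generations[i] = None
        i += 1
    if i == len(generations):
        generations.append(None)
    generations[i] = Z                   # write Z out as generation I_i
    Z0.clear()
```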
Logarithmic merge
Auxiliary and main index: index construction time is O(T²), as each posting is touched in each merge.
Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T)
So logarithmic merge is much more efficient for index construction
But query processing now requires the merging of O(log T) indexes
Whereas it is O(1) if you just have a main and auxiliary index
How do we maintain the top ones with multiple indexes and invalidation bit vectors?
One possibility: ignore everything but the main index for such ordering
But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is then deleted
Why?
E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous
Only need to process each term once
INDEX COMPRESSION
Ch. 5
Compressing Indexes
Postings file(s)
Reduce disk space needed
Decrease time needed to read postings lists from disk
Large search engines keep a significant part of the postings in memory.
Compression lets you keep more in memory
Sec. 5.1
Reuters RCV1 statistics:
avg. # bytes per token: 4.5 (without spaces/punct.)
avg. # bytes per term: 7.5
non-positional postings: 100,000,000
Index parameters vs. what we index (details IIR Table 5.1, p. 80)

                 word types (terms)        non-positional postings     positional postings
                 (dictionary)              (non-positional index)      (positional index)
                 Size (K)   Δ%   cumul %   Size (K)   Δ%   cumul %     Size (K)   Δ%   cumul %
unfiltered       484                       109,971                     197,879
no numbers       474        -2   -2        100,680    -8   -8          179,158    -9   -9
case folding     392       -17  -19         96,969    -3  -12          179,158    -0   -9
30 stop words    391        -0  -19         83,390   -14  -24          121,858   -31  -38
150 stop words   391        -0  -19         67,002   -30  -39           94,517   -47  -52
stemming         322       -17  -33         63,812    -4  -42           94,517    -0  -52

Exercise: give intuitions for all the 0 entries. Why do some zero entries correspond to big deltas in other columns?
Lossy compression: Discard some information
Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
Chap/Lecture 7: Prune postings entries that are unlikely to turn up in the top k list for any query.
Almost no loss of quality for the top k list.
In practice, the vocabulary will keep growing with the collection size
Especially with Unicode
[Figure: Heaps' law for Reuters RCV1: vocabulary size M vs. collection size T in log-log space, with the dashed least-squares fit]
Heaps' Law
For RCV1, the dashed line log₁₀ M = 0.49 log₁₀ T + 1.64 is the best least-squares fit.
Thus, M = 10^1.64 × T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms
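A quick check of the quoted prediction with the rounded fit (k = 44, b = 0.49):

```python
k, b, T = 44, 0.49, 1_000_020
M = k * T ** b      # ≈ 38,323 predicted terms vs. 38,365 actually observed
```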
Exercises
What is the effect of including spelling errors, vs. automatically correcting spelling errors on Heaps law? Compute the vocabulary size M for this scenario:
Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
Assume a search engine indexes a total of 20,000,000,000 (2 × 10¹⁰) pages, containing 200 tokens on average
What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
Zipf's law
Heaps' law gives the vocabulary size in collections.
We also study the relative frequencies of terms.
In natural language, there are a few very frequent terms and very many very rare terms.
Zipf's law: The ith most frequent term has frequency proportional to 1/i.
cf_i ∝ 1/i = K/i, where K is a normalizing constant
cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
Zipf consequences
If the most frequent term (the) occurs cf_1 times,
then the second most frequent term (of) occurs cf_1/2 times,
and the third most frequent term (and) occurs cf_1/3 times
Equivalent: cf_i = K/i, where K is a normalizing factor, so log cf_i = log K - log i
Linear relationship between log cf_i and log i
[Figure: Zipf's law for Reuters RCV1: log-log plot of collection frequency vs. rank]
Compression
Now, we will consider compressing the space for the dictionary and postings
Basic Boolean index only
No study of positional indexes, etc.
We will consider compression schemes
Sec. 5.2
DICTIONARY COMPRESSION
[Figure: fixed-width dictionary storage as an array: 20 bytes per term, plus 4 bytes each for the document frequency and the postings pointer]
Dictionary-as-a-String: store the dictionary as one long string of characters, with a term pointer marking where each term begins.
[Figure: …systile syzygetic syzygial syzygy szaibelyite szczecin szomo… stored as one string; each entry holds Freq. (33, 29, 44, 126, …), a postings pointer, and a term pointer into the string]
Total string length = 400K × 8 B = 3.2 MB
Pointers resolve 3.2M positions: log₂ 3.2M ≈ 22 bits = 3 bytes
Blocking
Store pointers to every kth term string.
Example below: k=4.
[Figure: blocked dictionary string with term lengths stored inline (e.g., 7systile9syzygetic8syzygial6syzygy…) and term pointers only to every 4th term]
Net
Example for block size k = 4:
Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes for 4 terms),
we now use 3 + 4 = 7 bytes: one 3-byte pointer plus 4 one-byte term lengths.
Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Why not go with larger k?
Exercise
Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.
Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Exercise
Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.
Front coding
Sorted words commonly have a long common prefix: store differences only (for the last k-1 terms in a block of k)
8automata8automate9automatic10automation
becomes
8automat*a1⋄e2⋄ic3⋄ion
'*' marks the end of the common prefix automat; each number gives the length of the extra suffix.
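A sketch of encoding one block this way; we write '|' where the slide's encoding uses a diamond:

```python
def front_code_block(terms):
    """Front-code one dictionary block: the first term in full (with its
    length and the common prefix marked by '*'), then for each following
    term the extra-suffix length and the suffix itself."""
    first, prefix = terms[0], terms[0]
    for t in terms[1:]:
        while not t.startswith(prefix):   # shrink to the common prefix
            prefix = prefix[:-1]
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        out.append(f"{len(t) - len(prefix)}|{t[len(prefix):]}")
    return "".join(out)

# front_code_block(["automata", "automate", "automatic", "automation"])
# -> "8automat*a1|e2|ic3|ion"
```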
Sec. 5.3
POSTINGS COMPRESSION
Postings compression
The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly.
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
Alternatively, we can use log₂ 800,000 ≈ 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.
Hope: most gaps can be encoded/stored with far fewer than 20 bits.
If the average gap for a term is G, we want to use ~log₂ G bits per gap entry.
Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
This requires a variable length encoding
Variable length codes achieve this by using short codes for small numbers
Example
docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001
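A sketch of the encoder and decoder; following the convention above, the high bit of a byte is the continuation bit and is set on the last byte of each gap:

```python
def vb_encode(gap):
    """Variable byte code: 7 payload bits per byte; the high bit marks
    the last byte of a gap."""
    out = [gap & 0x7F]
    gap >>= 7
    while gap:
        out.append(gap & 0x7F)
        gap >>= 7
    out.reverse()
    out[-1] |= 0x80              # set the continuation bit on the final byte
    return bytes(out)

def vb_decode(data):
    gaps, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:             # last byte of this gap
            gaps.append(n)
            n = 0
    return gaps

# vb_encode(5) == b'\x85' == 0b10000101, matching the table above
```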
There is also recent work on word-aligned codes that pack a variable number of gaps into one word
Unary code
Represent n as n 1s with a final 0.
Unary code for 3 is 1110.
Unary code for 40 is 11111111111111111111111111111111111111110.
Unary code for 80 is 111111111111111111111111111111111111111111111111111111111111111111111111111111110.
This doesn't look promising, but…
Gamma codes
We can compress better with bit-level codes
The Gamma code is the best known of these.
Represent a gap G as a pair ⟨length, offset⟩
offset is G in binary, with the leading bit cut off
For example 13 → 1101 → 101
length is the length of offset; for 13 (offset 101), this is 3
We encode length with unary code: 1110.
The gamma code of 13 is the concatenation of length and offset: 1110101
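A sketch of encoding and decoding a gap sequence:

```python
def gamma_encode(gap):
    """Unary length followed by offset (binary without the leading 1)."""
    offset = bin(gap)[3:]                # strip '0b' and the leading bit
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":            # the unary part gives the offset length
            length, i = length + 1, i + 1
        i += 1                           # skip the terminating '0'
        gaps.append(int("1" + bits[i:i + length], 2))  # re-attach the leading 1
        i += length
    return gaps

# gamma_encode(13) == "1110101";  gamma_decode("1110101" + "0") == [13, 1]
```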
All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log₂ G
Gamma code is uniquely prefix-decodable, like VB
Gamma code can be used for any distribution
Gamma code is parameter-free
Compressing and manipulating at the granularity of bits can be slow
Variable byte encoding is aligned and thus potentially more efficient
Regardless of efficiency, variable byte is conceptually simpler at little additional space cost
RCV1 compression
Data structure                              Size in MB
dictionary, fixed-width                     11.2
dictionary, term pointers into string       7.6
with blocking, k = 4                        7.1
with blocking & front coding                5.9
collection (text, xml markup etc.)          3,600.0
collection (text)                           960.0
Term-doc incidence matrix                   40,000.0
postings, uncompressed (32-bit words)       400.0
postings, uncompressed (20 bits)            250.0
postings, variable byte encoded             116.0
postings, gamma encoded                     101.0
Ch. 6
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don't.
Good for expert users with precise understanding of their needs and the collection.
Also good for applications: Applications can easily consume 1000s of results.
Sec. 6.2
Each document is represented by a binary vector of the terms it contains:

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony      1                 1             0           0       0        1
Brutus      1                 1             0           1       0        0
Caesar      1                 1             0           1       1        1
Calpurnia   0                 1             0           0       0        0
Cleopatra   1                 0             0           0       0        0
mercy       1                 0             1           1       1        1
worser      1                 0             1           1       1        0
Term-document count matrix: each entry counts the occurrences of the term in the document.

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony      157               73            0           0       0        0
Brutus      4                 157           0           1       0        0
Caesar      232               227           0           2       1        1
Calpurnia   0                 10            0           0       0        0
Cleopatra   57                0             0           0       0        0
mercy       2                 0             3           5       5        1
worser      2                 0             1           1       1        0
Term frequency tf
The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
NB: frequency = count in IR
Log-frequency weighting
The log frequency weight of term t in d is
w_t,d = 1 + log₁₀ tf_t,d  if tf_t,d > 0;  0 otherwise
tf → weight: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:
score = Σ_{t ∈ q∩d} (1 + log tf_t,d)
Sec. 6.2.1
Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the collection (e.g., arachnocentric)
A document containing this term is very likely to be relevant to the query arachnocentric
We want a high weight for rare terms like arachnocentric.
idf weight
df_t is the document frequency of t: the number of documents that contain t
df_t is an inverse measure of the informativeness of t
df_t ≤ N (the number of documents)
We define the idf (inverse document frequency) of t by idf_t = log₁₀(N/df_t)
Collection frequency (cf) vs. document frequency (df):

Word        cf       df
insurance   10440    3997
try         10422    8760
Which word is a better search term (and should get a higher weight)?
Sec. 6.2.2
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
w_t,d = (1 + log₁₀ tf_t,d) × log₁₀(N/df_t)
Best known weighting scheme in information retrieval
Note: the '-' in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf × idf
Increases with the number of occurrences within a document Increases with the rarity of the term in the collection
Score(q,d) = Σ_{t ∈ q∩d} tf-idf_t,d
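A sketch of this scoring function; doc_tf and df are illustrative term-to-count dictionaries:

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); zero when the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Sum tf-idf over terms occurring in both the query and the document."""
    return sum(tf_idf(doc_tf[t], df.get(t, 0), N)
               for t in query_terms if t in doc_tf)
```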
Sec. 6.3
[Table: tf-idf weight matrix for the play example; each document is now a real-valued vector of tf-idf weights, e.g., The Tempest column is (0, 0, 0, 0, 0, 1.9, 0.11)]
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
These are very sparse vectors: most entries are zero.
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space
Key idea 2: Rank documents according to their proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents
Euclidean distance? Euclidean distance is a bad idea…
…because Euclidean distance is large for vectors of different lengths.
Key idea: Rank documents according to their angle with the query rather than their distance; since cosine is monotonically decreasing on [0°, 180°], ranking by angle is equivalent to ranking by cosine.
Length normalization
A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:
‖x‖₂ = √( Σᵢ xᵢ² )
Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
Effect on the two documents d and d′ (d′ = d appended to itself): they have identical vectors after length-normalization.
Long and short documents now have comparable weights
165
Sec. 6.3
cosine(query,document)
The dot product of unit vectors:

cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) × √(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
For length-normalized vectors, cosine similarity is simply the dot product:

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.
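A sketch over sparse weight vectors (term-to-weight dicts); for pre-normalized vectors the denominator is 1 and only the dot product remains:

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse tf-idf vectors (term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```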
[Example: the novels Sense and Sensibility (SaS), Pride and Prejudice (PaP), and Wuthering Heights (WH), with log-frequency weights for the terms affection, jealous, gossip, wuthering; after length-normalization: SaS = (0.789, 0.515, 0.335, 0), PaP = (0.832, 0.555, 0, 0), WH = (0.524, 0.465, 0.405, 0.588)]
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
Sec. 6.4
[Table: SMART notation for tf-idf weighting variants: term frequency, document frequency, and normalization options]
Columns headed 'n' are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Document length = √(1² + 0² + 1² + 1.3²) ≈ 1.92