
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 8: Evaluation

Measures for a search engine (Sec. 8.6)

How fast does it index?
Number of documents/hour
(average document size)

How fast does it search?
Latency as a function of index size

Expressiveness of query language
Ability to express complex information needs
Speed on complex queries

Uncluttered UI
Is it free?

Measures for a search engine (Sec. 8.6)

All of the preceding criteria are measurable: we can quantify speed and size, and we can make expressiveness precise.

The key measure: user happiness


What is this?
Speed of response and size of index are factors
But blindingly fast, useless answers won't make a user happy

Need a way of quantifying user happiness

Happiness: elusive to measure (Sec. 8.1)

Most common proxy: relevance of search results
But how do you measure relevance?
We will detail a methodology here, then examine
its issues
Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A usually binary assessment of either Relevant or
Nonrelevant for each query and each document

Some work on more-than-binary, but not the standard



Evaluating an IR system (Sec. 8.1)

Note: the information need is translated into a query
Relevance is assessed relative to the information need, not the query
E.g., Information need: I'm looking for information on
whether drinking red wine is more effective at
reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
Evaluate whether the doc addresses the information
need, not whether it has these words

Standard relevance benchmarks (Sec. 8.2)

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
Reuters and other benchmark doc collections used
Retrieval tasks specified
sometimes as queries

Human experts mark, for each query and for each doc, Relevant or Nonrelevant
or at least for a subset of docs that some system returned for that query

Unranked retrieval evaluation: Precision and Recall (Sec. 8.3)

Precision: fraction of retrieved docs that are relevant
= P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved
= P(retrieved|relevant)
                 Relevant   Nonrelevant
Retrieved           tp          fp
Not Retrieved       fn          tn

Precision P = tp/(tp + fp)


Recall R = tp/(tp + fn)
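As a quick check of these definitions, here is a minimal sketch in Python (function names are my own, not from the lecture):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved docs that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of relevant docs that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 30 relevant retrieved, 10 irrelevant retrieved, 20 relevant missed
print(precision(tp=30, fp=10))  # 0.75
print(recall(tp=30, fn=20))     # 0.6
```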

Unranked retrieval evaluation: Precision and Recall (Sec. 8.3)

Precision
The ability to retrieve top-ranked documents that are
mostly relevant.

Recall
The ability of the search to find all of the relevant items in
the corpus.

Precision/Recall (Sec. 8.3)

You can get high recall (but low precision) by
retrieving all docs for all queries!
Recall is a non-decreasing function of the number
of docs retrieved
In a good system, precision decreases as either the
number of docs retrieved or recall increases
This is not a theorem, but a result with strong empirical
confirmation

Trade-off between Recall and Precision

[Figure: precision (y-axis) vs. recall (x-axis) trade-off. One end of the curve returns relevant documents but misses many useful ones too; the other end returns most relevant documents but includes lots of junk; the ideal is high precision at high recall.]

F-Measure
One measure of performance that takes into account
both recall and precision.
Harmonic mean of recall and precision:

F = 2PR / (P + R) = 2 / (1/P + 1/R)
Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
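A minimal sketch of this computation (helper name is my own):

```python
def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# The harmonic mean punishes imbalance: both values must be high.
print(f_measure(0.9, 0.9))  # 0.9
print(f_measure(1.0, 0.1))  # ~0.18 (the arithmetic mean would be 0.55)
```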


E Measure (parameterized F Measure)


A variant of F measure that allows weighting
emphasis on precision over recall:

E = (1 + β²)PR / (β²P + R)

Value of β controls the trade-off:


β = 1: Equally weight precision and recall (E = F).
β > 1: Weight recall more.
β < 1: Weight precision more.
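A sketch of this parameterized measure, assuming the (1 + β²)PR / (β²P + R) form reconstructed above:

```python
def e_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted combination of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if (b2 * p + r) else 0.0

print(e_measure(0.5, 0.8, beta=1.0))  # ~0.615, equal to F at beta = 1
print(e_measure(0.5, 0.8, beta=2.0))  # ~0.714, pulled toward recall
```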


Ranked retrieval evaluation: Precision and Recall (Sec. 8.3)

Precision: fraction of retrieved docs that are relevant
= P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved
= P(retrieved|relevant)
                 Relevant   Nonrelevant
Retrieved           tp          fp
Not Retrieved       fn          tn

Precision P = tp/(tp + fp)


Recall R = tp/(tp + fn)


Computing Recall/Precision Points


For a given query, produce the ranked list of
retrievals.
Adjusting a threshold on this ranked list produces
different sets of retrieved documents, and therefore
different recall/precision measures.
Mark each document in the ranked list that is
relevant according to the gold standard.
Compute a recall/precision pair for each position in the ranked list that contains a relevant document (see the sketch below).
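A minimal sketch of this procedure (function name and data layout are my own):

```python
def recall_precision_points(ranking, num_relevant):
    """For each relevant document in the ranked list, emit (recall, precision)
    at that rank. `ranking` holds booleans in rank order: True means the doc
    at that rank is relevant according to the gold standard."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    return points

# Example 1 below: relevant docs at ranks 1, 2, 4, 6, 13; six relevant in total
ranking = [i in {1, 2, 4, 6, 13} for i in range(1, 15)]
print(recall_precision_points(ranking, num_relevant=6))
# approx. [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]
```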

Example 1
 n   doc #   relevant
 1   588     x
 2   589     x
 3   576
 4   590     x
 5   986
 6   592     x
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x
14   990
Let total # of relevant docs = 6


Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38

Missing one relevant document, so we never reach 100% recall.

Example 2
 n   doc #   relevant
 1   588     x
 2   576
 3   589     x
 4   342
 5   590     x
 6   717
 7   984
 8   772     x
 9   321     x
10   498
11   113
12   628
13   772
14   592     x

Let total # of relevant docs = 6


Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/3 = 0.667
R = 3/6 = 0.5;    P = 3/5 = 0.6
R = 4/6 = 0.667;  P = 4/8 = 0.5
R = 5/6 = 0.833;  P = 5/9 = 0.556
R = 6/6 = 1.0;    P = 6/14 = 0.429

Interpolating a Recall/Precision Curve


Interpolate a precision value for each standard recall level:
r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0

The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th levels:

P(r_j) = max { P(r) : r_j ≤ r ≤ r_(j+1) }
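A minimal sketch of 11-point interpolation, using the common variant that takes the maximum precision at any recall at or above each standard level (a slightly stronger form of the rule above); the data are the (recall, precision) points from Example 1:

```python
def interpolate_11pt(points):
    """points: (recall, precision) pairs for one query, in rank order.
    Returns interpolated precision at the 11 standard recall levels 0.0..1.0,
    taking the max precision at any recall >= that level (0 if none exists)."""
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

points = [(1/6, 1/1), (2/6, 2/2), (3/6, 3/4), (4/6, 4/6), (5/6, 5/13)]
print([round(p, 3) for p in interpolate_11pt(points)])
# [1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.667, 0.385, 0.385, 0.0, 0.0]
```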


Example 1

[Figure: precision-recall curve for Example 1 (precision vs. recall, both axes from 0.0 to 1.0)]

Example 2

[Figure: precision-recall curve for Example 2 (precision vs. recall, both axes from 0.0 to 1.0)]

Average Recall/Precision Curve


Typically average performance over a large set of
queries.
Compute average precision at each standard recall
level across all queries.
Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
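A minimal sketch of this averaging step, assuming each query has already been reduced to its 11 interpolated precision values as in the sketch above:

```python
def average_curve(per_query_curves):
    """Element-wise average of 11-point interpolated precision curves,
    one curve (a list of 11 precision values) per query."""
    n = len(per_query_curves)
    return [sum(curve[i] for curve in per_query_curves) / n for i in range(11)]

# Interpolated curves for two queries (Examples 1 and 2 under the variant above)
q1 = [1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.667, 0.385, 0.385, 0.0, 0.0]
q2 = [1.0, 1.0, 0.667, 0.667, 0.6, 0.6, 0.556, 0.556, 0.556, 0.429, 0.429]
print(average_curve([q1, q2]))
```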


Compare Two or More Systems


The curve closest to the upper right-hand corner of
the graph indicates the best performance
[Figure: average precision-recall curves for two systems, NoStem and Stem (precision vs. recall)]

Sample RP Curve for CF Corpus

[Figure: sample recall-precision curve for the CF corpus]

R-Precision
Precision at the R-th position in the ranking of results
for a query that has R relevant documents.
 n   doc #   relevant
 1   588     x
 2   589     x
 3   576
 4   590     x
 5   986
 6   592     x
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x
14   990

R = # of relevant docs = 6

R-Precision = 4/6 = 0.67
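A minimal sketch (function name is my own):

```python
def r_precision(ranking, num_relevant):
    """Precision at rank R, where R is the number of relevant docs for the query.
    `ranking` holds booleans (True = relevant) in rank order."""
    return sum(ranking[:num_relevant]) / num_relevant if num_relevant else 0.0

# Relevant docs at ranks 1, 2, 4, 6, 13; R = 6
ranking = [i in {1, 2, 4, 6, 13} for i in range(1, 15)]
print(r_precision(ranking, num_relevant=6))  # 4/6 ≈ 0.67
```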


Mean Average Precision (MAP)


Average Precision: Average of the precision values at the
points at which each relevant document is retrieved.
Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633 (the relevant doc that is never retrieved contributes 0)
Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625

Mean Average Precision: Average of the average precision values over a set of queries.
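A minimal sketch that reproduces the two numbers above (function names are my own):

```python
def average_precision(relevant_ranks, num_relevant):
    """Mean of the precision values at each relevant document's rank;
    relevant docs that are never retrieved contribute 0."""
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / num_relevant

def mean_average_precision(queries):
    """queries: list of (relevant_ranks, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

ex1 = ([1, 2, 4, 6, 13], 6)   # one relevant doc never retrieved
ex2 = ([1, 3, 5, 8, 9, 14], 6)
print(average_precision(*ex1))              # ~0.633
print(average_precision(*ex2))              # ~0.625
print(mean_average_precision([ex1, ex2]))   # ~0.629
```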

