Chapter 10. 5
[Figure: ranking pipeline diagram. Labels: Web Pages; Indexer; Inverted Index; Anchor Text Generator; Backward Link (Anchor Text); Web Graph Constructor; Web Topology Graph; Web Data; Relevance Ranking (similarity based on content or text); Importance Ranking (Link Analysis); Rank Functions]
12/08/21 Data Mining: Principles and Algorithms 5
Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- Like the index in a book
[Figure: disks with documents -> indexing -> inverted index]
Example inverted index (each term followed by its postings list):
  aalborg     3452, 11437, ...
  ...
  arm         4, 19, 29, 98, 143, ...
  armada      145, 457, 789, ...
  armadillo   678, 2134, 3970, ...
  armani      90, 256, 372, 511, ...
  ...
  zz          602, 1189, 3209, ...
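A minimal sketch of the idea in Python, assuming whitespace tokenization and lowercase terms (real indexers also handle stemming, positions, and compression). The helper names `build_inverted_index` and `query_and` and the toy documents are illustrative only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def query_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy collection (hypothetical documents, keyed by document ID).
docs = {
    4: "the arm of the robot",
    19: "arm wrestling rules",
    145: "the spanish armada",
    678: "an armadillo has an armoured shell",
}
index = build_inverted_index(docs)
print(index["arm"])
print(query_and(index, ["the", "arm"]))
```

A query is answered by looking up each term's postings list and intersecting the lists, so the cost depends on the list lengths rather than on the size of the whole collection.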
The PageRank Algorithm
Basic idea
The significance of a page is determined by the significance of the pages linking to it
More precisely:
Link graph: adjacency matrix $A$, with $A_{ij} = 1$ if page $i$ links to page $j$, and $A_{ij} = 0$ otherwise
Construct a probability transition matrix $M$ by renormalizing each row of $A$ to sum to 1
$\epsilon U + (1 - \epsilon) M$, with $U_{ij} = 1/n$ for all $i, j$
Treat the web graph as a Markov chain (random surfer)
The vector of PageRank scores $p$ is then defined to be the stationary distribution of this Markov chain. Equivalently, $p$ is the principal right eigenvector of the transition matrix $(\epsilon U + (1 - \epsilon) M)^T$:
$(\epsilon U + (1 - \epsilon) M)^T p = p$
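The stationary distribution can be approximated by power iteration. The sketch below is one way to do it in plain Python, under some assumptions the slide does not state: the weight on the uniform matrix U is called `eps` and set to 0.15, and a page with no out-links is treated as linking to every page uniformly. The three-page graph is made up for illustration.

```python
def pagerank(adjacency, eps=0.15, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution of eps*U + (1 - eps)*M."""
    n = len(adjacency)
    # Row-normalize A into M. Assumption: a page without out-links gets a
    # uniform row, so that every row of M still sums to 1.
    M = []
    for row in adjacency:
        out = sum(row)
        M.append([a / out for a in row] if out else [1.0 / n] * n)

    p = [1.0 / n] * n
    for _ in range(max_iter):
        total = sum(p)
        # Entry j of (eps*U + (1 - eps)*M)^T p, using U_ij = 1/n.
        new_p = [
            eps * total / n + (1 - eps) * sum(M[i][j] * p[i] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = pagerank(A)
print(scores)
```

Page 2 receives links from both other pages and ends up with the highest score, while page 1, which only receives half of page 0's weight, ends up lowest.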
Layout Structure
Compared to plain text, a web page is a 2D presentation
Rich visual effects created by different term types, formats,
[Figure legend: Importance = Low / Med / High]
presentation > Free text document
related content display, does not necessarily reflect semantic structure
How about XML?
A long way to go to
presentation.
Procedure:
Partition the web page top-down based on the separators
Result
A tree structure in which each node corresponds to a block in the page.
Each node is assigned a value (its Degree of Coherence).
Hierarchy or flat
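To make the top-down partitioning concrete, here is a deliberately simplified one-dimensional toy in Python. It is not the VIPS algorithm: the separator weights, the `threshold`, and the `partition` and `leaves` helpers are all hypothetical, and the Degree of Coherence is not modeled. It only shows how splitting at the strongest separator, recursively, produces a block tree.

```python
def partition(items, separators, threshold):
    """Recursively split `items` at the strongest separator >= `threshold`.

    separators[k] is the (hypothetical) weight of the gap between
    items[k] and items[k + 1]. Returns a nested dict representing the tree.
    """
    node = {"items": list(items), "children": []}
    if len(items) < 2:
        return node
    k = max(range(len(separators)), key=lambda i: separators[i])
    if separators[k] < threshold:
        return node  # no strong separator left, so this block is a leaf
    node["children"] = [
        partition(items[:k + 1], separators[:k], threshold),
        partition(items[k + 1:], separators[k + 1:], threshold),
    ]
    return node


def leaves(node):
    """Flatten the tree into its leaf blocks, left to right."""
    if not node["children"]:
        return [node["items"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


# Four page elements with made-up separator weights between neighbours.
tree = partition(["nav", "headline", "story", "footer"], [9, 2, 7], threshold=5)
print(leaves(tree))
```

Keeping the whole tree gives the hierarchical view of the page, while taking only the leaves gives the flat view.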
[Figure: two plots against the Combining Parameter (0 to 1); series: VIPS (Block Retrieval), Baseline (Doc Retrieval)]
[Figure: two plots of Average Precision (%) against the number of blocks/docs (3, 5, 10, 20, 30); series: Block QE (VIPS), FullDoc QE, Baseline]
A Sample of User Browsing Behavior
Block-level PageRank:
Compute PageRank on the page-to-page graph $W_P = XZ$
BlockRank:
Compute PageRank on the block-to-block graph $W_B = ZX$
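The two graphs are plain matrix products, which the following sketch shows on toy data. The slide does not define X and Z, so the code assumes X is a page-to-block matrix (pages x blocks) and Z is a block-to-page matrix (blocks x pages); the numeric weights are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy setup: 2 pages and 3 blocks. Blocks 0 and 1 sit in page 0, block 2 in page 1.
# X[page][block]: assumed weight of each block within its page.
X = [
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
# Z[block][page]: assumed probability that a block links to a page.
Z = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.5, 0.5],
]

W_P = matmul(X, Z)  # page-to-page graph, 2 x 2
W_B = matmul(Z, X)  # block-to-block graph, 3 x 3
print(W_P)
print(W_B)
```

Either matrix can then be fed to an ordinary PageRank computation: on `W_P` for block-level PageRank over pages, on `W_B` for BlockRank over blocks.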
Using Block-level PageRank to Improve Search
[Figure: Average Precision against the Combining Parameter (0.8 to 1); series: Block-level PageRank (BLPR-Combination), PageRank (PR-Combination)]
Search = $\alpha \cdot$ IR_Score $+ (1 - \alpha) \cdot$ PageRank
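The combination is a linear interpolation between the two scores, weighted by the combining parameter (called `alpha` in this sketch). The example assumes both scores are already normalized to comparable ranges, which the slide does not specify; the page names and score values are made up.

```python
def combined_score(ir_score, pagerank, alpha):
    """Weighted combination of a relevance score and an importance score."""
    return alpha * ir_score + (1 - alpha) * pagerank


def rank(pages, alpha):
    """Sort page names by combined score, best first."""
    return sorted(
        pages,
        key=lambda name: combined_score(pages[name][0], pages[name][1], alpha),
        reverse=True,
    )


# Hypothetical pages: name -> (IR score, PageRank score), both in [0, 1].
pages = {"a": (0.9, 0.1), "b": (0.5, 0.8)}
print(rank(pages, alpha=0.9))
print(rank(pages, alpha=0.2))
```

With `alpha` near 1 the text-relevance score dominates and page `a` ranks first; with a small `alpha` the link-based score dominates and page `b` ranks first.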
Block-to-block graph: $W_B = ZX$
Block-to-image matrix (container relation): $Y$, with $Y_{ij} = 1/s_i$ if $I_j \in b_i$, and $Y_{ij} = 0$ otherwise
Image-to-image graph: $W_I = Y^T W_B Y$
ImageRank
Compute PageRank on the image graph
Image clustering
Graphical partitioning on the image graph
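As a rough illustration of how the image graph is derived from the block graph, the sketch below builds the block-to-image matrix and forms the product Y^T W_B Y in plain Python. It assumes s_i is the number of images contained in block i, which the slide does not state; the block contents and the block-to-block weights are invented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def block_to_image(block_images, n_images):
    """Y[i][j] = 1/s_i if image j is in block i, else 0.

    Assumption: s_i is the number of images contained in block i.
    """
    Y = []
    for images in block_images:
        row = [0.0] * n_images
        for j in images:
            row[j] = 1.0 / len(images)
        Y.append(row)
    return Y


# Toy data: block 0 contains images 0 and 1, block 1 contains image 2.
Y = block_to_image([[0, 1], [2]], n_images=3)
# Hypothetical block-to-block graph: each block links only to the other one.
W_B = [
    [0.0, 1.0],
    [1.0, 0.0],
]
W_I = matmul(matmul(transpose(Y), W_B), Y)  # image-to-image graph, 3 x 3
print(W_I)
```

Images 0 and 1 share a block that links to image 2's block, so they both point to image 2 in `W_I`, and image 2 points back to both of them.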
ImageRank
[Figure panels: Relevance Ranking, Importance Ranking, Combined Ranking]
Query set
45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100
[Figure: plot against alpha (0 to 1)]
Six Categories
Fish
Reptile
Mammal