
Data Mining:

Concepts and Techniques

Chapter 10.5
Web Data

12/08/21 Data Mining: Principles and Algorithms 1


Web data

 The Web is too huge for effective data warehousing and data mining: hundreds of terabytes
 The complexity of web pages is far greater than that of any traditional text document: web pages lack a uniform structure and are not indexed by category, title, author, cover page, table of contents, etc.
 The Web is a highly dynamic information source: information is constantly updated, and linkage information and access records are also updated frequently.
 The Web serves a broad diversity of users, who may have different backgrounds, interests, and usage purposes
 Only a small portion of the information on the Web is relevant or useful: 99% of Web information is useless to 99% of Web users.



Web data

 Issues related to web mining:
 Mining Web page layout structure
 Mining the Web's link structures
 Mining multimedia data on the Web
 Automatic classification of web documents
 Weblog mining



Outline

 Background on Web Search

 VIPS (VIsion-based Page Segmentation)

 Block-based Web Search

 Block-based Link Analysis

 Web Image Search & Clustering



Search Engine – Two Rank Functions

 Relevance Ranking: similarity based on content or text
 Importance Ranking (Link Analysis): ranking based on link structure analysis

[Figure: search engine architecture. Components shown: Web Pages, Web Page Parser, Meta Data, Forward Index, Forward Link, URL Dictionary, Term Dictionary (Lexicon), Indexer, Inverted Index, Anchor Text Generator, Backward Link (Anchor Text), Web Graph Constructor, Web Topology Graph, Rank Functions, Search]

Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- like index in a book

[Figure: documents stored on disks are indexed into an inverted index that maps each term to the IDs of the documents containing it]

aalborg: 3452, 11437, ...
...
arm: 4, 19, 29, 98, 143, ...
armada: 145, 457, 789, ...
armadillo: 678, 2134, 3970, ...
armani: 90, 256, 372, 511, ...
...
zz: 602, 1189, 3209, ...

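The inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not a search engine's actual indexer: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace splitting stands in for a real tokenizer.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the IDs of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy collection (hypothetical document texts).
docs = {
    4: "the arm of the robot",
    19: "arm wrestling",
    145: "the spanish armada",
}
index = build_inverted_index(docs)
print(index["arm"])             # posting list for "arm"
print(search(index, "the arm"))  # documents containing both terms
```

As with the index of a book, a query only touches the posting lists of its own terms rather than scanning every document.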
The PageRank Algorithm
 Basic idea
 The significance of a page is determined by the significance of the pages linking to it
 More precisely:
 Link graph: adjacency matrix $A$, with $A_{ij} = 1$ if page $i$ links to page $j$ and $A_{ij} = 0$ otherwise
 Construct a probability transition matrix $M$ by renormalizing each row of $A$ to sum to 1, and form $\epsilon U + (1 - \epsilon) M$, where $U_{ij} = 1/n$ for all $i, j$
 Treat the web graph as a Markov chain (random surfer)
 The vector of PageRank scores $p$ is then defined to be the stationary distribution of this Markov chain. Equivalently, $p$ is the principal right eigenvector of the transition matrix $(\epsilon U + (1 - \epsilon) M)^T$:

$$(\epsilon U + (1 - \epsilon) M)^T p = p$$

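The stationary distribution can be approximated by power iteration. The sketch below is one possible way to do it under stated assumptions: the function name `pagerank`, the mixing weight `eps=0.15`, the fixed iteration count, and the handling of pages without out-links (treated as linking to every page) are illustrative choices, not values taken from the slides.

```python
def pagerank(adj, eps=0.15, iters=100):
    """Approximate p with (eps*U + (1-eps)*M)^T p = p by power iteration.

    adj[i][j] is 1 if page i links to page j, else 0.
    eps is an assumed mixing weight; U is the uniform matrix with entries 1/n.
    """
    n = len(adj)
    # Renormalize each row of A to sum to 1, giving the transition matrix M.
    # Assumption: a page with no out-links is treated as linking to all pages.
    M = []
    for row in adj:
        total = sum(row)
        M.append([a / total for a in row] if total else [1.0 / n] * n)

    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # New p_j = sum_i (eps/n + (1-eps)*M[i][j]) * p_i
        p = [
            eps / n * sum(p) + (1 - eps) * sum(M[i][j] * p[i] for i in range(n))
            for j in range(n)
        ]
    return p


# Toy graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = pagerank(adj)
print(scores)
```

In this toy graph page 1 receives only half of page 0's vote, so it ends up with the lowest score.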
Layout Structure
 Compared to plain text, a web page is a 2D presentation
 Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
 Different parts of a page are not equally important

Title: CNN.com International


H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD

TEXT BODY (with position and font
type): The International Atomic Energy
Agency has concluded that Iran has
secretly produced small amounts of
nuclear materials including low enriched
uranium and plutonium that could be used
to develop nuclear weapons according to a
confidential report obtained by CNN…
Hyperlink:
• URL: http://www.cnn.com/...
• Anchor Text: Al Qaeda…
Image:
•URL: http://www.cnn.com/image/...
•Alt & Caption: Iran nuclear …

Anchor Text: CNN Homepage News …



Web Page Block—Better Information Unit

Web Page Blocks

Importance = Low

Importance = Med

Importance = High



Motivation for VIPS (VIsion-based
Page Segmentation)
 Problems of treating a web page as an atomic unit
 A web page usually contains more than pure content
 Noise: navigation, decoration, interaction, …
 Multiple topics
 Different parts of a page are not equally important
 A web page has internal structure
 Two-dimensional logical structure and visual layout presentation
 More structured than a free-text document
 Less structured than a structured document
 Layout: the 3rd dimension of a web page
 1st dimension: content
 2nd dimension: hyperlink



Is DOM a Good Representation of Page
Structure?
 Page segmentation using DOM
 Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
 DOM is more related to content display and does not necessarily reflect semantic structure
 How about XML?
 It has a long way to go before it replaces HTML
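DOM-based segmentation by structural tags can be illustrated with Python's standard `html.parser` module. This is only a sketch of the idea: the class name `TagSegmenter`, the choice to attach text to the innermost open structural tag, and the sample HTML are assumptions for the example.

```python
from html.parser import HTMLParser

# Structural tags named on the slide (HTMLParser reports tag names in lowercase).
STRUCTURAL_TAGS = {"p", "table", "ul", "title", "h1", "h2", "h3", "h4", "h5", "h6"}


class TagSegmenter(HTMLParser):
    """Collect the text found under each structural tag as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []  # finished (tag, text) pairs, in closing order
        self._stack = []    # open structural tags, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self._stack.append([tag, []])

    def handle_endtag(self, tag):
        # Close a segment only when the innermost open structural tag matches.
        if self._stack and self._stack[-1][0] == tag:
            name, parts = self._stack.pop()
            text = " ".join(" ".join(parts).split())
            if text:
                self.segments.append((name, text))

    def handle_data(self, data):
        # Text is attached to the innermost open structural tag, if any.
        if self._stack:
            self._stack[-1][1].append(data)


html = (
    "<html><head><title>News</title></head><body>"
    "<h1>Headline</h1><p>Body text.</p>"
    "<ul><li>Home</li><li>Sports</li></ul>"
    "</body></html>"
)
seg = TagSegmenter()
seg.feed(html)
print(seg.segments)
```

The navigation list and the article body come out as equally valid segments, which illustrates the point above: tag structure follows display markup, not meaning.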



VIPS Algorithm
 Motivation:
 In many cases, topics can be distinguished by visual cues such as position, distance, font, color, etc.
 Goal:
 Extract the semantic structure of a web page based on its visual presentation.
 Procedure:
 Partition the web page top-down based on the separators
 Result:
 A tree structure in which each node corresponds to a block in the page.
 Each node is assigned a value (Degree of Coherence) that indicates how coherent the content in the block is, based on visual perception.
 Each block is assigned an importance value
 The structure can be kept as a hierarchy or flattened



VIPS: An Example
[Figure: block tree. Web Page contains VB1, VB2, ...; VB2 contains VB2_1, VB2_2, ...; VB2_2 contains VB2_2_1, VB2_2_2, VB2_2_3, VB2_2_4, ...]

 A hierarchical structure of layout blocks
 A Degree of Coherence (DoC) is defined for each block
 It indicates the internal coherence of the block
 The DoC of a child block must be no less than its parent's
 The Permitted Degree of Coherence (PDoC) can be predefined to achieve different granularities for the content structure
 Segmentation stops only when every block's DoC is no less than the PDoC
 The smaller the PDoC, the coarser the content structure

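The role of the PDoC as a stopping threshold can be shown with a small recursive sketch. It assumes the block tree and the DoC values already exist; real VIPS derives them from visual separators, which is not modeled here. The `Block` class, the `segment` function, and the integer DoC values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    name: str
    doc: int  # Degree of Coherence (hypothetical integer scale)
    children: List["Block"] = field(default_factory=list)


def segment(block, pdoc):
    """Return the blocks at which top-down partitioning stops for a given PDoC."""
    # Stop when the block is coherent enough, or when it cannot be split further.
    if block.doc >= pdoc or not block.children:
        return [block]
    result = []
    for child in block.children:
        result.extend(segment(child, pdoc))
    return result


# Tree shaped like the example above; every child's DoC is >= its parent's.
page = Block("Web Page", 1, [
    Block("VB1", 7),
    Block("VB2", 3, [
        Block("VB2_1", 8),
        Block("VB2_2", 5, [Block("VB2_2_%d" % i, 9) for i in range(1, 5)]),
    ]),
])

coarse = [b.name for b in segment(page, pdoc=3)]
fine = [b.name for b in segment(page, pdoc=6)]
print(coarse)
print(fine)
```

With the smaller PDoC the page is cut into two large blocks; raising it forces VB2 and VB2_2 to be split further, giving a finer content structure.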
Example of Web Page Segmentation (1)

( DOM Structure ) ( VIPS Structure )



Example of Web Page Segmentation (2)

( DOM Structure ) ( VIPS Structure )

 Can be applied to web image retrieval
 Surrounding text extraction



Web Page Block—Better Information Unit

• Page Segmentation: vision-based approach
• Block Importance Modeling: statistical learning

Web Page Blocks

Importance = Low

Importance = Med

Importance = High



Block-based Web Search

 Index blocks instead of whole pages
 Block retrieval
 Combining DocRank and BlockRank
 Block query expansion
 Select expansion terms from relevant blocks



Experiments
 Dataset
 TREC 2001 Web Track
 WT10g corpus (1.69 million pages), crawled in 1997
 50 queries (topics 501-550)
 TREC 2002 Web Track
 .GOV corpus (1.25 million pages), crawled in 2002
 49 queries (topics 551-600)
 Retrieval system
 Okapi, with weighting function BM2500
 Preprocessing
 Stop-word list (about 220 words)
 No stemming
 No phrase information
 Tune b, k1 and k3 to achieve the best baseline



Block Retrieval on TREC 2001 and TREC 2002

[Figure: two panels, "TREC 2001 Result" and "TREC 2002 Result", each plotting Average Precision (%) against the combining parameter (0 to 1) for VIPS (Block Retrieval) and Baseline (Doc Retrieval)]


Query Expansion on TREC 2001 and TREC 2002

[Figure: two panels, "TREC 2001 Result" and "TREC 2002 Result", each plotting Average Precision (%) against the number of blocks/docs (3, 5, 10, 20, 30) for Block QE (VIPS), FullDoc QE, and Baseline]


Block-level Link Analysis

[Figure: diagram with elements labeled A, B, and C]

A Sample of User Browsing Behavior



Improving PageRank using Layout Structure

 Z: block-to-page matrix (link structure)

$$Z_{bp} = \begin{cases} 1/s_b & \text{if there is a link from the } b\text{th block to the } p\text{th page} \\ 0 & \text{otherwise} \end{cases}$$

 X: page-to-block matrix (layout structure)

$$X_{pb} = \begin{cases} f_p(b) & \text{if the } b\text{th block is in the } p\text{th page} \\ 0 & \text{otherwise} \end{cases}$$

where $f$ is the block importance function

 Block-level PageRank: $W_P = XZ$
 Compute PageRank on the page-to-page graph
 BlockRank: $W_B = ZX$
 Compute PageRank on the block-to-block graph

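A tiny worked example may help check the matrix shapes: Z is blocks × pages, X is pages × blocks, so XZ is page-to-page and ZX is block-to-block. The sketch below makes two assumptions that the slide does not state: s_b is taken to be the number of pages that block b links to, and the importance values f_p(b) of the blocks in one page are taken to sum to 1. The numbers and the helper `matmul` are illustrative only.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy web: 2 pages, 3 blocks. Blocks 0 and 1 sit in page 0, block 2 in page 1.

# X[p][b] = f_p(b): assumed importance of block b within page p (rows sum to 1).
X = [
    [0.75, 0.25, 0.0],
    [0.0, 0.0, 1.0],
]

# Z[b][p] = 1/s_b if block b links to page p (s_b assumed = number of linked pages).
Z = [
    [0.0, 1.0],  # block 0 links to page 1 only
    [0.5, 0.5],  # block 1 links to pages 0 and 1
    [1.0, 0.0],  # block 2 links to page 0 only
]

W_P = matmul(X, Z)  # page-to-page graph, 2 x 2
W_B = matmul(Z, X)  # block-to-block graph, 3 x 3
print(W_P)
print(W_B)
```

Under these assumptions every row of both products sums to 1, so each can be used directly as a transition matrix for the PageRank computation.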
Using Block-level PageRank to Improve Search

[Figure: Average Precision against the combining parameter $\alpha$ (0.8 to 1) for BLPR-Combination (Block-level PageRank) and PR-Combination (PageRank)]

Search = $\alpha \cdot$ IR_Score + $(1 - \alpha) \cdot$ PageRank

Block-level PageRank achieves a 15-25% improvement over PageRank (SIGIR'04)

Mining Web Images Using Layout &
Link Structure (ACMMM’04)



Image Graph Model & Spectral Analysis

 Block-to-block graph: $W_B = ZX$
 Block-to-image matrix (container relation): $Y$

$$Y_{ij} = \begin{cases} 1/s_i & \text{if } I_j \in b_i \\ 0 & \text{otherwise} \end{cases}$$

 Image-to-image graph: $W_I = Y^T W_B Y$
 ImageRank
 Compute PageRank on the image graph
 Image clustering
 Graph partitioning on the image graph

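The image graph can be computed with the same kind of plain matrix arithmetic. In this sketch Y is blocks × images, so Y^T W_B Y is image-to-image. It assumes that s_i is the number of images contained in block i, which the slide does not define, and it uses a made-up block graph `W_B`; `matmul` and `transpose` are helper names introduced for the example.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Hypothetical block-to-block graph for 3 blocks.
W_B = [
    [0.0, 0.0, 1.0],
    [0.375, 0.125, 0.5],
    [0.75, 0.25, 0.0],
]

# Y[i][j] = 1/s_i if image j is contained in block i (s_i assumed = images in block i).
# Block 0 holds image 0, block 1 holds no image, block 2 holds images 1 and 2.
Y = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]

W_I = matmul(transpose(Y), matmul(W_B, Y))  # image-to-image graph, 3 x 3
print(W_I)
```

Images 1 and 2 share a block, so they receive identical rows in the image graph; a ranking or partitioning step would then run on `W_I`.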
ImageRank
 Relevance Ranking  Importance Ranking  Combined Ranking



ImageRank vs. PageRank
 Dataset
 26.5 million web pages
 11.6 million images
 Query set
 45 hot queries in Google image search statistics
 Ground truth
 Five volunteers were chosen to evaluate the top 100 results returned by the system (iFind)
 Ranking method

$$s(x) = \alpha \cdot \mathrm{rank}_{\mathrm{importance}}(x) + (1 - \alpha) \cdot \mathrm{rank}_{\mathrm{relevance}}(x)$$
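The ranking method is a linear combination of two ranks, which can be sketched as follows. The example assumes that both inputs are rank positions where 1 is best, so a smaller combined value means a better result; the slides do not say whether ranks or scores are used. The function name `combined_rank`, the default `alpha`, and the image names are hypothetical.

```python
def combined_rank(rank_importance, rank_relevance, alpha=0.25):
    """s(x) = alpha * rank_importance(x) + (1 - alpha) * rank_relevance(x)."""
    return {
        x: alpha * rank_importance[x] + (1 - alpha) * rank_relevance[x]
        for x in rank_relevance
    }


# Hypothetical rank positions (1 = best) for three images.
rank_importance = {"a.jpg": 1, "b.jpg": 2, "c.jpg": 3}
rank_relevance = {"a.jpg": 3, "b.jpg": 1, "c.jpg": 2}

s = combined_rank(rank_importance, rank_relevance, alpha=0.25)
# Assumption: lower combined rank is better, so sort in ascending order.
ordering = sorted(s, key=s.get)
print(s)
print(ordering)
```

With `alpha` at 0 the ordering is purely by relevance and with `alpha` at 1 purely by importance; intermediate values trade the two off.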



ImageRank vs PageRank
[Figure: image search accuracy (ImageRank vs. PageRank), plotting P@10 against alpha (0 to 1)]

 Image search accuracy using ImageRank and PageRank. Both achieved their best results at $\alpha = 0.25$.



Example on Image Clustering & Embedding
1,710 JPG images in 1,287 pages were crawled from the website http://www.yahooligans.com/content/animals/

Six categories: Fish, Reptile, Mammal, Bird, Amphibian, Insect
