
Mahender K <mahender.k@gmail.com>

Tools for finding information on the Web


Problem: hidden databases, e.g. Times of India (i.e., keyword databases hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)

Search engine
A machine-constructed index (usually by keyword)

So many search engines, we need search engines to find them.

Search engines: key tools for ecommerce


Buyers and sellers must find each other

How do they work?

How much do they index?


Are they reliable? How are hits ordered? Can the order be changed?

Overall goal: locate web documents containing a specified keyword.
Input: keyword
Output: set of links

Naive approach: crawl the web, looking at each page for the keyword; follow each link to find more pages to search.
Problems
Non-terminating: walking in circles?
Inefficient: walk the web for every search?
Page interpretation: match HTML tags?

Walk the web once and build a database.
Problem: staleness. How often to walk the ever-changing web?
Approaches: periodic rebuilds of the database; specialization; accept limited staleness.

Problems: how to ignore HTML tags? How to capture words? How to capture links? How to capture images?
Idea: use a parser (tokenizer).


Problem: how to ignore HTML tags?
Issue: need to extract links.
Problem: how to capture words?
Idea #3: use a parser (tokenizer).
Parse1: HTML-page -> set-of-words
Parse2: HTML-page -> set-of-links
Idea #4: parse once.
Parse: HTML-page -> set-of-words & set-of-links
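A minimal sketch of the parse-once idea in Java. The names are illustrative only, and regular expressions stand in for the course's finite-state-machine tokenizer; one parse() call returns both the word set and the link set.

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "Parse once": one call over an HTML page yields both
// set-of-words and set-of-links. Regexes stand in for the FSM tokenizer.
public class OnePassParser {

    public static class ParseResult {
        public final Set<String> words = new HashSet<>();
        public final Set<String> links = new HashSet<>();
    }

    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static ParseResult parse(String html) {
        ParseResult result = new ParseResult();

        // Capture links: every href="..." attribute value.
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.links.add(m.group(1));
        }

        // Ignore HTML tags, then capture the remaining words.
        String text = TAG.matcher(html).replaceAll(" ");
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                result.words.add(word);
            }
        }
        return result;
    }
}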

1. Acquire the collection, i.e. all the documents [off-line process]
2. Create an inverted index [off-line process]
3. Match queries to documents [on-line process, the actual retrieval]
4. Present the results to the user [on-line process: display, summarize, ...]

Spider: crawls the web to find pages; follows hyperlinks; never stops.
Indexer: produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon).
Retriever: query interface; database lookup to find hits.

1 billion documents
1 TB RAM, many terabytes of disk
Ranking
Thousands of servers
(WOW!)

Web site traffic grows over 20% per month

Spiders and indexes over 17 billion URLs


Supports many languages and is used in many countries

Over 283 million searches per day


Even we use it!

Start with an initial page P0. Find URLs on P0 and add them to a queue. When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat (see the sketch below). Can be specialized (e.g. only look for email addresses).
Issues
Which page to look at next?
Avoid overloading a site
How deep within a site do you go (depth of search)?
How frequently to visit pages?
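A breadth-first spider sketch in Java, built on the OnePassParser above. fetch() and index() are hypothetical stubs for the HTTP download and the indexing program; politeness and depth limits are only noted in comments.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Breadth-first spider sketch. The queue decides which page to look at
// next; the visited set prevents walking in circles.
public class SimpleSpider {

    public void crawl(String p0, int maxPages) {
        Queue<String> queue = new ArrayDeque<>(); // FIFO = breadth-first
        Set<String> visited = new HashSet<>();    // avoids crawling the same page twice
        queue.add(p0);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                         // already crawled this page
            }
            String html = fetch(url);             // politeness delays / per-site limits would go here
            index(url, html);                     // pass the page to the indexing program
            for (String link : OnePassParser.parse(html).links) {
                if (!visited.contains(link)) {
                    queue.add(link);              // found a new URL: add it to the queue
                }
            }
        }
    }

    private String fetch(String url) { return ""; }   // stub: HTTP GET omitted
    private void index(String url, String html) { }   // stub: hand off to the indexer
}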

Arrangement of data (a data structure) to permit fast searching.
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?

Permits binary search: about log2(n) probes into the list; log2(1 billion) ~ 30.
Permits interpolation search: about log2(log2(n)) probes; log2(log2(1 billion)) ~ 5.
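A binary search sketch over the sorted word list above, counting probes to illustrate the log2(n) behaviour (roughly 30 probes even for a billion entries).

// Binary search over a sorted word list, with a probe counter.
public class SortedLookup {

    public static int binarySearch(String[] sorted, String word) {
        int lo = 0, hi = sorted.length - 1, probes = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            int cmp = sorted[mid].compareTo(word);
            if (cmp == 0) {
                System.out.println(probes + " probes");
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;                    // keyword is in the upper half
            } else {
                hi = mid - 1;                    // keyword is in the lower half
            }
        }
        System.out.println(probes + " probes");
        return -1;                               // not found
    }

    public static void main(String[] args) {
        String[] sorted = { "ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak" };
        System.out.println(binarySearch(sorted, "hog"));   // about log2(10), i.e. 3-4 probes
    }
}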

FILE: a file is a list of words by position.
- First entry is the word in position 1 (the first word)
- Entry 4562 is the word in position 4562 (the 4562nd word)
- Last entry is the last word

An inverted file is a list of positions by word!

INVERTED FILE (word -> positions):
a (1, 4, 40)  entry (11, 20, 31)  file (2, 38)  list (5, 41)  position (9, 16, 26)  positions (44)  word (14, 19, 24, 29, 35, 45)  words (7)  4562 (21, 27)
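A small in-memory sketch of that inverted file (word -> positions) for a single document, assuming a simple punctuation-splitting tokenizer rather than the course's FSM.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Builds the word -> positions list for one document (1-based positions).
public class InvertedFileDemo {

    public static TreeMap<String, List<Integer>> invert(String text) {
        TreeMap<String, List<Integer>> inverted = new TreeMap<>();
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            inverted.computeIfAbsent(tokens[pos], w -> new ArrayList<>()).add(pos + 1);
        }
        return inverted;
    }

    public static void main(String[] args) {
        System.out.println(invert("A file is a list of words by position"));
        // prints, e.g., a=[1, 4], file=[2], list=[5], ...
    }
}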

LEXICON (WORD, NDOCS, PTR) pointing into the INVERTED FILE (DOCID, OCCUR, POS 1, POS 2, ...):

WORD         NDOCS   PTR
jezebel         20   -> postings shown below
jezer            3   ...
jezerit          1   ...
jeziah           1   ...
jeziel           1   ...
jezliah          1   ...
jezoar           1   ...
jezrahliah       1   ...
jezreel         39   ...

Each PTR points into the inverted file, where a posting holds DOCID, OCCUR (occurrences in that document) and the word positions. For example, jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, ...:

DOCID   OCCUR   POS 1   POS 2   ...
   34       6       1     118   ...
   44       3     215    2291   ...
   56       4       5      22   ...
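A sketch of the lexicon-plus-postings structure in Java. The field names follow the table above; everything is kept in memory for simplicity, whereas a real engine stores the postings on disk and keeps only the lexicon in RAM.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lexicon + postings: each word maps to a list of postings, one per
// document, holding DOCID, OCCUR and the word positions.
public class PostingsIndex {

    public static class Posting {
        public final int docId;                                    // DOCID
        public final List<Integer> positions = new ArrayList<>();  // POS 1, POS 2, ...

        Posting(int docId) { this.docId = docId; }

        public int occur() { return positions.size(); }            // OCCUR
    }

    // Word -> postings. NDOCS is the list's size; PTR is the reference itself.
    private final Map<String, List<Posting>> lexicon = new HashMap<>();

    public void addWord(String word, int docId, int position) {
        List<Posting> postings = lexicon.computeIfAbsent(word, w -> new ArrayList<>());
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || last.docId != docId) {                 // new document for this word
            last = new Posting(docId);
            postings.add(last);
        }
        last.positions.add(position);
    }

    public List<Posting> lookup(String word) {
        return lexicon.getOrDefault(word, new ArrayList<>());
    }
}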

Hits must be presented in some order


What order?
Relevance, popularity, reliability?

Some ranking methods
Presence of keywords in the title of the document
Closeness of keywords to the start of the document
Frequency of the keyword in the document
Link popularity (how many pages point to this one)
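One way those signals could be combined into a single score, as a hedged sketch: the weights and parameter names are invented for illustration and are not what any real engine uses.

// Combines the ranking signals listed above with arbitrary weights.
public class SimpleRanker {

    public static double score(boolean keywordInTitle,
                               int firstOccurrencePosition,  // word position of the first hit
                               int keywordFrequency,         // occurrences of the keyword in the document
                               int inboundLinks) {           // pages pointing to this one
        double s = 0.0;
        if (keywordInTitle) {
            s += 10.0;                                       // presence of the keyword in the title
        }
        s += 5.0 / (1 + firstOccurrencePosition);            // closeness to the start of the document
        s += Math.log(1 + keywordFrequency);                 // frequency, with diminishing returns
        s += Math.log(1 + inboundLinks);                     // link popularity
        return s;
    }
}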

Search engine for any web site, not for the entire web; results can be confined to a single web site.

Example index entries (URL :: occurrence count) for keywords such as India, ManMohan, Cricket, Bollywood, Sharukh, Sachin:
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
...

Search Engine architecture: Crawl, Index, Search.
The Spider crawls TheWeb and hands each page to the Parser (parse). Parsed links go to the URLList (addUrls / getNextUrl) to drive further crawling; parsed pages go to the Indexer (addPage), which stores into the Index. On a Query, hits are retrieved from the Index into a ResultSet, sorted by rank, and turned into the ResultPage / FinalResult (makePage).
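The retrieve path of that diagram, sketched against the PostingsIndex class above. The Query, ResultSet and makePage steps are only suggested here by a word argument, a sorted list, and formatted strings; they are not the course's actual classes.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Query -> ResultSet sorted by rank (here rank = occurrence count),
// formatted as lines a makePage step could render into the ResultPage.
public class Retriever {

    public static List<String> search(PostingsIndex index, String word) {
        return index.lookup(word).stream()
                .sorted(Comparator.comparingInt(PostingsIndex.Posting::occur).reversed())
                .map(p -> "doc " + p.docId + " :: " + p.occur() + " occurrences")
                .collect(Collectors.toList());
    }
}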

The same architecture annotated with the data structures behind each component: the Spider uses a Queue / Priority Queue; the Parser uses Finite State Machines; the URLList uses a Hashtable, BinaryTree, or LinkedList; the Index uses an AVLTree; sorting the ResultSet by rank uses MergeSort and InsertionSort.

Class diagram (relationships: Inheritance, Uses, Calls): drivers SearchDriver, CrawlerDriver, DictionaryDriver; crawling classes Spider, WebSpider, Queue; parsing classes PageLexer, HttpTokenizer, URLTextReader, and PageElement with PageImg, PageHref, PageWord; indexing classes Indexer and Index (Save / Restore); dictionary classes DictionaryInterface with ListDictionary, TreeDictionary, HashDictionary; and Query.

Week 3
Tokenizer (using an FSM)
Crawling rules: breadth-first spider, priority-based spider
Indexing: keywords with their occurrence frequencies and the URLs they appear on
Persistence: saving the index to disk
Simple search; sorting based on rank

Week 4
Set data structures allowing Boolean search (AND, OR); see the sketch below
Client/server architecture: client developed using Swing, multi-threaded server
Performance analysis; final demo
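A sketch of Boolean search over sets of document IDs (AND = intersection, OR = union), assuming each keyword already maps to the set of documents containing it, e.g. via the lexicon.

import java.util.HashSet;
import java.util.Set;

// Boolean search over sets of document IDs.
public class BooleanSearch {

    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.retainAll(b);                     // keep only documents present in both sets
        return result;
    }

    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.addAll(b);                        // documents present in either set
        return result;
    }
}

For example, and(docsFor("cricket"), docsFor("sachin")) would return the documents matching both keywords, where docsFor is whatever per-keyword lookup the index provides (a hypothetical helper, not defined above).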

Thinking about the future


How does Moore's Law help us? Compute time, RAM space, disk space.
How fast is the web growing?
How can we make our algorithms and data structures more clever?
What new features will our customers want? Targeted advertising, site-specific search.
