
Mahender K <mahender.k@gmail.com>

Tools for finding information on the Web


Problem: hidden databases, e.g. Times of India (i.e., keyword databases hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)

Search engine
A machine-constructed index (usually by keyword)

So many search engines, we need search engines to find them.

Search engines: key tools for ecommerce


Buyers and sellers must find each other

How do they work?

How much do they index?


Are they reliable? How are hits ordered? Can the order be changed?

Overall goal: locate web documents containing a specified keyword.
Input: keyword
Output: set of links

Naive approach: crawl the web, looking at each page for the keyword; follow each link to find more pages to search.
Problems
Non-terminating: walking in circles?
Inefficient: walk the web for every search?
Page interpretation: match HTML tags?

Walk the web once and build a database.
Problem: staleness. How often to walk the ever-changing web?
Approaches: periodic rebuilds of the database; specialization; accept limited staleness.

Problems: how to ignore HTML tags? How to capture words? How to capture links? How to capture images?
Idea: use a parser (tokenizer).


Problem: how to ignore HTML tags?
Issue: need to extract links.
Problem: how to capture words?
Idea #3: use a parser (tokenizer).
Parse1: HTML-page -> set-of-words
Parse2: HTML-page -> set-of-links
Idea #4: parse once.
Parse: HTML-page -> set-of-words & set-of-links
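A minimal sketch of the parse-once idea in Java. The names are illustrative only, and regular expressions stand in for the course's finite-state-machine tokenizer; one parse() call returns both the word set and the link set.

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "Parse once": one call over an HTML page yields both
// set-of-words and set-of-links. Regexes stand in for the FSM tokenizer.
public class OnePassParser {

    public static class ParseResult {
        public final Set<String> words = new HashSet<>();
        public final Set<String> links = new HashSet<>();
    }

    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static ParseResult parse(String html) {
        ParseResult result = new ParseResult();

        // Capture links: every href="..." attribute value.
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.links.add(m.group(1));
        }

        // Ignore HTML tags, then capture the remaining words.
        String text = TAG.matcher(html).replaceAll(" ");
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                result.words.add(word);
            }
        }
        return result;
    }
}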

1. Acquire the collection, i.e. all the documents [off-line process]
2. Create an inverted index [off-line process]
3. Match queries to documents [on-line process, the actual retrieval]
4. Present the results to the user [on-line process: display, summarize, ...]

Spider: crawls the web to find pages; follows hyperlinks; never stops.
Indexer: produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon).
Retriever: query interface; database lookup to find hits.

1 billion documents
1 TB RAM, many terabytes of disk
Ranking
Thousands of servers
(WOW!)

Web site traffic grows over 20% per month

Spiders and indexes over 17 billion URLs


Supports many languages and is used in many countries

Over 283 million searches per day


Even we use it!

Start with an initial page P0. Find URLs on P0 and add them to a queue. When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat (see the sketch below). Can be specialized (e.g. only look for email addresses).
Issues
Which page to look at next?
Avoid overloading a site
How deep within a site do you go (depth of search)?
How frequently to visit pages?
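A breadth-first spider sketch in Java, built on the OnePassParser above. fetch() and index() are hypothetical stubs for the HTTP download and the indexing program; politeness and depth limits are only noted in comments.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Breadth-first spider sketch. The queue decides which page to look at
// next; the visited set prevents walking in circles.
public class SimpleSpider {

    public void crawl(String p0, int maxPages) {
        Queue<String> queue = new ArrayDeque<>(); // FIFO = breadth-first
        Set<String> visited = new HashSet<>();    // avoids crawling the same page twice
        queue.add(p0);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                         // already crawled this page
            }
            String html = fetch(url);             // politeness delays / per-site limits would go here
            index(url, html);                     // pass the page to the indexing program
            for (String link : OnePassParser.parse(html).links) {
                if (!visited.contains(link)) {
                    queue.add(link);              // found a new URL: add it to the queue
                }
            }
        }
    }

    private String fetch(String url) { return ""; }   // stub: HTTP GET omitted
    private void index(String url, String html) { }   // stub: hand off to the indexer
}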

Arrangement of data (a data structure) to permit fast searching.
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?

Permits binary search: about log2(n) probes into the list; log2(1 billion) ~ 30.
Permits interpolation search: about log2(log2(n)) probes; log2(log2(1 billion)) ~ 5.
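A binary search sketch over the sorted word list above, counting probes to illustrate the log2(n) behaviour (roughly 30 probes even for a billion entries).

// Binary search over a sorted word list, with a probe counter.
public class SortedLookup {

    public static int binarySearch(String[] sorted, String word) {
        int lo = 0, hi = sorted.length - 1, probes = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            int cmp = sorted[mid].compareTo(word);
            if (cmp == 0) {
                System.out.println(probes + " probes");
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;                    // keyword is in the upper half
            } else {
                hi = mid - 1;                    // keyword is in the lower half
            }
        }
        System.out.println(probes + " probes");
        return -1;                               // not found
    }

    public static void main(String[] args) {
        String[] sorted = { "ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak" };
        System.out.println(binarySearch(sorted, "hog"));   // about log2(10), i.e. 3-4 probes
    }
}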

FILE: a file is a list of words by position.
- First entry is the word in position 1 (the first word)
- Entry 4562 is the word in position 4562 (the 4562nd word)
- Last entry is the last word

An inverted file is a list of positions by word!

INVERTED FILE (word -> positions):
a (1, 4, 40)  entry (11, 20, 31)  file (2, 38)  list (5, 41)  position (9, 16, 26)  positions (44)  word (14, 19, 24, 29, 35, 45)  words (7)  4562 (21, 27)
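A small in-memory sketch of that inverted file (word -> positions) for a single document, assuming a simple punctuation-splitting tokenizer rather than the course's FSM.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Builds the word -> positions list for one document (1-based positions).
public class InvertedFileDemo {

    public static TreeMap<String, List<Integer>> invert(String text) {
        TreeMap<String, List<Integer>> inverted = new TreeMap<>();
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            inverted.computeIfAbsent(tokens[pos], w -> new ArrayList<>()).add(pos + 1);
        }
        return inverted;
    }

    public static void main(String[] args) {
        System.out.println(invert("A file is a list of words by position"));
        // prints, e.g., a=[1, 4], file=[2], list=[5], ...
    }
}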

LEXICON (WORD, NDOCS, PTR) pointing into the INVERTED FILE (DOCID, OCCUR, POS 1, POS 2, ...):

WORD         NDOCS   PTR
jezebel         20   -> postings shown below
jezer            3   ...
jezerit          1   ...
jeziah           1   ...
jeziel           1   ...
jezliah          1   ...
jezoar           1   ...
jezrahliah       1   ...
jezreel         39   ...

Each PTR points into the inverted file, where a posting holds DOCID, OCCUR (occurrences in that document) and the word positions. For example, jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, ...:

DOCID   OCCUR   POS 1   POS 2   ...
   34       6       1     118   ...
   44       3     215    2291   ...
   56       4       5      22   ...
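A sketch of the lexicon-plus-postings structure in Java. The field names follow the table above; everything is kept in memory for simplicity, whereas a real engine stores the postings on disk and keeps only the lexicon in RAM.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lexicon + postings: each word maps to a list of postings, one per
// document, holding DOCID, OCCUR and the word positions.
public class PostingsIndex {

    public static class Posting {
        public final int docId;                                    // DOCID
        public final List<Integer> positions = new ArrayList<>();  // POS 1, POS 2, ...

        Posting(int docId) { this.docId = docId; }

        public int occur() { return positions.size(); }            // OCCUR
    }

    // Word -> postings. NDOCS is the list's size; PTR is the reference itself.
    private final Map<String, List<Posting>> lexicon = new HashMap<>();

    public void addWord(String word, int docId, int position) {
        List<Posting> postings = lexicon.computeIfAbsent(word, w -> new ArrayList<>());
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || last.docId != docId) {                 // new document for this word
            last = new Posting(docId);
            postings.add(last);
        }
        last.positions.add(position);
    }

    public List<Posting> lookup(String word) {
        return lexicon.getOrDefault(word, new ArrayList<>());
    }
}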

Hits must be presented in some order


What order?
Relevance, popularity, reliability?

Some ranking methods
Presence of keywords in the title of the document
Closeness of keywords to the start of the document
Frequency of the keyword in the document
Link popularity (how many pages point to this one)
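One way those signals could be combined into a single score, as a hedged sketch: the weights and parameter names are invented for illustration and are not what any real engine uses.

// Combines the ranking signals listed above with arbitrary weights.
public class SimpleRanker {

    public static double score(boolean keywordInTitle,
                               int firstOccurrencePosition,  // word position of the first hit
                               int keywordFrequency,         // occurrences of the keyword in the document
                               int inboundLinks) {           // pages pointing to this one
        double s = 0.0;
        if (keywordInTitle) {
            s += 10.0;                                       // presence of the keyword in the title
        }
        s += 5.0 / (1 + firstOccurrencePosition);            // closeness to the start of the document
        s += Math.log(1 + keywordFrequency);                 // frequency, with diminishing returns
        s += Math.log(1 + inboundLinks);                     // link popularity
        return s;
    }
}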

Search engine for any web site, not for the entire web; results can be confined to a single web site.

Example index entries (URL :: occurrence count) for keywords such as India, ManMohan, Cricket, Bollywood, Sharukh, Sachin:
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
...

Search Engine architecture: Crawl, Index, Search.
The Spider crawls TheWeb and hands each page to the Parser (parse). Parsed links go to the URLList (addUrls / getNextUrl) to drive further crawling; parsed pages go to the Indexer (addPage), which stores into the Index. On a Query, hits are retrieved from the Index into a ResultSet, sorted by rank, and turned into the ResultPage / FinalResult (makePage).
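The retrieve path of that diagram, sketched against the PostingsIndex class above. The Query, ResultSet and makePage steps are only suggested here by a word argument, a sorted list, and formatted strings; they are not the course's actual classes.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Query -> ResultSet sorted by rank (here rank = occurrence count),
// formatted as lines a makePage step could render into the ResultPage.
public class Retriever {

    public static List<String> search(PostingsIndex index, String word) {
        return index.lookup(word).stream()
                .sorted(Comparator.comparingInt(PostingsIndex.Posting::occur).reversed())
                .map(p -> "doc " + p.docId + " :: " + p.occur() + " occurrences")
                .collect(Collectors.toList());
    }
}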

The same architecture annotated with the data structures behind each component: the Spider uses a Queue / Priority Queue; the Parser uses Finite State Machines; the URLList uses a Hashtable, BinaryTree, or LinkedList; the Index uses an AVLTree; sorting the ResultSet by rank uses MergeSort and InsertionSort.

Class diagram (relationships: Inheritance, Uses, Calls): drivers SearchDriver, CrawlerDriver, DictionaryDriver; crawling classes Spider, WebSpider, Queue; parsing classes PageLexer, HttpTokenizer, URLTextReader, and PageElement with PageImg, PageHref, PageWord; indexing classes Indexer and Index (Save / Restore); dictionary classes DictionaryInterface with ListDictionary, TreeDictionary, HashDictionary; and Query.

Week 3
Tokenizer (using an FSM)
Crawling rules: breadth-first spider, priority-based spider
Indexing: keywords with their occurrence frequencies and the URLs they appear on
Persistence: saving the index to disk
Simple search; sorting based on rank

Week 4
Set data structures allowing Boolean search (AND, OR); see the sketch below
Client/server architecture: client developed using Swing, multi-threaded server
Performance analysis; final demo
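A sketch of Boolean search over sets of document IDs (AND = intersection, OR = union), assuming each keyword already maps to the set of documents containing it, e.g. via the lexicon.

import java.util.HashSet;
import java.util.Set;

// Boolean search over sets of document IDs.
public class BooleanSearch {

    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.retainAll(b);                     // keep only documents present in both sets
        return result;
    }

    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.addAll(b);                        // documents present in either set
        return result;
    }
}

For example, and(docsFor("cricket"), docsFor("sachin")) would return the documents matching both keywords, where docsFor is whatever per-keyword lookup the index provides (a hypothetical helper, not defined above).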

Thinking about the future


How does Moore's Law help us? Compute time, RAM space, disk space.
How fast is the web growing?
How can we make our algorithms and data structures more clever?
What new features will our customers want? Targeted advertising, site-specific search.
