com
India (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google, etc.)
Search engine
A machine-constructed index (usually by keyword)
Overall goal: Locate web documents containing a specified keyword.
Input: Keyword
Output: Set of links
Crawl the web and look at each page for the keyword. Follow each link to find more pages to search.
Problems
- Non-terminating: walking in circles?
- Inefficient: walk the web for every search?
- Page interpretation: match HTML tags?
Problem: How to ignore HTML tags?
Problem: How to capture words?
Problem: How to capture links?
Problem: How to capture images?
Idea: Use a parser (tokenizer)
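The parser idea can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class name `PageTokenizer` and the helper `tokenize` are hypothetical, and the sketch does not skip text inside `<script>` or `<style>` elements.

```python
import re
from html.parser import HTMLParser


class PageTokenizer(HTMLParser):
    """Collects words, link targets, and image sources from one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tags themselves are never stored as words; only href/src are kept.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Text between tags is split into lowercase alphanumeric words.
        self.words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))


def tokenize(html):
    """Return (words, links, images) for an HTML string."""
    parser = PageTokenizer()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links, parser.images


# Hypothetical page used only for illustration.
sample = (
    "<html><body><p>Hello <b>web</b> spider</p>"
    '<a href="http://example.com/a.html">Next page</a>'
    '<img src="logo.png"></body></html>'
)
words, links, images = tokenize(sample)
```

Here the tags are ignored, the visible text becomes the word list, and the `href` and `src` attributes become the link and image lists.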
1.
2. Create an inverted index [off-line process]
3. Match queries to documents [on-line process, the actual retrieval]
Spider: Crawls the web to find pages. Follows hyperlinks. Never stops.
Indexer: Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon).
Retriever: Query interface. Database lookup to find hits.
1 billion documents
1 TB RAM, many terabytes of disk Ranking
Thousands of servers
(WOW!)
Start with an initial page P0. Find URLs on P0 and add them to a queue.
When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat.
Can be specialized (e.g., only look for email addresses).
Issues
- Which page to look at next?
- Avoid overloading a site
- How deep within a site do you go (depth search)?
- How frequently to visit pages?
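The queue-based loop above can be sketched as a breadth-first crawl. To stay self-contained, this sketch assumes the web is given as an in-memory dictionary from URL to outgoing links rather than fetched over the network; the function name `crawl`, the `max_depth` parameter, and the four-page example are all hypothetical.

```python
from collections import deque


def crawl(web, start, max_depth=2):
    """Breadth-first crawl over `web`, a dict mapping URL -> list of linked URLs."""
    queue = deque([(start, 0)])
    seen = {start}          # guards against walking in circles
    visited_order = []      # pages in the order they would go to the indexer
    while queue:
        url, depth = queue.popleft()
        visited_order.append(url)
        if depth == max_depth:
            continue        # do not follow links beyond the depth limit
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited_order


# Hypothetical four-page web with a cycle (P1 links back to P0).
web = {
    "P0": ["P1", "P2"],
    "P1": ["P0", "P3"],
    "P2": ["P3"],
    "P3": [],
}
order = crawl(web, "P0")
```

The `seen` set addresses the non-termination problem, and `max_depth` is one simple answer to the depth question. Politeness (not overloading a site) and revisit frequency are not handled here.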
Arrangement of data (data structure) to permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
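One common answer to the question above, though the slides do not state it, is that a sorted list allows binary search. The sketch below counts comparisons for both lists; the function names and the step counters are illustrative assumptions.

```python
def linear_search(items, target):
    """Scan left to right; return (index or -1, number of items examined)."""
    steps = 0
    for i, item in enumerate(items):
        steps += 1
        if item == target:
            return i, steps
    return -1, steps


def binary_search(sorted_items, target):
    """Halve the search range each step; return (index or -1, steps taken)."""
    lo, hi, steps = 0, len(sorted_items), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_items) and sorted_items[lo] == target
    return (lo if found else -1), steps


unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = "ant cat dog eel fox hen hog pig sow yak".split()

linear_result = linear_search(unsorted_words, "hog")
binary_result = binary_search(sorted_words, "hog")
```

On these ten words the linear scan examines every entry to find `hog`, while the binary search needs at most four halving steps; the gap widens as the list grows.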
FILE
A file is a list of words by position
- First entry is the word in position 1 (first word)

INVERTED FILE
a          (1, 4, 40)
entry      (11, 20, 31)
file       (2, 38)
list       (5, 41)
position   (9, 16, 26)
positions  (44)
word       (14, 19, 24, 29, 35, 45)
words      (7)
4562       (21, 27)
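The step from a file to an inverted file can be sketched in a few lines. The function name `invert` is hypothetical, and only the first sentence of the slide text is used as input, so the position lists are shorter than those in the table above.

```python
import re
from collections import defaultdict


def invert(text):
    """Map each lowercase word to the list of 1-based positions where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        index[word].append(pos)
    return dict(index)


inverted = invert("A file is a list of words by position")
```

For this sentence, `a` maps to positions 1 and 4, `file` to 2, `list` to 5, `words` to 7, and `position` to 9, which agrees with the first entries of the table.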
LEXICON (columns: WORD, NDOCS, PTR)
WORD        NDOCS
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX (columns: DOCID, OCCUR, POS 1, POS 2, ...)
jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
DOCID  OCCUR
34     6
44     3
56     4
...
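A minimal sketch of how a lexicon entry could point into a word index, assuming PTR is an offset into one flat list of postings. The functions `build` and `lookup` are hypothetical, only the three jezebel postings shown above are included (so NDOCS is 3 here rather than 20), positions are omitted, and the `jezer` posting is invented for illustration.

```python
def build(postings_by_word):
    """Return (lexicon, word_index).

    lexicon maps word -> (NDOCS, PTR); word_index is a flat list of
    (DOCID, OCCUR) pairs, and PTR is the offset of a word's first pair.
    """
    lexicon, word_index = {}, []
    for word in sorted(postings_by_word):
        postings = sorted(postings_by_word[word])
        lexicon[word] = (len(postings), len(word_index))
        word_index.extend(postings)
    return lexicon, word_index


def lookup(lexicon, word_index, word):
    """Return the (DOCID, OCCUR) pairs for `word`, or [] if it is unknown."""
    if word not in lexicon:
        return []
    ndocs, ptr = lexicon[word]
    return word_index[ptr:ptr + ndocs]


lexicon, word_index = build({
    "jezebel": [(34, 6), (44, 3), (56, 4)],  # from the slide
    "jezer": [(7, 1)],                        # hypothetical posting
})
hits = lookup(lexicon, word_index, "jezebel")
```

Keeping the lexicon small and sorted lets it be searched quickly, while the bulkier postings are read only for the words that a query actually contains.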
Search engine for any website
- Not for the entire web
- Results can be confined to only one web site
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
..
..
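Confining results to one site can be sketched as a filter on the host part of each URL. This assumes the number after `::` is a keyword occurrence count, which the slides do not state explicitly; the function names `parse_hits` and `site_hits` are hypothetical.

```python
from urllib.parse import urlparse


def parse_hits(lines):
    """Turn 'URL :: count' lines into (url, count) pairs."""
    hits = []
    for line in lines:
        url, _, count = line.rpartition("::")
        hits.append((url.strip(), int(count)))
    return hits


def site_hits(hits, host):
    """Keep only hits on `host`, highest count first."""
    on_site = [(url, count) for url, count in hits if urlparse(url).netloc == host]
    return sorted(on_site, key=lambda hit: hit[1], reverse=True)


lines = [
    "http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23",
    "http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3",
    "http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4",
    "http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7",
]
hits = parse_hits(lines)
hindu_only = site_hits(hits, "www.hindu.com")
```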
Search Engine: Crawl, Index, Search

Architecture (diagram): TheWeb -> crawl -> Spider -> parse -> Parser -> Index -> retrieve -> Query -> ResultSet -> ResultPage

Data structures (diagram): Spider with Priority Queue and Queue; Parser; Index with AVLTree; FinalResult with MergeSort and InsertionSort; makePage -> ResultPage

Classes (diagram): Spider, Query, Index (addPage), WebSpider, Restore, Parse, Queue
PageElement: PageImg, PageHref, PageWord
DictionaryInterface: ListDictionary, TreeDictionary, HashDictionary
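The DictionaryInterface family suggests one interface with interchangeable implementations. The sketch below is a guess at that shape, not the course's actual design: the method names `insert` and `find` are assumptions, and only the list and hash variants are shown.

```python
from abc import ABC, abstractmethod


class DictionaryInterface(ABC):
    """Common interface so the index can swap its underlying structure."""

    @abstractmethod
    def insert(self, key, value):
        """Store value under key, replacing any previous value."""

    @abstractmethod
    def find(self, key):
        """Return the value stored under key, or None if absent."""


class ListDictionary(DictionaryInterface):
    """Unsorted list of pairs: simple, but every lookup is a linear scan."""

    def __init__(self):
        self._items = []

    def insert(self, key, value):
        for i, (existing, _) in enumerate(self._items):
            if existing == key:
                self._items[i] = (key, value)
                return
        self._items.append((key, value))

    def find(self, key):
        for existing, value in self._items:
            if existing == key:
                return value
        return None


class HashDictionary(DictionaryInterface):
    """Hash table backed by Python's dict: expected constant-time lookup."""

    def __init__(self):
        self._table = {}

    def insert(self, key, value):
        self._table[key] = value

    def find(self, key):
        return self._table.get(key)
```

Because both classes satisfy the same interface, the same indexing code can run against either one, which makes it straightforward to compare their performance.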
Week 3
- Tokenizer (using FSM)
- Crawling - Rules
  - Breadth-first spider
  - Priority-based spider
- Indexing
  - Keywords with their frequency of occurrence and the URLs
- Persistence
  - Saving the index to disk
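Persistence could be as simple as serializing the index to a file. This sketch assumes the index is a dictionary from keyword to per-URL counts and uses JSON as the on-disk format; both choices, along with the names `save_index` and `load_index` and the example URLs, are illustrative assumptions.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the index to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, sort_keys=True)


def load_index(path):
    """Read an index previously written by save_index."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical index: keyword -> {URL: number of occurrences}.
index = {
    "spider": {"http://example.com/a.html": 3, "http://example.com/b.html": 1},
    "queue": {"http://example.com/a.html": 2},
}
path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(index, path)
restored = load_index(path)
```

Saving the index lets the retriever start without re-crawling, since the crawl and indexing steps are the expensive off-line part.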
Week 4
- Set data structures allowing Boolean search (AND, OR)
- Client and server architecture
  - Client developed using Swing
  - Multi-threaded server
- Performance analysis
- Final demo
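Boolean search maps naturally onto set operations: AND is intersection and OR is union. The sketch below assumes the index maps each keyword to a set of URLs; the function names and the example index are hypothetical.

```python
def search_and(index, *terms):
    """URLs that contain every term (set intersection)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """URLs that contain at least one term (set union)."""
    return set().union(*(index.get(term, set()) for term in terms))


# Hypothetical index: keyword -> set of URLs containing it.
index = {
    "web": {"u1", "u2", "u3"},
    "spider": {"u2", "u3"},
    "queue": {"u3", "u4"},
}
both = search_and(index, "web", "spider")
either = search_or(index, "spider", "queue")
```

A term missing from the index contributes an empty set, so an AND query containing it returns no results while an OR query simply ignores it.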
How can we make our algorithms and data structures more clever?
What new features will our customers want?
- Targeted advertising
- Site-specific search