
Mining Massive Data Sets

Links
Hadoop Map/Reduce Tutorial
For this project, you will be implementing a text search engine using Hadoop, an Apache
Foundation implementation of map/reduce. You will create map/reduce jobs to build a search
index from text documents, and a job which uses that index to perform a search. Some parts of
the project description require a minimum of familiarity with the Hadoop architecture and
map/reduce in general.
The search engine you will be creating searches for English-language sentences (subject to some
significant simplifying assumptions) in source documents which contain all of a given list of
words. These words may occur in any order within the sentence. In addition to specifying where
these sentences can be found in the source documents, the engine will return the matching
sentences to the user.
For the sake of simplicity, a sentence is defined to start at the termination of the previous
sentence, and consist of all characters up to and including one or more instances of the
sentence-ending characters ".", "!", and "?". Leading and trailing punctuation characters
(quotation marks, apostrophes, parentheses, commas, and hyphens) are to be ignored for the
purpose of this index. As an example, the paragraph below would be split into the following list:
At that moment a loud voice, the voice of a man whose heart was inaccessible to fear, was heard.
To this voice responded others not less determined. "Is everything thrown out?" "No, here are
still 2,000 dollars in gold." A heavy bag immediately plunged into the sea.
At that moment a loud voice, the voice of a man whose heart was inaccessible to fear, was heard.
To this voice responded others not less determined.
Is everything thrown out?
No, here are still 2,000 dollars in gold.
A heavy bag immediately plunged into the sea.
Note that these rules allow for surprising divisions, such as the sentence:
"That will be three," replied Pencroft; "and with Herbert and me five. But the balloon will hold
six--"
... which would be stored as:
That will be three," replied Pencroft; "and with Herbert and me five.
But the balloon will hold six
As parsing natural language is an extremely difficult task, do not worry about such errors.
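The splitting rules above amount to a single regular-expression pass followed by trimming. The
sketch below is one minimal interpretation in plain Java; the class name, the exact strip set, and
the decision to ignore trailing text that has no sentence terminator are assumptions, not part of
the specification.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SentenceSplitter {
        // A sentence: everything up to and including a run of sentence-ending characters.
        // (Text after the final terminator is simply not matched by this sketch.)
        private static final Pattern SENTENCE = Pattern.compile("[^.!?]*[.!?]+");
        // Punctuation stripped from the start and end of each sentence; adjust this to
        // the exact set you settle on for your index.
        private static final String STRIP = "\"'(),-";

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            Matcher m = SENTENCE.matcher(text);
            while (m.find()) {
                String s = trim(m.group());
                if (s.length() > 0) {
                    sentences.add(s);
                }
            }
            return sentences;
        }

        private static String trim(String s) {
            int start = 0;
            int end = s.length();
            while (start < end && (Character.isWhitespace(s.charAt(start))
                    || STRIP.indexOf(s.charAt(start)) >= 0)) {
                start++;
            }
            while (end > start && (Character.isWhitespace(s.charAt(end - 1))
                    || STRIP.indexOf(s.charAt(end - 1)) >= 0)) {
                end--;
            }
            return s.substring(start, end);
        }
    }

Run over the sample paragraph, this produces the five sentences listed above. In the actual jobs
this logic would live inside a map operation; note that TextInputFormat supplies each line's byte
offset within its file as the LongWritable key, which is useful for the offsets required in Part
Three.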
Likewise, we will define a "word" as a series of characters consisting of only characters from the
set [a-zA-Z'], and beginning and ending only with characters from the set [a-zA-Z], or a series of
digits. (That is to say, [a-zA-Z]|[0-9]+|[a-zA-Z][a-zA-Z']*[a-zA-Z], as a regular expression.)
Your index should cover all such sequences in the source text, even if they are not whitespace-
separated. Search terms are to be case-insensitive.
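That regular expression can be used essentially verbatim with java.util.regex; the only change in
the sketch below is ordering the alternatives so that the multi-letter form is tried first, since Java
alternation takes the first branch that matches at a given position. The class name and the
lower-casing step (one simple way to get case-insensitive terms) are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class WordExtractor {
        // [a-zA-Z]|[0-9]+|[a-zA-Z][a-zA-Z']*[a-zA-Z], reordered so the multi-letter
        // alternative is preferred over the single-letter one.
        private static final Pattern WORD =
            Pattern.compile("[a-zA-Z][a-zA-Z']*[a-zA-Z]|[a-zA-Z]|[0-9]+");

        public static List<String> words(String text) {
            List<String> result = new ArrayList<String>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                // Lower-case indexed words (and, elsewhere, search terms) so that
                // matching is case-insensitive.
                result.add(m.group().toLowerCase());
            }
            return result;
        }
    }

Because find() scans the whole string, word sequences that are not whitespace-separated (for
example, a word jammed against punctuation) are still picked up, as the index is required to do.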
Project Details
Your project will be divided into three parts.
Part One
The first part of your project is a prose description of your project implementation. It should
specify the indices you build in Part Two, as well as how you use them in Part Three to perform
your search. Additionally, it should explain the structure of your code at a high level, including
information such as the map/reduce operations utilized for building and processing your indices.
Please do not include detailed class or code documentation here, but do describe each map
operation, reduce operation, record reader, etc. at a "black box" level. For example, for the
sample WordCount class, you might say:
WordCount uses a WordMap map operation which takes a WritableComparable and a Text object
representing a line of text, and outputs a <Text, IntWritable> tuple for each word in that line of
text with the IntWritable having a value of 1. It then reduces this using the CounterReduce
reduction, which accepts any WritableComparable object and a vector of IntWritable occurrence
counts, and sums these counts. The output format is a directory of files containing whitespace-
separated words and their counts, one word and count pair to a line.
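For concreteness, that black-box description corresponds roughly to the WordCount example from
the Hadoop Map/Reduce Tutorial, written against the classic org.apache.hadoop.mapred API; the
inner class names here follow the description above, and LongWritable stands in for the
WritableComparable line key.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
        // Map: one line of text in, a <word, 1> pair out for every whitespace-separated word.
        public static class WordMap extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Reduce: sum the occurrence counts collected for each word.
        public static class CounterReduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }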
We suggest that you design and document your index creation and search procedures first, to
ensure that you have considered the index you will be implementing carefully, and understand
how you will use it to perform the search.
This description must also document the top-level class names you chose for each of the two
tasks implemented in Part Two and Part Three, as used by the hadoop jar command.
Part Two
For the second part of your project you will implement a Hadoop task which, given a set of
source documents and an index directory, builds an index which can be used to perform the
search specified in the overview over those documents. The format of this index is up to you, but
it should be designed to be appropriate for the map/reduce architecture. In your description in
Part One, you must explain the format of your index, as well as compare and contrast searching
your index using map/reduce with a simple single-pass search through the source files. Explain
why your index is appropriate for the map/reduce process.
Your task must accept two command line arguments. The first argument is the name of an HDFS
directory which contains source documents to be indexed. The second argument is the name of a
non-existent HDFS directory which your task will create and use to store the index it builds.
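As a sketch of how those two arguments might be wired into a job with the JobConf API (the class
name BuildIndex is a placeholder, and the mapper, reducer, and output types for your actual
index are left to you):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class BuildIndex {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(BuildIndex.class);
            conf.setJobName("build-index");

            // args[0]: existing HDFS directory containing the source documents.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            // args[1]: non-existent HDFS directory; the job creates it and writes the index there.
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            // conf.setMapperClass(...) and conf.setReducerClass(...) would name the
            // index-building operations described in Part One.

            JobClient.runJob(conf);
        }
    }

Such a top-level class would then be launched with something like
hadoop jar yourproject.jar BuildIndex <source dir> <index dir>, using whatever class and jar names
you document in Part One.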
Part Three
Part Three of your project is to implement a Hadoop task which consults the document index
created by Part Two and the original source documents, and performs the search. It must create
as its output a directory of files containing matches in the source text, as lines of the form:
<source document name>:<byte offset> <sentence>

For example, assuming that the sample text in the project description is in the file
mysterious_island.txt, and a search is performed for the terms "voice" and "not", the output file
might look like:
mysterious_island.txt:12345 To this voice responded others not less determined.

There should be one line for each matching sentence in the source text. You need not worry
about result ordering, or division among output files. You may use the Hadoop-provided
TextOutputFormat to write your results if you like.
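If you do use TextOutputFormat, one convenient pattern is to build the entire output line as the
key and emit a NullWritable value, since TextOutputFormat then writes only the key text, one line
per record. The reducer below is a hypothetical final step which assumes its input key already
carries the document name and byte offset; your own job structure may of course differ.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical final reduce: the key is assumed to hold
    // "<source document name>:<byte offset>" and each value is a matching sentence.
    public class MatchOutputReduce extends MapReduceBase
            implements Reducer<Text, Text, Text, NullWritable> {
        public void reduce(Text location, Iterator<Text> sentences,
                           OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            while (sentences.hasNext()) {
                // The whole "<location> <sentence>" line goes into the key; with a
                // NullWritable value, TextOutputFormat writes just the key and a newline.
                output.collect(new Text(location + " " + sentences.next()), NullWritable.get());
            }
        }
    }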
References
For assistance, you will wish to refer to some or all of the following:
The Google MapReduce paper
The Hadoop documentation
The Hadoop Map/Reduce Tutorial
The Java 1.5.0 documentation
