
SEARCH LUCENE & SOLR

Allahbaksh Mohammedali Asadullah

Today's Overview

Search Engine
- What is a search engine
- Different search engines

Lucene
- Indexing
- Analyzer
- Searcher

Solr
- What is Solr
- Config files
- Deployment
- Indexing
- Searching
- Features

Different Information Retrieval Libraries
Lucene, Minion, ...

What is Lucene
- Lucene is an open source information retrieval software library (Apache License).
- Present version: 3.5
- Lucene takes text as input and creates an index on it.
- It stores the index in the file system (the index can live in RAM or on hard disk).
- You can search over the Lucene index.
- An index consists of documents, and each document holds field-value pairs.
- Fields contain classified information about the document.
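The document and field model above can be pictured with a toy inverted index. This is a minimal sketch in plain Java, not Lucene code: the class ToyIndex and its methods are hypothetical names invented for illustration, and a real Lucene index stores far more (positions, norms, segments).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: a hypothetical class, not part of Lucene.
class ToyIndex {
    // Each stored document is a map of field name -> field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> sorted ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stores the document, indexes every whitespace-separated term, and returns the document id.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of the documents whose given field contains the term.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Returns the stored field-value pairs of a document.
    Map<String, String> document(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> fields = new HashMap<>();
        fields.put("contents", "Lucene is a search library");
        int id = index.addDocument(fields);
        System.out.println(index.search("contents", "search").contains(id));
    }
}
```

The point of the sketch is the direction of the lookup: searching goes from a term to the documents that contain it, which is why the text has to be indexed before it can be searched.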

What Lucene is not
- It is not a text extraction library.
- It is not a crawler (robot).
- It is not a search server.
- It is not a text analytics library.

Terminology
- Analyzer (org.apache.lucene.analysis)
- IndexWriter, IndexReader (org.apache.lucene.index)
- Document, Field (org.apache.lucene.document)
- IndexSearcher (org.apache.lucene.search)

Lucene Indexing
- org.apache.lucene.index.IndexWriter creates the index.
    IndexWriter writer = new IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl);
  where
    d: the directory in which to store the index
    a: the analyzer for the content of the files
    create: a boolean that indicates whether a new index is created (true) or an existing index is extended (false)
    mfl: the maximum number of terms indexed per field, e.g. IndexWriter.MaxFieldLength.UNLIMITED (required by this constructor in Lucene 3.x)
- Create an instance of Document.
    Document doc = new Document();

Lucene Indexing - contd


- Add the required fields to the document.
    doc.add(new Field("contents", <content>, Field.Store.YES, Field.Index.ANALYZED));
- Add the document to the writer object.
    writer.addDocument(doc);
- Once indexing is done, invoke optimize(). It merges all segments into a single segment, optimizing the index for search.
    writer.optimize();
- Close the writer object.
    writer.close();

Analyzer

The quick brown fox jumps over the lazy dog.

WhitespaceAnalyzer - simplest built-in analyzer
Splits at whitespace only; case and punctuation are kept

The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

SimpleAnalyzer
Lowercases, splits at non-letter boundaries

The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

StopAnalyzer
Lower-cases and removes stop words

The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]

SnowballAnalyzer
Lowercases and applies a stemming algorithm

The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
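The difference between the first three analyzers can be reproduced in a few lines of plain Java. This is a rough sketch that only mimics their behavior: ToyAnalyzers is a hypothetical class, it does not use Lucene's TokenStream API, and its stop list is a small illustrative assumption rather than Lucene's default set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: mimics the analyzers' output, not their implementation.
class ToyAnalyzers {
    // Small illustrative stop list (an assumption, not Lucene's default set).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: lowercase, then split at every non-letter character.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple analysis with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(text)); // [The, quick, brown, fox, jumps, over, the, lazy, dog.]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumps, over, the, lazy, dog]
        System.out.println(stop(text));       // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

Stemming is left out on purpose: a faithful stemmer is much longer than a slide example, and the same analyzer must be used at index time and at query time for the tokens to match.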

Lucene Searching
- Create an IndexSearcher object.
    IndexSearcher searcher = new IndexSearcher(Directory indexDir, boolean readOnly);
- Create a QueryParser for the default field and analyzer.
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, String defaultField, new StandardAnalyzer(Version.LUCENE_CURRENT));
- Parse the query string into a Query object.
    Query queryObj = parser.parse(String query);
- Invoke IndexSearcher.search(Query, int), which returns a TopDocs object.
    TopDocs topDoc = searcher.search(Query query, int topHits);

Lucene Searching - contd
- Get the ScoreDoc array from the TopDocs object.
    ScoreDoc[] scoreDocs = topDoc.scoreDocs;
- From each ScoreDoc, get the document id.
    int docId = scoreDocs[i].doc;
- Get the document with IndexSearcher.doc(docId).
    Document doc = searcher.doc(docId);
- Return the document to your application for further processing.
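The shape of the search result (a ranked list of document ids with scores, cut off at the requested number of hits) can be sketched without Lucene. The class ToySearcher and its Hit type below are hypothetical, and the score is a plain term count, which is an assumption made for brevity and not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: a hypothetical class, not part of Lucene.
class ToySearcher {
    // Stand-in for a (document id, score) pair.
    static final class Hit {
        final int docId;
        final int score;

        Hit(int docId, int score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Scores each document by how often the term occurs and returns at most topHits hits, best first.
    static List<Hit> search(List<String> docs, String term, int topHits) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            int count = 0;
            for (String token : docs.get(docId).toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (token.equals(needle)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(docId, count));
            }
        }
        // Highest score first; List.sort is stable, so ties keep document order.
        hits.sort((x, y) -> Integer.compare(y.score, x.score));
        int limit = Math.max(0, Math.min(topHits, hits.size()));
        return new ArrayList<>(hits.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("Lucene is a library");
        docs.add("Solr builds on Lucene; Lucene does the indexing");
        for (Hit hit : search(docs, "lucene", 10)) {
            System.out.println("doc " + hit.docId + " score " + hit.score);
        }
    }
}
```

As in the steps above, the caller receives ids and scores first and fetches the stored document for an id only when it is needed.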

What is Solr
- Solr is pronounced "solar". The name stands for "Searching on Lucene".
- It is a web-based indexing and searching server.
- By default, it comes bundled with the Jetty server.
- It can also be deployed in any other servlet container, such as Tomcat or Resin.

Advantages of Solr
- Solr can replicate an index on multiple servers.
- It uses REST-based web services for indexing and searching.
- Indexing and searching can be done simultaneously.
- Supports faceted searching.
- Supports result clustering.
- Supports hit highlighting.
- Supports multiple output formats (XML/XSLT and JSON).

Deploying Solr
Following are the steps for deploying Solr in Jetty:
- Download Solr and install it.
- The root directory of Solr [e.g. D:\tools\solr] is referred to as SOLR_HOME.
- Start the Solr service by executing start.jar, present in SOLR_HOME/example/:
    java -jar start.jar
- The default port for Jetty is 8983. Once the service has started, open the URL
    http://localhost:8983/solr
  The Solr admin screen appears.

Important Config Files


solrconfig.xml - Describes the configuration of the server
- Response handler
- Faceted search
- Clustering of results
- Query parser
- Master/slave replication

schema.xml - Describes the data types
- Field types
- Analyzer and tokenizer used on fields
- Copy fields
- Default field

Solr Indexing

- Solr indexing is done using Solr language clients, available for Java, Ruby, C etc.
- Using the Java language client, SolrJ:
  - Copy the required jars (around 7 to 8) into the lib folder of the Eclipse project.
  - Modify schema.xml to declare the necessary fields along with their analyzer:
      <fields>
        <field name="employeeid" type="integer" indexed="true" stored="true"/>
      </fields>
  - In the Java project, use the SolrJ API for indexing:
      1. Initialize SolrServer.
      2. Create a document.
      3. Insert fields into the document.
      4. Add the document to the server.
      5. Commit.
  - Run the application.

Solr Indexing - contd


- Initialize SolrServer
    SolrServer _server = new CommonsHttpSolrServer("http://localhost:8983/solr");
- Insert fields into the document
    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("employeeid", "1234");
- Add the document to the server
    _server.add(doc1);
- Commit
    _server.commit();
- Executing the application
  - Before running the application, ensure the Solr service has started.
  - Execute the application.
  - View the result in Solr by running queries at http://localhost:8983/solr
  - Output, by default, is in XML format.
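Behind the client calls, adding a document comes down to sending Solr an update message. The sketch below builds such a message in Solr's XML update format using only the JDK; the class SolrUpdateXml and its methods are hypothetical helpers, and the exact bytes a given SolrJ version puts on the wire may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only: a hypothetical helper, not part of SolrJ.
class SolrUpdateXml {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    // Builds an <add> message holding one document with the given fields, in insertion order.
    static String addMessage(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    // The message that makes previously added documents visible to searches.
    static String commitMessage() {
        return "<commit/>";
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("employeeid", "1234");
        System.out.println(addMessage(fields));
        System.out.println(commitMessage());
    }
}
```

Field names in the message must match fields declared in schema.xml, which is why the schema is edited before the first document is added.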

Solr Search

- One option for searching the Solr index is SolrJ. The user defines the server, creates a query object, and sends the query to the server to fetch the response.
    SolrServer _server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery solrquery = new SolrQuery();
    solrquery.setQuery(<enter query here>);
    QueryResponse rsp = _server.query(solrquery);
- In Solr, search queries are processed by the appropriate SolrRequestHandler.
- Range, prefix, boolean and wildcard queries are allowed in Solr.
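Because Solr exposes search over HTTP, the same query can also be written as a URL. The sketch below assembles one with the JDK only; SolrSelectUrl is a hypothetical helper, and the /select path with the q, rows and wt parameters is assumed to match the default request handler setup of the example server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Toy illustration only: a hypothetical helper, not part of SolrJ.
class SolrSelectUrl {
    // Builds a search URL: q carries the query, rows caps the hits, wt selects the output format.
    static String build(String baseUrl, String query, int rows) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&rows=" + rows + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        // A range query on the employeeid field used in the indexing example.
        System.out.println(build("http://localhost:8983/solr", "employeeid:[1000 TO 2000]", 10));
    }
}
```

Encoding the query matters because range and boolean queries contain characters such as brackets, colons and spaces that are not valid in a raw URL.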

Solr Nuts and Bolts
- Solr is built on top of Lucene.
- Solr uses different handlers to perform different functions:
  - Data import handler
  - Request handler
  - Response handler
  - Faceted search handler

Why Solr
- Replication of the index
- Scalable and fault tolerant (depending upon the underlying infrastructure)
- Built-in faceted search capability
- Updating of indexes
- Distributed search
- Load balancing

Resources
http://lucene.apache.org
http://lucene.apache.org/solr
http://minion.dev.java.net/

Thank You

QUESTIONS?
