Bill Karwin
MySQL University 2009-12-3
Me
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
http://www.flickr.com/photos/tryingyouth/
Test Data
StackOverflow ER diagram
searchable text
Naive Searching
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
Accuracy issue
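The slide's own example did not survive extraction. A common accuracy problem with naive pattern matching, assumed here to be the one meant, is that a substring test also matches inside longer words. The following is a minimal sketch; the class name, method names, and sample strings are all illustrative.

```java
import java.util.regex.Pattern;

public class NaiveMatch {
    // Substring test, comparable to LIKE '%word%': matches anywhere,
    // including inside other words.
    static boolean substringMatch(String text, String word) {
        return text.toLowerCase().contains(word.toLowerCase());
    }

    // Whole-word test using regular-expression word boundaries.
    static boolean wholeWordMatch(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] docs = {"Pick one option", "Save money on hosting", "A lonely query"};
        for (String d : docs) {
            System.out.println(d + " | substring=" + substringMatch(d, "one")
                    + " | wholeWord=" + wholeWordMatch(d, "one"));
        }
    }
}
```

The substring test reports all three sample strings as matches for "one", while the word-boundary test accepts only the first.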
Performance issue
time: 22 sec
Why so slow?
CREATE TABLE telephone_book (
  full_name VARCHAR(50)
);
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');
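The telephone book makes the point that an ordinary index is sorted by the leading characters of the value, so it helps only when the search is anchored at the start of the string. The sketch below imitates this with a sorted set standing in for the index; it is an analogy under that assumption, not how MySQL implements its B-tree, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixVsSubstring {
    // Prefix lookup: the sorted structure can jump straight to the matching
    // range, comparable to LIKE 'Thomas%' using name_idx.
    static List<String> prefixSearch(TreeSet<String> index, String prefix) {
        SortedSet<String> range = index.subSet(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(range);
    }

    // Substring lookup: sort order gives no help, so every entry is examined,
    // comparable to LIKE '%Thomas%' scanning the whole table.
    static List<String> substringSearch(TreeSet<String> index, String needle) {
        List<String> hits = new ArrayList<>();
        for (String name : index) {
            if (name.contains(needle)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();
        index.add("Riddle, Thomas");
        index.add("Thomas, Dean");
        System.out.println(prefixSearch(index, "Thomas"));
        System.out.println(substringSearch(index, "Thomas"));
    }
}
```

Only the substring search finds "Riddle, Thomas", and it has to visit every entry to do so.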
Why so slow?
Solutions
1. Full-Text Indexing in SQL
2. Sphinx Search
3. Apache Lucene
4. Inverted Index
5. Search Engine Service
Special index type for MyISAM
Integrated with SQL queries
Balances features vs. speed vs. space
MySQL FULLTEXT:
Indexing
MySQL FULLTEXT:
Index Caching
SET GLOBAL key_buffer_size = 600*1024*1024;
LOAD INDEX INTO CACHE Posts INDEX(PostText);
time: 11 sec
MySQL FULLTEXT:
Querying
SELECT * FROM Posts
WHERE MATCH( column(s) ) AGAINST( query pattern );
MATCH() must include all columns of the index, in the order defined
MySQL FULLTEXT:
Boolean Mode
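The slide's example query was not preserved. As a rough illustration of the two best-known boolean-mode operators, the sketch below models "+word" as required and "-word" as excluded, with bare words optional. It is a simplified model written for this note, not MySQL's implementation: it ignores relevance ranking, stopwords, minimum word length, and the other operators. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BooleanModeSketch {
    // Simplified rules: every "+word" must be present, no "-word" may be
    // present, and a document needs at least one positive hit to match
    // (so a query made only of "-word" terms matches nothing).
    static boolean matches(String query, String document) {
        Set<String> words = new HashSet<>(
                Arrays.asList(document.toLowerCase().split("\\W+")));
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (term.startsWith("+")) {
                anyRequired = true;
                if (!words.contains(term.substring(1))) {
                    return false; // a required word is missing
                }
            } else if (term.startsWith("-")) {
                if (words.contains(term.substring(1))) {
                    return false; // an excluded word is present
                }
            } else if (words.contains(term)) {
                anyOptionalHit = true;
            }
        }
        return anyRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        System.out.println(matches("+mysql -oracle", "Tuning MySQL performance"));
        System.out.println(matches("+mysql -oracle", "MySQL vs Oracle"));
    }
}
```

In SQL the equivalent query would pass the same pattern to AGAINST( ... IN BOOLEAN MODE).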
Lucene
Apache Project since 2001
Apache License
Java implementation
Ports exist for other languages:
  Lucy (C)
  Lucene.NET (C#)
  Zend_Search_Lucene (PHP)
Lucene:
How to use
Lucene:
Creating an index
Lucene:
Indexing
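import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Version;

// connect to MySQL; the credentials are carried in the URL
String url = "jdbc:mysql://localhost/stackoverflow?" +
    "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(url);

// stream the result set instead of buffering every row in memory
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
com.mysql.jdbc.Statement stmt =
    (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();
ResultSet rs = stmt.executeQuery(sql);

// open the index writer; the first argument of this call was lost in
// extraction, so indexDir is an assumed name for the index directory
IndexWriter writer = new IndexWriter(indexDir,
    new StandardAnalyzer(Version.LUCENE_CURRENT), true,
    IndexWriter.MaxFieldLength.LIMITED);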
Lucene:
Indexing
loop over SQL result
while (rs.next()) {
  Document doc = new Document();
  // PostId is stored for retrieval but not searchable
  doc.add(new Field("PostId", rs.getString("PostId"),
      Field.Store.YES, Field.Index.NO));
  // the text columns are stored and analyzed for searching
  doc.add(new Field("Title", rs.getString("Title"),
      Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Body", rs.getString("Body"),
      Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Tags", rs.getString("Tags"),
      Field.Store.YES, Field.Index.ANALYZED));
  // add each document inside the loop, while doc is in scope
  writer.addDocument(doc);
}
writer.optimize();
writer.close();
Lucene:
Querying
define fields
Sphinx Search
Sphinx Search:
How to use
1. Edit configuration file
2. Index the data
3. Query the index
4. Issues
Sphinx Search:
sphinx.conf
source stackoverflowsrc {
  type = mysql
  sql_host = localhost
  sql_user = myappuser
  sql_pass = xxxx
  sql_db = stackoverflow
  sql_query = SELECT PostId, Title, Body, Tags FROM Posts
  sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}
Sphinx Search:
sphinx.conf
Sphinx Search:
Building index
indexer -c sphinx.conf stackoverflow
collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec
Sphinx Search:
Querying index
Sphinx Search:
Issues
Cost to update index = cost to build index
Build a main index plus a delta index for recent changes
Merge indexes periodically (much less costly)
But not all data fits into this model; e.g. good for a forum, but bad for a wiki
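The main-plus-delta idea can be sketched in memory as follows. This is a toy model under stated assumptions, not Sphinx's implementation: both indexes are plain word-to-document-ID maps, and the class and method names are illustrative. It handles only newly added documents, which is why the approach suits append-mostly data such as a forum better than heavily edited data such as a wiki.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MainDeltaIndex {
    private final Map<String, Set<Integer>> main = new HashMap<>();
    private final Map<String, Set<Integer>> delta = new HashMap<>();

    private static void add(Map<String, Set<Integer>> index, int docId, String text) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                index.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Expensive step: rebuild the main index from the whole corpus.
    void buildMain(Map<Integer, String> corpus) {
        main.clear();
        corpus.forEach((id, text) -> add(main, id, text));
    }

    // Cheap step: recent documents go only into the small delta index.
    void addRecent(int docId, String text) {
        add(delta, docId, text);
    }

    // A query consults both indexes and unions the results.
    Set<Integer> search(String word) {
        String key = word.toLowerCase();
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(main.getOrDefault(key, Collections.<Integer>emptySet()));
        hits.addAll(delta.getOrDefault(key, Collections.<Integer>emptySet()));
        return hits;
    }

    // Periodic merge: fold the delta into main and start a fresh delta.
    void merge() {
        delta.forEach((w, ids) -> main.computeIfAbsent(w, k -> new TreeSet<>()).addAll(ids));
        delta.clear();
    }

    public static void main(String[] args) {
        MainDeltaIndex idx = new MainDeltaIndex();
        Map<Integer, String> corpus = new HashMap<>();
        corpus.put(1, "MySQL full-text search");
        idx.buildMain(corpus);
        idx.addRecent(2, "Sphinx search engine");
        System.out.println(idx.search("search"));
        idx.merge();
        System.out.println(idx.search("search"));
    }
}
```

Search results are the same before and after the merge; the merge only changes which index holds the postings.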
Inverted Index
Inverted index
Many-to-many relationship between Posts and words
Tables: Posts, PostTags, Tags (Tags holds the searchable words)
Inverted index:
Updated ER Diagram
new tables
Inverted index:
Data definition
CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);
CREATE TABLE PostTags (
  PostId INT NOT NULL,
  -- SERIAL is an alias for BIGINT UNSIGNED, so the referencing column must match
  TagId BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);
Inverted index:
Indexing
1. Query all Posts.Tags strings, e.g. <mysql><search><performance>
2. Loop over tag strings
3. Dump two CSV files: Tags.csv and PostTags.csv
time: 23.5 seconds
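The indexing steps above can be sketched as follows. This is a minimal illustration, not the script used in the talk: the class and method names are made up, tag IDs are simply assigned in order of first appearance, rows are printed rather than written to Tags.csv and PostTags.csv, and tags are assumed to contain no commas or quotes that would need CSV escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCsvSketch {
    // One tag is whatever sits between a pair of angle brackets.
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    final Map<String, Integer> tagIds = new LinkedHashMap<>(); // Tag -> TagId
    final Set<String> postTagRows = new LinkedHashSet<>();     // "PostId,TagId"

    // Split one Posts.Tags string and record the rows it produces.
    void addPost(int postId, String tagString) {
        Matcher m = TAG.matcher(tagString);
        while (m.find()) {
            String tag = m.group(1);
            Integer tagId = tagIds.get(tag);
            if (tagId == null) {
                tagId = tagIds.size() + 1; // next unused ID
                tagIds.put(tag, tagId);
            }
            // the set drops a repeated tag, keeping (PostId, TagId) unique
            postTagRows.add(postId + "," + tagId);
        }
    }

    // Lines for Tags.csv in the form "TagId,Tag".
    List<String> tagRows() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tagIds.entrySet()) {
            rows.add(e.getValue() + "," + e.getKey());
        }
        return rows;
    }

    public static void main(String[] args) {
        TagCsvSketch sketch = new TagCsvSketch();
        sketch.addPost(1, "<mysql><search><performance>");
        sketch.addPost(2, "<mysql><sphinx>");
        sketch.tagRows().forEach(System.out::println);
        sketch.postTagRows.forEach(System.out::println);
    }
}
```

Files in this shape could then be bulk-loaded into the Tags and PostTags tables, for example with LOAD DATA INFILE.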
Inverted index:
Querying
SELECT p.* FROM Posts p
JOIN PostTags pt USING (PostId)
JOIN Tags t USING (TagId)
WHERE t.Tag = 'performance';
time: 250 milliseconds
Inverted Index:
Best for searching selected words
Simple, portable, standard SQL
Not as fast as specialized technology, but far better than using LIKE
http://www.google.com/cse/
DEMO
http://www.karwin.com/demo/gcse-demo.html
Your site is public and allows external indexing
Search is a non-critical feature for you
Search results are satisfactory
You need to offload search processing
Comparison: Bottom-Line
Columns: solution, indexing, storage, query
Rows: LIKE expression, MySQL FULLTEXT, Apache Lucene, Sphinx Search, Inverted index, Google / Yahoo!
Surviving values (cell positions lost in extraction): 2000x, 6x, 10x, 1x, 20x, *
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/
You are free to share (to copy, distribute and transmit this work) under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build upon this work.