Practical Full Text Search

Practical full-text search in MySQL
Bill Karwin
MySQL University 2009-12-3
Me
20+ years experience
SQL maven
Community contributor
MySQL, PostgreSQL, InterBase
Zend Framework
Oracle, SQL Server, IBM DB2, SQLite
Application/SDK developer
Support, Training, Proj Mgmt
C, Java, Perl, PHP
Full Text Search
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
http://www.flickr.com/photos/tryingyouth/
Test Data
StackOverflow.com data dump, exported October 2009
1.5 million tuples
~1 gigabyte
StackOverflow ER diagram
searchable text
Naive Searching
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
Jamie Zawinski
Accuracy issue
Irrelevant or false matching words: one, money, prone, etc.
body LIKE '%one%'
Regular expressions in MySQL support escapes for word boundaries:
body RLIKE '[[:<:]]one[[:>:]]'
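The difference between the two predicates can be sketched outside the database. The following is a minimal Java illustration, not MySQL's implementation: the class name, method names and sample strings are all made up for this example, and Java's `\b` stands in for MySQL's `[[:<:]]` and `[[:>:]]` markers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy comparison of substring matching and whole-word matching.
// All names and sample data here are hypothetical.
class WordMatchDemo {
    // Similar in spirit to: body LIKE '%one%'
    static List<String> substringMatches(List<String> docs, String word) {
        List<String> hits = new ArrayList<>();
        for (String d : docs) {
            if (d.contains(word)) hits.add(d);
        }
        return hits;
    }

    // Similar in spirit to: body RLIKE '[[:<:]]one[[:>:]]'
    static List<String> wholeWordMatches(List<String> docs, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        List<String> hits = new ArrayList<>();
        for (String d : docs) {
            if (p.matcher(d).find()) hits.add(d);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("only one left", "save money", "prone to errors");
        System.out.println(substringMatches(docs, "one"));  // all three strings
        System.out.println(wholeWordMatches(docs, "one"));  // only the first string
    }
}
```

The substring test accepts "money" and "prone" because it never looks at what surrounds the match; the word-boundary test rejects them.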
Performance issue
LIKE with wildcards:
SELECT * FROM Posts WHERE body LIKE '%performance%'
time: 22 sec
POSIX regular expressions:
SELECT * FROM Posts WHERE body RLIKE 'performance'
time: 108 sec
Why so slow?
CREATE TABLE telephone_book ( full_name VARCHAR(50) );
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES ('Riddle, Thomas'), ('Thomas, Dean');
Why so slow?
Search for all with last name Thomas uses index:
SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%'
Search for all with first name Thomas doesn't use index:
SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas'
Indexes don't help searching for substrings
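The telephone-book point can be sketched with a sorted set standing in for the index. This is only an analogy under stated assumptions: a Java `TreeSet` is not a MySQL B-tree index, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// A sorted set as a stand-in for an index on full_name (hypothetical names).
class SortedIndexDemo {
    // Prefix search: sort order lets us jump to the first candidate and
    // stop at the first entry that no longer starts with the prefix.
    static List<String> prefixSearch(TreeSet<String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String name : index.tailSet(prefix, true)) {
            if (!name.startsWith(prefix)) break;
            hits.add(name);
        }
        return hits;
    }

    // Suffix search: sort order says nothing about how a value ends,
    // so every entry has to be examined.
    static List<String> suffixSearch(TreeSet<String> index, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String name : index) {
            if (name.endsWith(suffix)) hits.add(name);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(List.of("Riddle, Thomas", "Thomas, Dean"));
        System.out.println(prefixSearch(index, "Thomas"));  // last name Thomas
        System.out.println(suffixSearch(index, "Thomas"));  // first name Thomas
    }
}
```

The prefix search touches only the matching range; the suffix search is a full scan no matter how the set is ordered, which is the situation LIKE '%Thomas' puts the database in.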
Solutions
1. Full-Text Indexing in SQL
2. Sphinx Search
3. Apache Lucene
4. Inverted Index
5. Search Engine Service
MySQL FULLTEXT Index
MySQL FULLTEXT Index
Special index type for MyISAM
Integrated with SQL queries
Balances features vs. speed vs. space
MySQL FULLTEXT:
Indexing
CREATE FULLTEXT INDEX PostText ON Posts(title, body, tags);

time: 15 min 6 sec
MySQL FULLTEXT:
Index Caching
SET GLOBAL key_buffer_size = 600*1024*1024;
LOAD INDEX INTO CACHE Posts INDEX(PostText);
time: 11 sec
MySQL FULLTEXT:
Querying
SELECT * FROM Posts WHERE MATCH( column(s) ) AGAINST( 'query pattern' );
MATCH() must include all columns of the index, in the order defined
MySQL FULLTEXT:
Natural Language Mode

Searches concepts with free text queries:
SELECT * FROM Posts WHERE MATCH( title, body, tags ) AGAINST('improving mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100;
time with index: 80 milliseconds
MySQL FULLTEXT:
Boolean Mode
Searches words using mini-language:
SELECT * FROM Posts WHERE MATCH( title, body, tags ) AGAINST('+mysql +performance' IN BOOLEAN MODE);
time with index: 50 milliseconds
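To make the mini-language concrete, here is a toy Java matcher for two of its operators: `+word` (must be present) and `-word` (must be absent). It is a sketch only. It ignores ranking, stopwords, minimum word length and every other operator, and it says nothing about how MySQL evaluates the query internally. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy evaluation of '+required' and '-excluded' terms (hypothetical names).
class BooleanQueryDemo {
    // Split a document into lowercase words.
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));
    }

    // Returns true when every '+' term is present and no '-' term is present.
    // Terms without an operator are ignored in this sketch.
    static boolean matches(String query, String document) {
        Set<String> docWords = words(document);
        for (String term : query.trim().split("\\s+")) {
            if (term.length() < 2) continue;
            String word = term.substring(1).toLowerCase();
            if (term.startsWith("+") && !docWords.contains(word)) return false;
            if (term.startsWith("-") && docWords.contains(word)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("+mysql +performance", "Improving MySQL performance"));
        System.out.println(matches("+mysql +performance", "MySQL tips"));
    }
}
```

With the query from the slide, a document has to contain both "mysql" and "performance" to match.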
Lucene
Lucene
Apache Project since 2001
Apache License
Java implementation
Ports exist for other languages:
Lucy (C)
Lucene.NET (C#)
Zend_Search_Lucene (PHP)
PyLucene (Python)
Plucene (Perl)
Ferret (Ruby)
Lucene:
How to use
1. Add documents to index
2. Parse query
3. Execute query
Lucene:
Creating an index
Programmatic solution in Java...

time: 6 minutes, 50 seconds
Lucene:
Indexing
// connect to MySQL (credentials are passed in the URL)
String url = "jdbc:mysql://localhost/stackoverflow?" + "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(url);

// any SQL query
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
com.mysql.jdbc.Statement stmt = (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();
ResultSet rs = stmt.executeQuery(sql);

// open Lucene index writer
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true,
    IndexWriter.MaxFieldLength.LIMITED);
Lucene:
Indexing
// loop over SQL result
while (rs.next()) {
    // each row is a Document with four Fields
    Document doc = new Document();
    doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
    doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
}

// finish and close index
writer.optimize();
writer.close();
Lucene:
Querying
Parse a Lucene query

// define fields
String[] fields = new String[3];
fields[0] = "Title";
fields[1] = "Body";
fields[2] = "Tags";

// parse search query
Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse("performance");

Execute the query

Searcher s = new IndexSearcher(indexDirectory, true);
Hits h = s.search(q);

time: 120 milliseconds
Sphinx Search
Sphinx Search
Started in 2001
GPLv2 license
Good database integration: SphinxSE storage engine for MySQL
Sphinx Search:
How to use
1. Edit configuration file
2. Index the data
3. Query the index
4. Issues
Sphinx Search:
sphinx.conf
source stackoverflowsrc
{
    type = mysql
    sql_host = localhost
    sql_user = myappuser
    sql_pass = xxxx
    sql_db = stackoverflow
    sql_query = SELECT PostId, Title, Body, Tags FROM Posts
    sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}
Sphinx Search:
sphinx.conf
index stackoverflow
{
    source = stackoverflowsrc
    path = /opt/local/var/db/sphinx/stackoverflow
}
Sphinx Search:
Building index
indexer -c sphinx.conf stackoverflow
collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec
time: 2 min 27 sec
Sphinx Search:
Querying index
search -c sphinx.conf -i stackoverflow -b "sql & performance"

time: 12 milliseconds
Sphinx Search:
Issues
Cost to update index = cost to build index
Build a main index plus a delta index for recent changes
Merge indexes periodically (much less costly)
But not all data fits into this model; e.g. good for a forum, but bad for a wiki
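The main-plus-delta idea can be sketched with two in-memory maps. This is a toy model in Java, not Sphinx's on-disk format or API; the class, fields and methods are hypothetical. New documents go into the small delta index, searches consult both, and a periodic merge folds the delta into the main index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy main + delta index: word -> set of document ids (hypothetical names).
class MainDeltaDemo {
    final Map<String, Set<Integer>> main = new HashMap<>();
    final Map<String, Set<Integer>> delta = new HashMap<>();

    // Add every word of a document to the given index.
    static void index(Map<String, Set<Integer>> idx, int docId, String text) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) idx.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
        }
    }

    // Recent documents are indexed into the small delta only.
    void addRecent(int docId, String text) {
        index(delta, docId, text);
    }

    // A search has to consult both indexes.
    Set<Integer> search(String word) {
        Set<Integer> hits = new TreeSet<>(main.getOrDefault(word, Set.of()));
        hits.addAll(delta.getOrDefault(word, Set.of()));
        return hits;
    }

    // Periodic merge: fold the delta into main, then start a fresh delta.
    void merge() {
        delta.forEach((w, ids) -> main.computeIfAbsent(w, k -> new TreeSet<>()).addAll(ids));
        delta.clear();
    }

    public static void main(String[] args) {
        MainDeltaDemo d = new MainDeltaDemo();
        index(d.main, 1, "mysql performance");
        d.addRecent(2, "sphinx performance");
        System.out.println(d.search("performance"));
        d.merge();
        System.out.println(d.search("performance"));
    }
}
```

The sketch only ever appends documents, which suits a forum where new posts arrive and old ones rarely change. A wiki edits existing pages in place, and this append-only model has no way to retire the stale entries for an edited page, which is one plausible reading of the forum-versus-wiki remark.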
Inverted Index
Inverted index
many-to-many relationship between Posts and words
[Diagram: Posts, PostTags, Tags (searchable words)]
Inverted index:
Updated ER Diagram
new tables
Inverted index:
Data definition
CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);
CREATE TABLE PostTags (
  PostId INT NOT NULL,
  TagId BIGINT UNSIGNED NOT NULL,  -- same type as Tags.TagId (SERIAL is BIGINT UNSIGNED)
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);
Inverted index:
Indexing
1. Query all Posts.Tags strings, e.g. <mysql><search><performance>
2. Loop over tag strings
3. Dump two CSV files: Tags.csv, PostTags.csv
time: 23.5 seconds
4. Load CSV files with mysqlimport
time: 5.2 seconds
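Steps 2 and 3 can be sketched as follows. This is an illustrative Java version that assumes tag strings look like `<mysql><search><performance>` and that tag ids are simply assigned in order of first appearance; the original script is not shown in the slides, so the class, field and method names are hypothetical. It builds the rows in memory rather than writing files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits tag strings into rows for Tags.csv and PostTags.csv (hypothetical names).
class TagSplitDemo {
    static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    final Map<String, Integer> tagIds = new LinkedHashMap<>();  // tag -> TagId
    final Set<String> postTagRows = new LinkedHashSet<>();      // "PostId,TagId", deduplicated

    // Process one post's tag string.
    void addPost(int postId, String tagString) {
        Matcher m = TAG.matcher(tagString);
        while (m.find()) {
            String tag = m.group(1);
            Integer tagId = tagIds.get(tag);
            if (tagId == null) {
                tagId = tagIds.size() + 1;  // assumed id scheme: order of first appearance
                tagIds.put(tag, tagId);
            }
            postTagRows.add(postId + "," + tagId);
        }
    }

    // Rows for Tags.csv as "TagId,Tag".
    List<String> tagRows() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tagIds.entrySet()) {
            rows.add(e.getValue() + "," + e.getKey());
        }
        return rows;
    }

    public static void main(String[] args) {
        TagSplitDemo demo = new TagSplitDemo();
        demo.addPost(1, "<mysql><search><performance>");
        demo.addPost(2, "<performance><java>");
        System.out.println(demo.tagRows());
        System.out.println(demo.postTagRows);
    }
}
```

Each distinct tag gets one row in Tags.csv, and each (post, tag) pair gets one row in PostTags.csv, matching the two tables defined earlier.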
Inverted index:
Querying
SELECT p.* FROM Posts p JOIN PostTags pt USING (PostId) JOIN Tags t USING (TagId) WHERE t.Tag = 'performance';
time: 250 milliseconds
Inverted Index:
Is it right for you?
Best for searching selected words
Simple, portable, standard SQL
Not as fast as specialized technology, but far better than using LIKE
Search Engine Services
Search engine services:
Google Custom Search Engine
http://www.google.com/cse/
even big web sites use this solution
DEMO
http://www.karwin.com/demo/gcse-demo.html
Search engine services:
Is it right for you?
Your site is public and allows external indexing
Search is a non-critical feature for you
Search results are satisfactory
You need to offload search processing
Comparison: Time to Build Index

LIKE expression    none
MySQL FULLTEXT     15 min
Apache Lucene      6 min 50 sec
Sphinx Search      2 min 27 sec
Inverted index     28 sec
Google / Yahoo!    offline
Comparison: Index Storage

LIKE expression    none
MySQL FULLTEXT     466 MB
Apache Lucene      1323 MB
Sphinx Search      933 MB
Inverted index     48 MB
Google / Yahoo!    offline
Comparison: Query Speed

LIKE expression    22 seconds
MySQL FULLTEXT     50-80 ms
Apache Lucene      120 ms
Sphinx Search      12 ms
Inverted index     250 ms
Google / Yahoo!    *
Comparison: Bottom-Line
                   indexing   storage   query    solution
LIKE expression    none       none      2000x    SQL
MySQL FULLTEXT     32x        10x       6x       RDBMS
Apache Lucene      15x        27x       10x      3rd party
Sphinx Search      5x         20x       1x       3rd party
Inverted index     1x         1x        20x      SQL
Google / Yahoo!    offline    offline   *        Service
Copyright 2009 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build upon this work.
