Search-Enable Your Application with Lucene

The e-commerce Web site that I work on has seen several incarnations of its search feature. We started with plain vanilla SQL using "like" clauses, but this didn't perform well and left a lot to be desired in terms of language features such as stemming (e.g., "paint" = "painter" = "painting") and synonym matching (e.g., "cat" = "feline"). Next we tried an off-the-shelf solution. This addressed our efficiency and language demands, but it was riddled with strange quirks and we were limited in how much we could customize its behavior.

Then we discovered Lucene. Lucene is an open-source search framework from Apache's Jakarta project. As a framework, Lucene provides you with the building blocks you need to build a search engine that meets your specific searching requirements. Lucene is flexible, fully customizable, and amazingly fast.

In this article I show you how to use Lucene to build a search solution for your application. Although my examples are geared toward an e-commerce application, Lucene is flexible enough to be used in any application, whether it's Web, desktop, or CD-ROM based.

I used version 1.2 of Lucene to develop the examples in this article. It can be downloaded from http://jakarta.apache.org/lucene. Lucene is self-contained, so you'll need only a JVM (v1.1.8 or higher) to use it. Place lucene-1.2.jar into your classpath and you're ready to start.

Indexing Documents
To build a Lucene index, first you'll need an instance of IndexWriter. The following lines of code create an IndexWriter for an index located at c:\myindex.

Analyzer analyzer = new StopAnalyzer();
IndexWriter writer = new IndexWriter("c:/myindex", analyzer, true);

The first argument to the constructor is the path where the index will be written. If the path doesn't already exist, Lucene will create it for you. The second argument is the Analyzer you want IndexWriter to use when tokenizing text. Here I used StopAnalyzer to remove stop words ("and," "or," "the," etc.) from the token stream. The last argument tells IndexWriter whether to create a new index or to add documents to an existing one. Passing true to the constructor will create the index from scratch; passing false will append to an existing index.

Now that you have an IndexWriter, you're ready to start adding documents to the index. The following code creates a simple document that represents a Web page and uses IndexWriter to add it to the index.

String url = "http://jakarta.apache.org/lucene";
String content = indexer.retrieveWebPageContent(url);
String keywords = indexer.extractKeywords(content);

Document doc = new Document();
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("keywords", keywords));
doc.add(Field.Text("content", content));
writer.addDocument(doc);

In this example, the document contains the URL metadata for Lucene's homepage, a keywords field that contains search terms to match against in a search, and a "content" field that contains the full content of the Web page.

Once all documents have been added, all that remains is to close the index.

writer.close();

Although this example adds only a single (hard-coded) document to an index, it serves well as a "Hello World" example of how to create indexes using Lucene. The complete source code for this example is in Listing 1. (Listings 1-10 can be downloaded from www.sys-con.com/java/sourcec.cfm.)

For a more interesting example, suppose you're indexing a product catalog to be searched on an e-commerce Web site. A product is made up of a SKU, a name, a price, and some keywords to be searched on (see Listing 2). ProductIndexer (see Listing 3) is a convenience class used to add products to a Lucene index.

The constructor for ProductIndexer takes a string that's the path where the Lucene index will be built and a boolean parameter that specifies whether a new index should be created or an existing one appended to. ProductIndexer uses StopAnalyzer for tokenizing text.

The addProduct() method creates an instance of Document and translates the attributes of the Product into document fields. As in the simple example earlier, the "keywords" field is created as unstored so it can be searched upon but is unavailable for retrieval. The other fields are created as unindexed because these fields will be retrieved only after a successful search, not searched upon themselves.

The close() method closes the IndexWriter, making the index available for searching. Before closing, however, a call is made to the IndexWriter's optimize() method to have Lucene optimize the index. Although it's entirely optional, it's generally a good idea to call optimize() if the indexing is finished for the time being and no further documents will be added to the index for a while.
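
Listing 3 isn't reproduced here, but based on the description above, a minimal sketch of ProductIndexer might look like the following. The Product accessor names (getSku(), getName(), getPrice(), getKeywords()) are assumptions about Listing 2, not code taken from it.

import java.io.IOException;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Minimal sketch of a ProductIndexer-style helper. The Product accessor
// names are assumptions, not taken from the listings.
public class ProductIndexer {
    private IndexWriter writer;

    public ProductIndexer(String indexPath, boolean create) throws IOException {
        // StopAnalyzer strips common stop words while tokenizing.
        writer = new IndexWriter(indexPath, new StopAnalyzer(), create);
    }

    public void addProduct(Product product) throws IOException {
        Document doc = new Document();
        // Retrieval-only fields: stored but not indexed.
        doc.add(Field.UnIndexed("sku", product.getSku()));
        doc.add(Field.UnIndexed("name", product.getName()));
        doc.add(Field.UnIndexed("price", String.valueOf(product.getPrice())));
        // Search-only field: indexed and tokenized, but not stored.
        doc.add(Field.UnStored("keywords", product.getKeywords()));
        writer.addDocument(doc);
    }

    public void close() throws IOException {
        writer.optimize();  // consolidate segments before closing
        writer.close();
    }
}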

ProductDBIndexer (see Listing 4) reads products from a "catalog" table in a relational database (see Table 1 for the products that I used) and uses ProductIndexer to add the products to Lucene's index. ProductDBIndexer takes two command-line arguments: the path in which to build the index and an optional "create" flag to indicate that the index should be built from scratch.
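
Listing 4 is likewise only summarized here. A rough sketch of the same idea follows; the JDBC URL, the catalog table's column names, and the Product constructor are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough sketch of a ProductDBIndexer-style driver. The JDBC URL, the
// column names, and the Product constructor are assumptions.
public class ProductDBIndexer {
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        boolean create = args.length > 1 && "create".equals(args[1]);

        ProductIndexer indexer = new ProductIndexer(indexPath, create);
        Connection conn = DriverManager.getConnection("jdbc:mydb://localhost/shop");
        try {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT sku, name, price, keywords FROM catalog");
            while (rs.next()) {
                Product product = new Product(rs.getString("sku"),
                    rs.getString("name"), rs.getDouble("price"),
                    rs.getString("keywords"));
                indexer.addProduct(product);
            }
            rs.close();
            stmt.close();
        } finally {
            conn.close();
            indexer.close();
        }
    }
}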

Lucene Index Structure
Lucene indexes are file based. If you look in the directory where you created the index, you'll find several files that define the Lucene index. Depending on how large your index is, you'll see several groups of files where each file in a group has the same name but a different extension. Each of these groups is known as a "segment." Although this article won't delve into the details of how Lucene segments work, it may be interesting to note that IndexWriter's optimize() method optimizes Lucene's index by consolidating all segments into a single segment for more efficient searching.

While IndexWriter is writing indexes, a file called "write.lock" is created. This file prevents other instances of IndexWriter from writing to the index concurrently. Calling IndexWriter's close() method removes this file and makes the index available for writing by another IndexWriter.

Lucene keeps track of each segment in the index using a file called "segments". During indexing, it occasionally becomes necessary for Lucene to update the segments file to keep it synchronized with the segments in the index. While this synchronization is going on, Lucene creates a "commit.lock" file to prevent concurrent updates of the segments file. Once the segments file is in sync, the commit.lock file is removed.

What would happen if you were to write to an index while it's being searched on? You may write to the index (either by adding new documents or re-creating the index from scratch) while it's being searched, but doing so may have undesirable effects on the search results. The worst side effect that I've seen is a document appearing out of order in the Hits collection. Depending on how important the ordering is to you, it may be best to create your indexes off-line (i.e., in another directory) and then rename the directory to become the current index.
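
One way to implement the off-line approach is to build the new index in a scratch directory and then swap directories with java.io.File.renameTo(). The directory names below are illustrative, and any open searchers should be reopened after the swap.

// Build the new index in a scratch directory, then swap it into place.
File liveIndex = new File("c:/myindex");
File newIndex = new File("c:/myindex-rebuild");

ProductIndexer indexer = new ProductIndexer(newIndex.getPath(), true);
// ... add documents ...
indexer.close();

// Retire the old directory and move the new one into place.
File retired = new File("c:/myindex-old");
if (liveIndex.exists() && !liveIndex.renameTo(retired)) {
    throw new IOException("could not retire old index");
}
if (!newIndex.renameTo(liveIndex)) {
    throw new IOException("could not activate new index");
}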

Searching
Now that you've built an index, it's time to perform search queries against it. ProductSearcher (see Listing 5) shows how to do this.

To search a Lucene index you need an instance of org.apache.lucene.search.Searcher. Two subclasses of Searcher come with Lucene. IndexSearcher is for searching a single Lucene index while MultiSearcher is used to search multiple indexes at once. Only the product catalog index will be searched, so IndexSearcher is the best choice for this example. It's constructed given the path to the index.

Searcher searcher = new IndexSearcher(indexPath);

Next you must construct a Query object. The best way to do this is to use the parse() method of org.apache.lucene.queryParser.QueryParser. Create an instance of QueryParser, passing the name of the default field (the field that's searched upon by default) and an analyzer to the constructor. Then call parse() on the QueryParser instance, passing the query string. An instance of org.apache.lucene.search.Query will be returned.

QueryParser queryParser = new QueryParser("keywords", new StopAnalyzer());
Query query = queryParser.parse("cat food");

Note: QueryParser is not thread-safe. A new instance of QueryParser should be created for each thread.

For this example the query string is hard-coded as "cat food". This query will match all documents containing either "cat" or "food", but not necessarily both. To require that a document contain both words, simply place a plus (+) sign in front of each one: the query "+cat +food" returns only documents whose keywords field contains both "cat" and "food". More advanced search options will be discussed later.

Next make a call to the Searcher's search() method, passing in the Query object.

Hits hits = searcher.search(query);

The search() method returns an instance of org.apache.lucene.search.Hits. The Hits class represents a collection of documents matching the search criteria, along with each document's relevancy score. These scores range from 0.0 to 1.0 where 1.0 is considered highly relevant and 0.0 is considered completely irrelevant (and not included in the Hits collection).

Finally, cycle through each Document returned in the Hits object displaying the SKU and name of the product along with its relevancy score.

for (int i = 0; i < hits.length(); i++) {
    Document document = hits.doc(i);
    float score = hits.score(i);
    System.out.println(document.get("sku") + " :: " +
        document.get("name") + " :: " + score);
}

Advanced Queries
Up until now, the queries have been relatively simple ones such as "cat food" and "+cat +food". QueryParser has a powerful selection of query operators to facilitate more complex searches. Table 2 lists all of QueryParser's operators.

Wildcard queries are fairly straightforward. The "*" operator matches zero or more characters within a word, while the "?" operator matches exactly one character. For example, "ca*" will match "cat", "car", "cap", or "candle", while "ca?" will match "cat", "car", and "cap", but not "candle". This is consistent with the behavior of "*" and "?" on a DOS or Unix command line.

The tilde (~) character, when used alone, performs a fuzzy search, matching words that are spelled similarly. For example, "cat~" will match "cat", but it will also match "car" and "rat" because these words are similarly spelled.

Surrounding two or more words with quotes (" ") produces a phrase. When words are part of a phrase, they must appear together in order to be considered a match. For example, the query "dog food" (quotes included) will match documents where "dog" is immediately followed by "food".

If a tilde and a number follow a phrase, a proximity search is performed. For example, the query "dog food"~10 will produce results where "dog" and "food" are found within 10 words of each other, but not necessarily adjacent to each other.

The caret (^) is a term booster: any word followed by a caret is given higher relevance than words that aren't boosted. For example, "dog^ kennel" will match documents containing "dog" or "kennel", but will give a higher relevance to documents containing "dog".

The Boolean operators AND, OR, and NOT behave as you would expect. For example, "(cat AND food) OR bird" returns all documents containing both "cat" and "food", along with all documents that contain "bird". "cat NOT food" returns all documents containing "cat" but not containing "food". As you saw in the simple "cat food" example, OR is the default conjunction operator.

As shown in the previous example, parentheses can be used to group terms into subqueries.

As discussed, the plus sign (+) requires that a word or phrase exist in a field. Conversely, the minus sign (-) prohibits a word from appearing in the results and is roughly equivalent to NOT. For example, "dog -food" returns all documents containing "dog" but not containing "food".

Finally, there are times when you may want to search multiple fields. When constructing a QueryParser, you must specify a default field to be searched upon. Unless you specify otherwise, any words in your query will be looked for in the default field. In the examples, "keywords" is the default field. You can search on nondefault fields (assuming that they're indexed) by using a colon (:). For example, had the name field been tokenized and indexed, the query string "+cat +name:nummies" would return all documents in which the keywords field contains "cat" and the name field contains "nummies".
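
To see how several of these operators come together, you can parse a few query strings programmatically. This short example assumes the "keywords" default field used earlier and, for the last query, that the name field has been indexed; parse() throws a ParseException if the syntax is invalid.

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Illustrative only: parses a few operator combinations against the
// "keywords" default field from the earlier examples.
public class QuerySyntaxDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("keywords", new StopAnalyzer());

        Query required = parser.parse("+cat +food");            // both terms required
        Query proximity = parser.parse("\"dog food\"~10");       // within 10 words
        Query multiField = parser.parse("+cat +name:nummies");   // only useful if "name" is indexed

        System.out.println(required);
        System.out.println(proximity);
        System.out.println(multiField);
    }
}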

Customizing Lucene
While Lucene comes with an impressive set of functionality, you may still find that you want it to do something more or different than is available out of the box. As a search framework, Lucene provides several hooks for you to extend and/or modify its behavior.

In the previous examples, the analyzer chosen was StopAnalyzer. Underneath the covers, StopAnalyzer uses LowerCaseTokenizer, a subclass of LetterTokenizer, to tokenize text into individual words. LetterTokenizer treats any nonalphabetic character as a delimiter. This is fine in most cases, but what if you want to tokenize text that contains numeric characters ("0" - "9") as well as alphabetic characters? This would be desirable if the keyword text contains part numbers or model numbers. LetterTokenizer wouldn't help in this case.

Listing 6 defines AlphanumericTokenizer, a tokenizer that works like LetterTokenizer except for one small difference: it treats numeric characters as token characters along with alphabetic characters. It does this by subclassing LetterTokenizer and overriding the isTokenChar() method to return the results of LetterTokenizer's isTokenChar() implementation OR'd with a call to Character.isDigit().
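
Listing 6 isn't reproduced here, but the description above boils down to a sketch like this:

import java.io.Reader;

import org.apache.lucene.analysis.LetterTokenizer;

// Sketch of an AlphanumericTokenizer: letters are token characters,
// and so are digits.
public class AlphanumericTokenizer extends LetterTokenizer {

    public AlphanumericTokenizer(Reader reader) {
        super(reader);
    }

    protected boolean isTokenChar(char c) {
        // Accept anything LetterTokenizer accepts, plus numeric characters.
        return super.isTokenChar(c) || Character.isDigit(c);
    }
}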

AlphanumStopAnalyzer (see Listing 7) is an analyzer that uses AlphanumericTokenizer. The stop-word behavior of StopAnalyzer is still desired, so AlphanumericTokenizer is wrapped with a StopFilter. To normalize the text to lowercase, StopFilter is then wrapped with LowerCaseFilter. AlphanumStopAnalyzer is functionally equivalent to StopAnalyzer except that, since it uses AlphanumericTokenizer, it does not treat numeric characters as delimiters. To try out AlphanumStopAnalyzer, use it in place of StopAnalyzer in both ProductIndexer and ProductSearcher. Be sure to reindex with ProductIndexer before searching the index with the new analyzer.
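
A sketch of an analyzer along these lines follows. The stop-word list here is a small placeholder (Listing 7 presumably uses a fuller one), and the chain below lowercases before removing stop words so that capitalized stop words are caught; the exact filter order in the listing may differ.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch of an AlphanumStopAnalyzer-style analyzer. The stop-word list
// is a placeholder; substitute your own.
public class AlphanumStopAnalyzer extends Analyzer {

    private static final String[] STOP_WORDS = {
        "a", "an", "and", "or", "the"
    };

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new AlphanumericTokenizer(reader);
        stream = new LowerCaseFilter(stream);          // normalize to lowercase
        stream = new StopFilter(stream, STOP_WORDS);   // drop stop words
        return stream;
    }
}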

Suppose that synonym-matching capability is required so that "cat" will match "kitten", "kitty", or "feline". AliasFilter (see Listing 8) is a subclass of TokenFilter that does this. AliasFilter retrieves its synonym list from entries in AliasFilter.properties. For example:

cat=feline kitten kitty
dog=canine puppy mutt
food=feed chow
parrot=bird

With each invocation of next(), AliasFilter first checks to see if there are any synonyms in the alias stack. If there are, it pops the next alias off the stack and returns it. Otherwise, AliasFilter retrieves the next token from the input TokenStream, adds any aliases that may exist to the alias stack, and then returns the next token.
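
A sketch of the mechanics just described follows. For brevity the synonym table is hard-coded rather than loaded from AliasFilter.properties, and the code targets the Lucene 1.2-style TokenFilter, which exposes a protected input field.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch of an AliasFilter: each token is passed through unchanged, and
// any synonyms for it are queued on a stack and emitted on later calls.
public class AliasFilter extends TokenFilter {

    private static final Map ALIASES = new HashMap();
    static {
        ALIASES.put("cat", new String[] {"feline", "kitten", "kitty"});
        ALIASES.put("dog", new String[] {"canine", "puppy", "mutt"});
        ALIASES.put("food", new String[] {"feed", "chow"});
        ALIASES.put("parrot", new String[] {"bird"});
    }

    private Stack aliasStack = new Stack();

    public AliasFilter(TokenStream in) {
        input = in;
    }

    public Token next() throws IOException {
        // Emit any queued synonyms before advancing the underlying stream.
        if (!aliasStack.isEmpty()) {
            return (Token) aliasStack.pop();
        }

        Token token = input.next();
        if (token == null) {
            return null;
        }

        // Queue synonyms at the same offsets as the original token.
        String[] aliases = (String[]) ALIASES.get(token.termText());
        if (aliases != null) {
            for (int i = 0; i < aliases.length; i++) {
                aliasStack.push(new Token(aliases[i],
                    token.startOffset(), token.endOffset()));
            }
        }
        return token;
    }
}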

AliasAnalyzer (see Listing 9) constructs a TokenStream that does everything the TokenStream from AlphanumStopAnalyzer does, but it also uses AliasFilter to add synonyms to the TokenStream. To try AliasAnalyzer, use it as your analyzer instead of StopAnalyzer in both ProductIndexer and ProductSearcher. Again, be sure to reindex before searching.
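
A matching sketch of such an analyzer simply appends AliasFilter to the chain shown earlier (the same stop-word caveat applies):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch of an AliasAnalyzer: the AlphanumStopAnalyzer chain from above,
// with AliasFilter added last so synonyms are injected into the stream.
public class AliasAnalyzer extends Analyzer {

    private static final String[] STOP_WORDS = {
        "a", "an", "and", "or", "the"
    };

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new AlphanumericTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, STOP_WORDS);
        stream = new AliasFilter(stream);
        return stream;
    }
}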

When trying AliasFilter you may discover some strange, albeit desirable, behavior. Search for "feline". Even though there are no aliases for feline, all cat-related products appear in the search results. Why? When you use AliasAnalyzer to search for "feline", the token stream does not expand beyond "feline". So why do "cat" products appear? The reason is, you also used AliasAnalyzer to index the products. When you indexed a product containing "cat", AliasAnalyzer expanded the token stream to include "kitten", "kitty", and "feline" in the index. When searching for "feline" it will be found in products whose token stream was expanded to include "feline". In effect, you get an automatic two-way aliasing between "cat" and "feline", even though it appears to be only one way in AliasFilter.properties.

Another common problem in searching is paging the results. A search query could return anywhere from zero results to a seemingly infinite number of result documents. Good usability practices suggest that you page the results, showing the user only a handful at a time. This can be accomplished in Lucene using result filters.

To create a result filter, you must subclass org.apache.lucene.search.Filter. The only required method is bits(), which returns a java.util.BitSet in which each bit represents a document in the index. If a document's bit is set, that document is eligible to appear in the Hits; otherwise it's filtered out of the results.

PageFilter (see Listing 10) is an example of a Filter that's used to paginate search results. Given a page number and a page size, PageFilter will pare down Lucene's result set to a specific page's subset of documents. It does this by creating a BitSet big enough to hold the maximum number of result bits and then looping through the bits that need to be turned on. To use PageFilter, change ProductSearcher's call to search() to look like this:

Hits hits = searcher.search(query, new PageFilter(1, 20));

This new call to search() will result in showing only the second set of 20 results.
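
Listing 10 isn't reproduced here, but the core of such a filter might look like the sketch below. Zero-based page numbering is an assumption made to match the PageFilter(1, 20) example above; note that the filter restricts results by document ID, as described.

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Sketch of a PageFilter: turns on the bits for one page's worth of
// document IDs. Pages are assumed to be numbered from zero.
public class PageFilter extends Filter {

    private int page;
    private int pageSize;

    public PageFilter(int page, int pageSize) {
        this.page = page;
        this.pageSize = pageSize;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        int start = page * pageSize;
        int end = Math.min(start + pageSize, reader.maxDoc());
        for (int i = start; i < end; i++) {
            bits.set(i);   // allow only documents whose IDs fall on this page
        }
        return bits;
    }
}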

Conclusion
Building a full-featured search engine can be a daunting task. But, thanks to Lucene, many of the complicated details are abstracted behind an easy-to-use API. We've seen how easy it can be to create an index for searching practically any type of information. We've also seen how Lucene is flexible and can be extended to satisfy custom indexing and searching requirements.

Resources

  • Jakarta Lucene: http://jakarta.apache.org/lucene
  • NLucene, the .NET implementation of Lucene at SourceForge: http://sourceforge.net/projects/nlucene
  • JGuru FAQ on Lucene: www.jguru.com/faq/Lucene
  • About Lucene's creator, Doug Cutting: http://lucene.sourceforge.net/background.html

    SIDEBAR
    Index Components
    A Lucene index is a collection of documents organized in a way that allows quick retrieval of information when arbitrarily queried upon.

    Each document (implemented by org.apache.lucene.document.Document) in a Lucene index is made up of one or more fields that are name-value pairs, much like entries in a HashMap. A document can contain as much or as little information as is required to be searched upon. For example, a Lucene document could contain the complete contents of a Web page, text file, e-mail, etc. On the other hand, a Lucene document may contain only a minimal set of metadata, such as keywords, along with a URL, a product SKU, or some other identifying information used to reference a full information source stored outside of Lucene (such as in a file system or a relational database).

    Each field in a document can be defined as being any combination of stored, indexed, and tokenized. If a field is stored, its contents are fully retrievable upon a successful search. If a field is indexed, its content may be referenced in a query and searched upon. If a field is tokenized, its content is broken into one or more tokens (or words) prior to being indexed.

    Fields can be created using org.apache.lucene.document.Field. The Field class has several static factory methods that make short work of creating field entries. Table 3 illustrates these static methods and the types of Fields that they create.

    Why would you want to index a field, but not store it? Consider a field that contains keywords for your document: chances are you'll never display or perform any processing of this field, but you still want to be able to search upon it. By indexing it you're making the field searchable, but by not storing it, you're saving space because the text is not written verbatim to the index. On the other hand, you may want to store some data so that it can be retrieved later without being searchable. In that case, you'd choose a field that's stored but not indexed. When defining your fields, be mindful of what those fields will be used for, and for efficiency's sake choose an appropriate field definition.
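
    A short fragment showing the factory methods in use; the field names and values here are examples only, not taken from the listings.

    Document doc = new Document();

    // Stored and indexed, but not tokenized: good for identifiers and exact values.
    doc.add(Field.Keyword("sku", "9075-101"));

    // Stored only: retrievable after a search, never searched upon.
    doc.add(Field.UnIndexed("price", "7.99"));

    // Indexed and tokenized, but not stored: searchable keywords that save space.
    doc.add(Field.UnStored("keywords", "dog food nutrition"));

    // Stored, indexed, and tokenized: full text that's both searchable and retrievable.
    doc.add(Field.Text("description", "Premium dog food for large breeds"));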

    SIDEBAR
    Search Components
    A Searcher (org.apache.lucene.search.Searcher) is used to access a Lucene index and query its contents. There are two subclasses of Searcher: IndexSearcher that searches a single index and MultiSearcher that searches one or more indexes and collects all the results in a single result set.

    Searches are performed by calling one of Searcher's search() methods and passing it a query (org.apache.lucene.search.Query). The search method returns an instance of org.apache.lucene.search.Hits. The Hits class is an array-like collection of documents that match your query. The documents are ordered in Hits by relevancy score.

    A Query object can be constructed using org.apache.lucene.queryParser.QueryParser. QueryParser's parse() method parses a query string that's written in its query language and builds an appropriate Query object for that query string. QueryParser also uses an Analyzer in performing the parsing of the query string. It's not required, but it is strongly recommended that you use the same Analyzer for parsing queries that you used when indexing your documents.

    SIDEBAR
    Text Analysis Components
    When a field is tokenized, its content is broken into one or more tokens or words. Facilitating this tokenization process is the notion of an analyzer (see Figure 1). An analyzer is any subclass of org.apache.lucene.analysis.Analyzer that defines the rules for tokenization.

    A token stream is an iterator that returns the next token with each call to its next() method or returns a null when there are no more tokens in the stream. Two important subclasses of TokenStream are Tokenizer and TokenFilter. Both of these classes are abstract and must be subclassed to define the specific rules on how to tokenize content.

    At the core of the tokenization process is a Tokenizer. A Tokenizer wraps an instance of java.io.Reader and performs the actual work of breaking a stream into individual tokens (not unlike the notion of a StringTokenizer).

    TokenFilters act as decorators of other TokenStreams. Token filters can be used to add, replace, or remove tokens from a TokenStream. For example, org.apache.lucene.analysis.PorterStemFilter is a TokenFilter that replaces each word in a TokenStream with its word stem (e.g., "painting" becomes "paint").

    Analyzers rely on token streams (subclasses of org.apache.lucene.analysis.TokenStream) in defining the tokenization rules. In fact, an analyzer is nothing more than a factory for creating instances of TokenStream.

    To see how the text analysis components are used together, consider some of the TokenStream and Analyzer implementations packaged with Lucene. StopAnalyzer is an analyzer whose job is to remove stop words (e.g., "and", "or", "the", etc.) from a tokenized stream. At the core of StopAnalyzer is an instance of LowerCaseTokenizer. It tokenizes the stream into individual words, normalizing them to lowercase as it goes, where any nonalphabetic character is considered a delimiter. An instance of StopFilter decorates LowerCaseTokenizer, removing stop words from the stream as they're found. StopAnalyzer's tokenStream() method is merely a factory method that returns the decorator chain made up of LowerCaseTokenizer and StopFilter.
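
    To see a chain like this in action, you can feed an analyzer some text and print the tokens it produces. The sketch below assumes the two-argument tokenStream(String, Reader) form; the sample sentence is arbitrary, and the stop words "the" and "and" should disappear from the output while the remaining tokens come out lowercased.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Prints each token StopAnalyzer produces for a sample string.
    public class AnalyzerDemo {
        public static void main(String[] args) throws IOException {
            StopAnalyzer analyzer = new StopAnalyzer();
            TokenStream stream = analyzer.tokenStream("keywords",
                new StringReader("The Cat and the Dog Food"));

            for (Token token = stream.next(); token != null; token = stream.next()) {
                System.out.println(token.termText());
            }
            stream.close();
        }
    }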
