Integrate advanced search functionalities into your apps
Implement powerful multi-criteria searches and filters with Lucene
By John Ferguson Smart, JavaWorld.com, 09/25/06
As a full-text search engine, Lucene needs little introduction. Lucene, an open source project hosted by Apache, aims to produce high-performance full-text indexing and search software. The Java Lucene product itself is a high-performance, high-capacity, full-text search tool used by popular Websites such as the Wikipedia online encyclopedia and TheServerSide.com, as well as in a great many Java applications. It is a fast, reliable tool that has proved its value in countless demanding production environments.
Although Lucene is well known for its full-text indexing, many developers are less aware that it can also provide powerful complementary searching, filtering, and sorting functionalities. Indeed, many searches involve combining full-text searches with filters on different fields or criteria. For example, you may want to search a database of books or articles using a full-text search, but with the option of limiting the results to certain types of books. Traditionally, this type of criteria-based searching is in the realm of the relational database. However, Lucene offers numerous powerful features that let you efficiently combine full-text searches with criteria-based searches and sorts.
Indexing
The first step in any Lucene application involves indexing your data. Lucene needs to create its own set of indexes, using your data, so it can perform high-performance full-text searching, filtering, and sorting operations on your data.
This is a fairly straightforward process. First of all, you need to create an IndexWriter object, which you use to create the Lucene index and write it to disk. Lucene is very flexible, and there are many options. Here, we will limit ourselves to creating a simple index structure in the "index" directory:
Directory directory = FSDirectory.getDirectory("index", true);
Analyzer analyser = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyser, true);
Next, you need to index your data records. Each of your records needs to be indexed individually. When you index records in Lucene, you create a Document object for each record. For full-text indexing to work, you need to give Lucene some data that it can index. The simplest option is to write a method that writes a full-text description of your record (including everything you may wish to search on) and use this value as a searchable field. Here, we call this field "description."
You index a field by adding a new instance of the Field class to your document, as shown here:
Field field = new Field("field",
        value,
        Field.Store.NO,
        Field.Index.TOKENIZED);
doc.add(field);
You have the option of specifying whether you want to store the value for future use (Field.Store.YES) or simply index it (Field.Store.NO). The latter option is useful for large values that you want to index, but do not need to retrieve later on.
The fourth parameter lets you indicate how you want to index the value. When you use Field.Index.TOKENIZED, the value will be analyzed, allowing Lucene to make better use of its powerful full-text indexing and search features. The downside, as we will see, is that you cannot sort results on tokenized fields.
Field.Index.UN_TOKENIZED is useful if you want to index a field without analyzing it first. If you simply wish to store the value for future use (for example, an internal identifier) without indexing it at all, you can use Field.Index.NO.
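To make the difference between the two indexing modes concrete, here is a minimal plain-Java sketch. It does not use Lucene: the FieldTermsSketch class and its crude lowercase-and-split tokenizer are assumptions made purely for illustration, and StandardAnalyzer applies considerably more elaborate rules. The point is only that a tokenized value becomes several searchable terms, whereas an untokenized value is kept as one single term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative only: this class is hypothetical and is not part of Lucene.
class FieldTermsSketch {

    // Rough stand-in for TOKENIZED: lowercase the value and split it on
    // anything that is not a letter or digit.
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Rough stand-in for UN_TOKENIZED: the whole value is kept as one term.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Lucene in Action";
        System.out.println(tokenized(title));   // [lucene, in, action]
        System.out.println(untokenized(title)); // [Lucene in Action]
    }
}
```

A field that holds exactly one term per document gives each document a single, unambiguous position in a sort order, which is why sorting is tied to untokenized fields.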
The following code illustrates how you might index a list of items from a library catalog:
List<Item> items = Catalog.getAllItems();
for (Item item : items) {
    Document doc = new Document();
    String description = item.getTitle()
            + " " + item.getAuthors()
            + " " + item.getSummary()
            ...;
    doc.add(new Field("description",
            description,
            Field.Store.NO,
            Field.Index.TOKENIZED));
    ...
}
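One practical detail when building the description with plain string concatenation: a null value is appended as the literal text "null", which would then be indexed. The following sketch shows one possible way around this. The DescriptionBuilder class and its join() method are hypothetical names introduced here for illustration; they are not part of Lucene or of the catalog code above.

```java
// Illustrative helper; the class and method names are hypothetical.
class DescriptionBuilder {

    // Joins the non-blank parts with single spaces, skipping nulls so the
    // literal text "null" never ends up in the indexed description.
    static String join(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (part == null || part.trim().isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(part.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An item with no summary: "+" concatenation would append "null" here.
        System.out.println(join("A Title", "An Author", null)); // A Title An Author
    }
}
```

The resulting string could then be passed as the value of the "description" field in place of the concatenated expression.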
Multi-criteria indexing
The above approach works well for full-text searching, but sometimes you also need to allow more precise searches on particular fields.
Searchable fields should be tokenized, but they do not need to be stored (unless you want to obtain the field values directly from the Lucene document). Imagine that you need to create a full-text index on a library catalog. The catalog contains many thousands of items such as books, articles, newspapers, video, and sound recordings. The following code illustrates how to add searchable fields for the title, ISBN, and type of a particular library item (in this case, a book):
doc.add(new Field("title",
        item.getTitle(),
        Field.Store.NO,
        Field.Index.TOKENIZED));
doc.add(new Field("isbn",
        item.getISBNNumber(),
        Field.Store.NO,
        Field.Index.TOKENIZED));
doc.add(new Field("type",
        Item.BOOK,
        Field.Store.NO,
        Field.Index.TOKENIZED));
writer.addDocument(doc);
...
writer.close();
Sortable fields
Often you will need to display your search results in a table and let users sort the results by column. This can be done in Lucene, but there is one gotcha: your field must be UN_TOKENIZED. This means you cannot sort on a tokenized, searchable field; you need to add a second, untokenized field under a different name. One way is to prefix the field names in some understandable way, as shown here:
// Sortable index on the title field
doc.add(new Field("sort-on-title",
        book.getTitle(),
        Field.Store.YES,
        Field.Index.UN_TOKENIZED));
// Sortable index on the ISBN number field
doc.add(new Field("sort-on-isbn",
        book.getISBNNumber(),
        Field.Store.YES,
        Field.Index.UN_TOKENIZED));
Full-text searches
Full-text searching in Lucene is relatively easy. A typical Lucene full-text search is shown here:
Searcher is = indexer.getIndexSearcher();
QueryParser parser = indexer.getQueryParser("description");
Query query = parser.parse("Some full-text search terms");
Hits hits = is.search(query);
Here, we use the indexer to perform a full-text search on the description field. Lucene returns a Hits object, which we can use to obtain the matching documents, as shown here:
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // get() returns the stored value, so "title" must have been
    // indexed with Field.Store.YES to be retrievable here
    String title = doc.get("title");
    System.out.println(title);
}
Multi-criteria searches
Extending this code to implement multi-criteria searches requires a bit more work. The key class we use here is the Filter class, which, as the name indicates, lets you filter search results.
The Filter class is actually an abstract class. There are several types of filter classes that let you define precise filtering operations.
The QueryFilter class lets you filter search results based on a Lucene query expression. Here, we build a filter, limiting search results to books, using the type field:
Query booksQuery = new TermQuery(new Term("type",Item.BOOK));
Filter typeFilter = new QueryFilter(booksQuery);
The RangeFilter lets you limit search results to a range of values. The following filter limits search results to items dated between 1990 and 1999 inclusive, using the year field (the last two Boolean parameters indicate whether the lower and upper limit values are inclusive):
Filter rangeFilter = new RangeFilter("year", "1990", "1999", true, true);
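Note that the range limits are compared as strings, not as numbers. The 1990 to 1999 example works because every year has four digits. For numeric values of varying width, one common approach is to pad the values with leading zeros before indexing them, so that text order matches numeric order. The sketch below is plain Java and independent of Lucene; the PaddedNumbers class, the pad() method, and the chosen width are assumptions for illustration.

```java
// Illustrative only: the class name, method name, and width are assumptions.
class PaddedNumbers {

    // Left-pads a non-negative number with zeros to a fixed width so that
    // string order matches numeric order.
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled by this sketch");
        }
        String digits = Integer.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Unpadded, "95" sorts after "1990" as text, although 95 < 1990.
        System.out.println("95".compareTo("1990") > 0);             // true
        // Padded to the same width, text order agrees with numeric order.
        System.out.println(pad(95, 4).compareTo(pad(1990, 4)) < 0); // true
    }
}
```

The same padded form would need to be used both for the indexed field value and for the limits passed to the filter.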
The ChainedFilter lets you combine other filters using logical operators such as AND, OR, XOR, or ANDNOT. In the following example, we limit search results to only the documents matching both of the above conditions:
List<Filter> filters = new ArrayList<Filter>();
filters.add(typeFilter);
filters.add(rangeFilter);
// ChainedFilter expects an array of filters rather than a List
Filter filter = new ChainedFilter(filters.toArray(new Filter[filters.size()]), ChainedFilter.AND);
You can either apply the same operator to all filters or provide an array of operators, which lets you provide different operators to be used between each filter.
You should think carefully about the operator you use for multi-criteria searches. For example, in a typical multi-criteria search, you may let users select the types of documents they want using checkboxes (books, articles, videos, etc.). Filters coming from checkbox values like these typically need to be combined using an OR expression.
On the other hand, a hotel reservation Website might provide criteria such as the number of rooms, category, or location of the hotel. These are restrictive criteria, which would need to be combined with an AND expression.
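One way to picture the difference between the operators: in this generation of Lucene, a filter ultimately yields a java.util.BitSet with one bit per document, so combining filters amounts to combining bit sets. The following sketch is plain Java rather than Lucene code; the FilterCombinationSketch class and the document numbers are made up for illustration.

```java
import java.util.BitSet;

// Illustrative only: this class is hypothetical and is not part of Lucene.
class FilterCombinationSketch {

    // Builds a bit set in which each given document number is switched on.
    static BitSet bitsOf(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // OR: a document passes if either filter accepts it (checkbox-style criteria).
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    // AND: a document passes only if both filters accept it (restrictive criteria).
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet books = bitsOf(0, 2, 5);
        BitSet articles = bitsOf(1, 3);
        BitSet nineties = bitsOf(2, 3, 4);
        System.out.println(or(books, articles));  // {0, 1, 2, 3, 5}
        System.out.println(and(books, nineties)); // {2}
    }
}
```

Combining the "books" and "articles" sets with AND would leave no documents at all, since an item has only one type. That is why checkbox-style criteria call for OR.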
Here is an (almost) complete example, using all the features we have discussed above:
public List<CatalogItem> search(String expression,
                                boolean displayBooks,
                                boolean displayArticles,
                                boolean displayVideo)
        throws IOException, ParseException {
    List<Filter> filters = new ArrayList<Filter>();
    //
    // Display books
    //
    if (displayBooks) {
        Query booksQuery = new TermQuery(new Term("type", Item.BOOK));
        filters.add(new QueryFilter(booksQuery));
    }
    //
    // Display articles
    //
    if (displayArticles) {
        Query articlesQuery = new TermQuery(new Term("type", Item.ARTICLE));
        filters.add(new QueryFilter(articlesQuery));
    }
    //
    // Display video recordings
    //
    if (displayVideo) {
        Query videoQuery = new TermQuery(new Term("type", Item.VIDEO));
        filters.add(new QueryFilter(videoQuery));
    }
    // ChainedFilter expects an array of filters rather than a List
    Filter filter = new ChainedFilter(filters.toArray(new Filter[filters.size()]), ChainedFilter.OR);
    QueryParser parser = indexer.getQueryParser("description");
    Query query = parser.parse(expression);
    Hits hits = is.search(query, filter);
    ...
}
Sorting results
Sorting search results is a common user requirement in Web applications. Many modern component-based Web frameworks like JavaServer Faces and Tapestry have table components that let users perform sorts on each column, as do more traditional Model-View-Controller frameworks such as Struts. It is possible to sort search results in memory once they have been returned; however, this approach is wasteful and inefficient. In Lucene, as in traditional relational database applications, it is far more efficient to perform sorting operations at the source.
As we saw previously, Lucene lets you build fields specifically designed for sorting results. Sorting operations should be performed only on these untokenized fields, much as you would avoid sorting on unindexed columns in a relational database.
To use these fields, you use the Sort class. The simplest way to use this class is to create a new instance, providing the name of the field on which you want to sort. Then you pass this Sort instance to the search() method, as shown here:
Sort sort = new Sort("name");
hits = is.search(query, filter, sort);
Going beyond this simple example, Lucene provides you with a wide palette of sorting functionalities. You can sort in reverse order by simply specifying a Boolean flag with the column name. Here, we sort by name in descending order:
Sort sort = new Sort("name", true);
Or you can sort on several columns by providing an array of column names:
String[] sortOrder = {"lastName","firstName"};
Sort sort = new Sort(sortOrder);
If you need to use different sort orders on each field, use the SortField class. Here, we sort by last name in ascending order, then by date of birth in descending order:
SortField[] sortOrder = {new SortField("lastName"), new SortField("dateOfBirth", true)};
Sort sort = new Sort(sortOrder);
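When the sort column comes from a user clicking on a table header, it can help to translate the column name into the name of the dedicated sort field, following the "sort-on-" prefix convention suggested in the section on sortable fields. The sketch below is plain Java and makes no Lucene calls; the SortFieldNames class and its list of sortable columns are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: the class name and the list of columns are assumptions.
class SortFieldNames {

    // Columns for which an untokenized "sort-on-..." field is assumed to exist.
    private static final Set<String> SORTABLE =
            new HashSet<String>(Arrays.asList("title", "isbn"));

    // Returns the name of the sort field for a column,
    // or null if no sort field was indexed for that column.
    static String forColumn(String column) {
        if (column == null || !SORTABLE.contains(column)) {
            return null;
        }
        return "sort-on-" + column;
    }

    public static void main(String[] args) {
        System.out.println(forColumn("title"));   // sort-on-title
        System.out.println(forColumn("summary")); // null
    }
}
```

The returned name is what you would then hand to the Sort or SortField constructors shown above. A null result signals that the request should fall back to the default ordering.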
Conclusion
The Lucene API is powerful, flexible, and easy to use. Lucene provides not only exceptional full-text searching capabilities, but also all the complementary filtering and sorting features you need to build a high-performance, feature-rich, multi-criteria full-text search into your application.
Author Bio
John Ferguson Smart has been involved in the IT industry since 1991, and in Java EE development since 1999. His specialties are Java EE architecture and development, and IT project management, including offshore project management. He has wide experience in open source Java technologies. He has worked on many large-scale Java EE projects for government and business in both hemispheres, involving international and offshore teams, and also writes technical articles in the Java EE field. His technical blog can be found at http://www.jroller.com/page/wakaleo.
- The Lucene site
http://lucene.apache.org
- For more articles on Lucene, browse these JavaWorld articles:
- "The Lucene Search Engine: Powerful, Flexible, and Free," Brian Goetz (September 2000)
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html
- "Use Search Engine Technology for Object Persistence," Mikhail Garber (January 2005)
http://www.javaworld.com/javaworld/jw-01-2005/jw-0103-search.html
- Wikipedia online encyclopedia
http://www.wikipedia.org
- TheServerSide.com
http://www.theserverside.com
- Browse the Development Tools section of JavaWorld's Topical Index
http://www.javaworld.com/channel_content/jw-tools-index.shtml