Lucene FAQ
This is the official Lucene FAQ.
If you have a question about using Java Lucene, please do not add it directly to this FAQ. Join the Java User mailing list and email your question there. Questions should only be added to this Wiki page when they already have an answer that can be added at the same time.
- Lucene FAQ
- General
- Are there any mailing lists available?
- What Java version is required to run Lucene?
- Will Lucene work with my Java application?
- How can I get the latest greatest development code?
- Where can I get the javadocs for the org.apache.lucene classes?
- Where does the name Lucene come from?
- Are there any alternatives to Lucene?
- Does Lucene have a web crawler?
- Why am I getting an IOException that says "Too many open files"?
- When I compile Lucene x.y.z from source, the version number in the jar file name and MANIFEST.MF is different. What's up with that?
- How do I contribute an improvement?
- Why hasn't patch FOO been committed?
- What are the backwards compatibility commitments?
- How do I get code written for Lucene 1.4.x to work with Lucene 2.x?
- Searching
- Why am I getting no hits / incorrect hits?
- Why am I getting a TooManyClauses exception?
- How can I search over multiple fields?
- What wildcard search support is available from Lucene?
- Is the QueryParser thread-safe?
- How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this?
- What is the order of fields returned by Document.fields()?
- How does one determine which documents do not have a certain term?
- How do I get the last document added that has a particular term?
- Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other?
- Is there a way to use a proximity operator (like near or within) with Lucene?
- Are Wildcard, Prefix, and Fuzzy queries case sensitive?
- Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes?
- Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result?
- Can I search an index while it is being optimized?
- Can I cache search results with Lucene?
- Is the IndexSearcher thread-safe?
- Is there a way to retrieve the original term positions during the search?
- How do I retrieve all the values of a particular field that exists within an index, across all documents?
- Can Lucene do a "search within search", so that the second search is constrained by the results of the first query?
- Does the position of the matches in the text affect the scoring?
- How do I make sure that a match in a document title has greater weight than a match in a document body?
- How do I find similar documents?
- Can I filter by score?
- How can I cluster results, i.e. create groups of similar documents?
- How do I implement paging, i.e. showing result from 1-10, 11-20 etc?
- The search is slow when there are many hits.
- Why do I sometimes get a FileNotFoundException when I search and update my index at the same time?
- Indexing
- Can I use Lucene to crawl my site or other sites on the Internet?
- How can I use Lucene to index a database?
- How do I perform a simple indexing of a set of documents?
- How can I add document(s) to the index?
- Where does Lucene store the index it builds?
- Can I store the Lucene index in a relational database?
- Can I store the Lucene index in a BerkeleyDB?
- I get "No tvx file". What does that mean?
- Does Lucene store a full copy of the indexed documents?
- What is the difference between Stored, Tokenized, Indexed, and Vector?
- What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?
- How do I delete documents from the index?
- Is there a way to limit the size of an index?
- Why is it important to use the same analyzer type during indexing and search?
- What is index optimization and when should I use it?
- What are Segments?
- Is Lucene index database platform independent?
- When I recreate an index from scratch, do I have to delete the old index files?
- How can I index and search digits and other non-alphabetic characters?
- Is the IndexWriter class, and especially the method addIndexes(Directory[]) thread safe?
- When is it possible for document IDs to change?
- What is the purpose of write.lock file, when is it used, and by which classes?
- What is the purpose of the commit.lock file, when is it used, and by which classes?
- My program crashed and now I get a "Lock obtain timed out." error. Where is the lock and how can I delete it?
- Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file?
- What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified?
- If I decide not to optimize the index, when will the deleted documents actually get deleted?
- How do I update a document or a set of documents that are already indexed?
- How do I write my own Analyzer?
- How do I index non Latin characters?
- How can I index HTML documents?
- How can I index XML documents?
- How can I index OpenOffice.org files?
- How can I index MS-Word documents?
- How can I index MS-Excel documents?
- How can I index MS-Powerpoint documents?
- How can I index Email (from MS-Exchange or another IMAP server) ?
- How can I index RTF documents?
- How can I index PDF documents?
- How can I index JSP files?
- If I use a compound file-style index, can I still optimize my index?
- What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?
- Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets?
- Why do I have a deletable file (and old segment files remain) after running optimize?
General
Are there any mailing lists available?
There's a user list and a developer list, both available at http://lucene.apache.org/java/docs/mailinglists.html .
What Java version is required to run Lucene?
Lucene 1.4 will run with JDK 1.3 and up but requires at least JDK 1.4 to compile. Lucene >= 1.9 requires Java 1.4.
Will Lucene work with my Java application?
Yes, Lucene is 100% pure Java and has no external dependencies.
How can I get the latest greatest development code?
See SourceRepository
Where can I get the javadocs for the org.apache.lucene classes?
The docs for all the classes are available online at http://lucene.apache.org/java/docs/api/index.html. In addition, they are a part of the standard distribution, and you can always recreate them by running ant javadocs.
Where does the name Lucene come from?
Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.
Are there any alternatives to Lucene?
Besides commercial products, which we don't know much about, there's also Egothor. Also check the list of Lucene implementations.
Does Lucene have a web crawler?
No, but check out Nutch and the list of Open Source Crawlers in Java.
Why am I getting an IOException that says "Too many open files"?
The number of files that can be opened simultaneously is a system-wide limit of your operating system. Lucene can open quite a few files, depending on how you use it, but the problem might also be elsewhere.
- Always make sure that you explicitly close all file handles you open, especially in case of errors. Use a try/catch/finally block to open the files: open them in the try block, close them in the finally block. Remember that Java doesn't have destructors, so don't close file handles in a finalize method -- this method is not guaranteed to be executed.
- Use the compound file format (it's activated by default starting with Lucene 1.4) by calling IndexWriter's setUseCompoundFile(true) (see the sketch after this list).
- Don't set IndexWriter's mergeFactor to large values. Large values speed up indexing but increase the number of files that need to be opened simultaneously.
- If the exception occurs during searching, optimize your index by calling IndexWriter's optimize() method after indexing is finished.
- Make sure you only open one IndexSearcher, and share it among all of the threads that are doing searches -- this is safe, and it will minimize the number of files that are open concurrently.
- Try to increase the number of files that can be opened simultaneously. On Linux using bash this can be done by calling ulimit -n <number>.
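A minimal sketch of enabling the compound format (Lucene 1.4+ era API; the index path is made up):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setUseCompoundFile(true); // pack each segment into a single .cfs file
// ... add documents ...
writer.close();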
When I compile Lucene x.y.z from source, the version number in the jar file name and MANIFEST.MF is different. What's up with that?
This is intentional. Only the jar files produced by the Lucene release manager will have the exact release number. Any other builds will have a different release number in order to help differentiate them from the code produced by the release process. Feel free to adjust.
How do I contribute an improvement?
Please follow all of these steps to submit a Lucene patch.
Why hasn't patch FOO been committed?
Committers use their own discretion to decide which patches are suitable for committing. Generally speaking, committers are encouraged to be conservative about the patches they commit. By committing code into the code base, a committer vouches for the quality of that patch. Any problems that ensue are, to some degree, the responsibility of that committer. If a committer does not feel comfortable making changes to particular sections of the code base, they may wish to consult (or defer to) a more senior committer.
The best way to encourage committers to commit a particular patch is to make it easy to apply. At a minimum it should apply easily to trunk and pass all unit tests. It should confine itself to a single issue: changing as little as possible; adding as little as possible. The patch should include new unit tests which demonstrate the bug the patch fixes (or the new functionality the patch adds). The case is stronger if others report to have successfully applied the patch and found it useful.
If one feels a patch is neglected one should be persistent, polite and patient.
What are the backwards compatibility commitments?
Here are the compatibility commitments.
How do I get code written for Lucene 1.4.x to work with Lucene 2.x?
The upgrade path for Lucene 2.0 was designed around the notion of clear deprecation warnings. Any code designed to use the APIs in Lucene 1.4.x should compile/function with Lucene 1.9 -- however many compile time deprecation warnings will be generated identifying methods that should no longer be used, and what new methods should be used instead.
If you have code that worked with Lucene 1.4.x, and you want to "port" it to Lucene 2.x you should start by downloading the 1.9 release of Lucene, and compile the code against it. Make sure deprecation warnings are turned on in your development environment, and gradually change your code until all deprecation warnings go away (the DateField class is an exception, it has not been removed in Lucene 2.0 yet).
At that point, your code should work fine with Lucene 2.x.
Searching
Why am I getting no hits / incorrect hits?
Some possible causes:
- The desired term is in a field that was not defined as 'indexed'. Re-index the document and make the field indexed.
- The term is in a field that was not tokenized during indexing, so the entire content of the field was treated as a single term. Re-index the documents and make sure the field is tokenized.
- The field specified in the query simply does not exist. You won't get an error message in this case; you'll just get no matches.
- The field specified in the query has the wrong case. Field names are case sensitive.
- The term you are searching for is a stop word that was dropped by the analyzer you use. For example, if your analyzer uses the StopFilter, a search for the word 'the' will always fail (i.e. produce no hits).
- You are using different analyzers (or the same analyzer but with different stop words) for indexing and searching, and as a result the same term is transformed differently during indexing and searching.
- The analyzer you are using is case sensitive (e.g. it does not use the LowerCaseFilter) and the term in the query has different case than the term in the document.
- The documents you are indexing are very large. Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors. See IndexWriter.setMaxFieldLength(int).
- Make sure to open a new IndexSearcher after adding documents. An IndexSearcher will only see the documents that were in the index at the moment it was opened.
- If you are using the QueryParser, it may not be parsing your BooleanQuery syntax the way you think it is.
If none of the possible causes above apply to your case, this will help you to debug the problem:
- Use the Query's toString() method to see how it actually got parsed (see the sketch after this list).
- Use Luke to browse your index.
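A minimal sketch of the toString() check (the field name and query string are made up):

QueryParser parser = new QueryParser("body", new StandardAnalyzer());
Query query = parser.parse("title:lucene AND \"doug cutting\""); // may throw ParseException
System.out.println(query.toString("body")); // shows the query as Lucene actually parsed it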
Why am I getting a TooManyClauses exception?
The following types of queries are expanded by Lucene before it does the search: RangeQuery, PrefixQuery, WildcardQuery, FuzzyQuery. For example, if the indexed documents contain the terms "car" and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place. The number of these terms is limited to 1024 by default. Here are a few different approaches that can be used to avoid the TooManyClauses exception:
- Use a filter to replace the part of the query that causes the exception. For example, a RangeFilter can replace a RangeQuery on date fields, and it will never throw the TooManyClauses exception -- you can even use ConstantScoreRangeQuery to execute your RangeFilter as a Query. Note that filters are slower than queries when used for the first time, so you should cache them using CachingWrapperFilter (see the sketch after this list). Using Filters in place of Queries generated by QueryParser can be achieved by subclassing QueryParser and overriding the appropriate function to return a ConstantScore version of your Query.
- Increase the number of terms using BooleanQuery.setMaxClauseCount(). Note that this will increase the memory requirements for searches that expand to many terms. To deactivate any limits, use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE).
- A specific solution that can work on very precise fields is to reduce the precision of the data in order to reduce the number of terms in the index. For example, the DateField class uses a millisecond resolution, which is often not required. Instead you can save your dates in the "yyyymmddHHMM" format, maybe even without hours and minutes if you don't need them (this was simplified in Lucene 1.9 thanks to the new DateTools class).
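A minimal sketch of the cached-filter approach (Lucene 1.9-era API; the field name and date range are made up, and it assumes dates were indexed as "yyyyMMdd" strings):

Filter dateFilter = new RangeFilter("date", "20060101", "20061231", true, true);
Filter cachedFilter = new CachingWrapperFilter(dateFilter); // reuse this instance across searches
Hits hits = searcher.search(query, cachedFilter);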
How can I search over multiple fields?
Parse your query using MultiFieldQueryParser. Note that terms which occur in short fields have a higher effect on the result ranking. Also, MultiFieldQueryParser builds queries that sometimes behave unexpectedly, namely for AND queries: it requires all terms to appear in all fields. This is not what one typically wants, for example in a search over "title" and "body" fields (Lucene 1.9 fixes this problem).
Alternatively you could create a field which concatenates the content you would like to search and search only that field.
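A minimal sketch of the MultiFieldQueryParser approach (Lucene 1.9+ API; the field names are made up):

String[] fields = {"title", "body"};
MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer());
Query query = parser.parse("apache lucene"); // searches both fields; may throw ParseException
Hits hits = searcher.search(query);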
What wildcard search support is available from Lucene?
Lucene supports wild card queries which allow you to perform searches such as book*, which will find documents containing terms such as book, bookstore, booklet, etc. Lucene refers to this type of query as a 'prefix query'.
Lucene also supports wild card queries which allow you to place a wild card in the middle of the query term. For instance, you could make searches like mi*pelling. That will match both misspelling, which is the correct way to spell this word, as well as mispelling, which is a common spelling mistake.
Another wild card character that you can use is '?', a question mark. The ? will match a single character. This allows you to perform queries such as Bra?il. Such a query will match both Brasil and Brazil. Lucene refers to this type of query as a 'wildcard query'.
Note: Leading wildcards (e.g. *ook) are not supported by the QueryParser (although Lucene could handle them -- see the comment in QueryParser.jj to enable these kind of queries -- search for "OG: to support prefix queries:").
Is the QueryParser thread-safe?
No, it's not.
How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this?
The QueryFilter class is designed precisely for such cases.
Another way of doing it is the following:
Just before calling IndexSearcher.search() add a clause to the query to exclude documents in categories not permitted for this search.
If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.
As for deciding whether to use required or prohibited terms, if possible, you should choose the method that names the less frequent term. That will make queries faster.
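A minimal sketch of the QueryFilter approach (the "acl" field name is made up):

// Only return documents whose "acl" field contains the term "public".
Filter aclFilter = new QueryFilter(new TermQuery(new Term("acl", "public")));
Hits hits = searcher.search(userQuery, aclFilter);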
What is the order of fields returned by Document.fields()?
Fields are returned in the same order they were added to the document.
How does one determine which documents do not have a certain term?
There is no direct way of doing that. You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency for term "x".
Lucene 1.9 added MatchAllDocsQuery to make this easier.
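A minimal sketch using MatchAllDocsQuery (Lucene 1.9+ API; the field and term are made up):

// Matches all documents that do not contain the term "y" in field "f".
BooleanQuery query = new BooleanQuery();
query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("f", "y")), BooleanClause.Occur.MUST_NOT);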
How do I get the last document added that has a particular term?
Call IndexReader.termDocs(Term) on an open reader:
TermDocs td = reader.termDocs(term);
Then iterate with td.next() and keep the last document number the enumeration returns.
Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other?
MultiSearcher searches indices sequentially. Use ParallelMultiSearcher as a searcher that performs multiple searches in parallel. Please note that there's a known bug in Lucene < 1.9 in the MultiSearcher's result ranking.
Is there a way to use a proximity operator (like near or within) with Lucene?
There is a variable called slop in PhraseQuery that allows you to perform NEAR/WITHIN-like queries.
By default, slop is set to 0 so that only exact phrases will match. However, you can alter the value using the setSlop(int) method.
When using QueryParser you can use this syntax to specify the slop: "doug cutting"~2 will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
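A minimal sketch of the same kind of slop query built in code (the field name is made up):

PhraseQuery query = new PhraseQuery();
query.add(new Term("body", "doug"));
query.add(new Term("body", "cutting"));
query.setSlop(2); // the terms may be up to 2 positions apart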
Are Wildcard, Prefix, and Fuzzy queries case sensitive?
No, but unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming.
The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query.
Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes?
According to the Javadoc for IndexReader maxDoc() method "returns one greater than the largest possible document number".
In other words, the number returned by maxDoc() does not necessarily match the actual number of undeleted documents in the index.
Deleted documents do not get removed from the index immediately, unless you call optimize().
Is there a way to get a text summary of an indexed document with Lucene (a.k.a. a "snippet" or "fragment") to display along with the search result?
You need to store the documents' summary in the index (use Field.Store.YES when creating that field) and then use the Highlighter from the contrib area (distributed with Lucene since version 1.9 as "lucene-highlighter-(version).jar"). It's important to use a rewritten query as the input for the highlighter, i.e. call rewrite() on the query. Otherwise simple queries will work but prefix queries etc will not be highlighted.
For Lucene < 1.9, you can also get the "highlighter-dev.jar" from http://www.lucenebook.com/LuceneInAction.zip. See http://www.gossamer-threads.com/lists/lucene/java-user/31595 for a discussion of this.
Can I search an index while it is being optimized?
Yes, an index can be searched and optimized simultaneously.
Can I cache search results with Lucene?
Lucene does come with a simple cache mechanism if you use Lucene Filters. The classes to look at are CachingWrapperFilter and QueryFilter.
Also consider using a JSP tag for caching; see http://www.opensymphony.com/oscache/ for one tag library that's easy to use and works well.
Is the IndexSearcher thread-safe?
Yes, IndexSearcher is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory.
Is there a way to retrieve the original term positions during the search?
Yes, see the Javadoc for IndexReader.termPositions().
How do I retrieve all the values of a particular field that exists within an index, across all documents?
The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.
TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
try
{
    while (terms.term() != null && "FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...
        if (!terms.next())
            break;
    }
}
finally
{
    terms.close();
}
Can Lucene do a "search within search", so that the second search is constrained by the results of the first query?
Yes. There are two primary options:
- Use QueryFilter with the previous query as the filter. (You can search the mailing list archives for QueryFilter and Doug Cutting's recommendations against using it for this purpose.)
- Combine the previous query with the current query using BooleanQuery, using the previous query as required.
The BooleanQuery approach is the recommended one.
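A minimal sketch of the BooleanQuery approach (Lucene 1.9+ API; the variable names are made up):

BooleanQuery combined = new BooleanQuery();
combined.add(previousQuery, BooleanClause.Occur.MUST); // constrains the result set
combined.add(currentQuery, BooleanClause.Occur.MUST);
Hits hits = searcher.search(combined);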
Does the position of the matches in the text affect the scoring?
No, the position of matches within a field does not affect ranking.
How do I make sure that a match in a document title has greater weight than a match in a document body?
If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body. Therefore, even without boosting, title matches usually come before body matches.
How do I find similar documents?
See the org.apache.lucene.search.similar package from the contrib area. It is part of Lucene starting with Lucene 1.9.
Can I filter by score?
Not safely. You can always pick an arbitrary score value and then check the Hits object to see how many results have a score higher than that value (a binary search might come in handy), but it really doesn't give you any meaningful information because of the way scores are calculated:
> Does anyone have an example of limiting results returned based on a
> score threshold? For example if I'm only interested in documents with
> a score > 0.05.
I would not recommend doing this because absolute score values in Lucene
are not meaningful (e.g., scores are not directly comparable across
searches). The ratio of a score to the highest score returned is
meaningful, but there is no absolute calibration for the highest score
returned, at least at present, so there is not a way to determine from
the scores what the quality of the result set is overall.
Here is a more detailed explanation of why this is true
How can I cluster results, i.e. create groups of similar documents?
Check out Carrot, a clustering framework that can be used with Lucene.
How do I implement paging, i.e. showing result from 1-10, 11-20 etc?
Just re-execute the search and ignore the hits you don't want to show. Since people usually look only at the first results, this approach is generally fast enough.
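A minimal sketch (Hits-based API; the page size of 10 is made up):

Hits hits = searcher.search(query);
int start = pageNumber * 10;
int end = Math.min(start + 10, hits.length());
for (int i = start; i < end; i++)
{
    Document doc = hits.doc(i); // fetches the stored fields of hit i
    // ... render the document ...
}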
The search is slow when there are many hits.
Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead (see the sketch below). Secondly, the hits will probably be spread over the disk, so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field, you could also use the FieldCache class to cache that one field and have fast access to it.
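A minimal sketch of the HitCollector approach:

searcher.search(query, new HitCollector()
{
    public void collect(int doc, float score)
    {
        // called once per matching document; keep this method cheap
    }
});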
Why do I sometimes get a FileNotFoundException when I search and update my index at the same time?
This typically happens when people do one or more of the following things:
- Disable the locking on one or more of the processes searching or updating the index.
- Configure a different lockDir for at least one of the processes searching or updating the index.
- Try to search or update an index with the lockDir configured to be on an NFS (or Samba) mounted filesystem.
Even though index searching is a read-only operation, the IndexSearcher must momentarily lock the index when it is opened in order to get the list of files in the index. If locking is not configured properly it gets an incorrect list (because the list of files changes as the IndexWriter adds docs or optimizes the index). Remote filesystems (like NFS and Samba) rarely work, because they cannot make the transactional guarantees necessary to ensure that all clients get consistent views of the directory.
Indexing
Can I use Lucene to crawl my site or other sites on the Internet?
No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching and does that well. However, several crawlers are available which you could use: see the list of Open Source Crawlers in Java. regain is an Open Source tool that crawls web sites, stores them in a Lucene index and offers a search web interface. Also see Nutch for a powerful Lucene-based search engine.
How can I use Lucene to index a database?
Connect to the database using JDBC and use an SQL "SELECT" statement to query the database. Then create one Lucene Document object per row and add it to the index. You will probably want to store the ID column so you can later access the matching items. For other (text) columns it might make more sense to only index (not store) them, as the original data is still available in your database.
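A minimal sketch of this approach (Lucene 1.9/2.x Field API; the JDBC URL, table, and column names are made up):

// Hypothetical table "articles" with columns "id" and "text".
Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT id, text FROM articles");
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
while (rs.next())
{
    Document doc = new Document();
    // Store the ID so the matching row can be looked up after a search.
    doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Index, but don't store, the text -- the original lives in the database.
    doc.add(new Field("text", rs.getString("text"), Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
}
writer.optimize();
writer.close();
rs.close();
stmt.close();
conn.close();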
How do I perform a simple indexing of a set of documents?
The easiest way is to re-index the entire document set periodically or whenever it changes. All you need to do is create an IndexWriter, iterate over your document set, create a Lucene Document object for each document, and add it to the IndexWriter. When you are done, make sure to close the IndexWriter. This will release all of its resources and close the files it created.
How can I add document(s) to the index?
Simply create an IndexWriter and use its addDocument() method. Make sure to create the IndexWriter with the 'create' flag set to false, and make sure to close the IndexWriter when you are done adding documents.
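A minimal sketch (Lucene 1.9/2.x API; the path and field are made up):

// 'create' == false: append to an existing index instead of overwriting it.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
Document doc = new Document();
doc.add(new Field("title", "Hello World", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.close(); // releases resources and the write lock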
Where does Lucene store the index it builds?
Typically, the index is stored in a set of files that Lucene creates in a directory of your choice. If your system uses multiple independent indices, simply create a separate directory for each index.
Lucene's API also provides a way to use or implement other storage methods, such as an in-memory storage (RAMDirectory), or a mapping of Lucene data to any third-party database (not included in Lucene).
Can I store the Lucene index in a relational database?
Lucene does not support that functionality out of the box, but several people have implemented JdbcDirectory's. The reports we have seen so far indicate that performance with such implementations is not great, but it is doable.
Can I store the Lucene index in a BerkeleyDB?
Yes, you can use BerkeleyDB as the Lucene index store. Just use the DbDirectory implementation from Lucene's contrib section.
I get "No tvx file". What does that mean?
It's a "warning" that can safely be ignored. It has been fixed (i.e. the warning has been removed) in Lucene 1.9.
Does Lucene store a full copy of the indexed documents?
It is up to you. You can tell Lucene what document information to use just for indexing and what document information to also store in the index (with or without indexing).
What is the difference between Stored, Tokenized, Indexed, and Vector?
- Stored = as-is value stored in the Lucene index
- Tokenized = field is analyzed using the specified Analyzer - the tokens emitted are indexed
- Indexed = the text (either as-is with keyword fields, or the tokens from tokenized fields) is made searchable (aka inverted)
- Vectored = term frequency per document is stored in the index in an easily retrievable fashion.
What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document?
No, there will be multiple copies of the same document in the index.
How do I delete documents from the index?
If you know the document number of a document (e.g. when iterating over Hits) that you want to delete you may use:
IndexReader.deleteDocument(docNum)
That will delete the document numbered docNum from the index. Once a document is deleted it will not appear in TermDocs or TermPositions enumerations.
Attempts to read its fields with the document method will result in an exception. The presence of this document may still be reflected in the docFreq statistic, though this will be corrected eventually as the index is further modified.
If you want to delete all (one or more) documents that contain a specific term you may use:
IndexReader.deleteDocuments(term)
This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Because a variable number of documents can be affected by this call, the method returns the number of documents deleted.
Starting with Lucene 1.9, the new class IndexModifier also allows deleting documents.
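A minimal sketch of deleting by a unique-ID term (Lucene 1.9/2.x API; the field name is made up):

IndexReader reader = IndexReader.open("/path/to/index");
int deleted = reader.deleteDocuments(new Term("id", "42")); // returns the number deleted
reader.close();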
Is there a way to limit the size of an index?
This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.
This is a slightly modified answer from Doug Cutting:
The easiest thing is to use IndexWriter.setMaxMergeDocs().
If, for instance, you hit the 2GB limit at 8M documents, set maxMergeDocs to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will effectively round this down to the next lower power of the mergeFactor.
So with the default mergeFactor set to 10 and maxMergeDocs set to 7M Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.
A slightly more complex solution:
You could further minimize the number of segments if, once you've added 7M documents, you optimize the index and start a new index. Then use MultiSearcher to search the indexes.
An even more complex and optimal solution:
Write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.
Why is it important to use the same analyzer type during indexing and search?
The analyzer controls how the text is broken into terms which are then used to index the document. If you are using an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms and this will result in missing or false hits.
NOTE: It's not a rule that the same analyzer be used for both indexing and searching, and there are cases where it makes sense to use different ones (e.g. when dealing with synonyms). The analyzers must be compatible, though.
Also be careful with Fields that are not tokenized (like Keywords). During indexing, the Analyzer won't be called for these fields, but for a search, the QueryParser can't know this and will pass all search strings through the selected Analyzer. Usually searches for Keywords are constructed in code, but during development it can be handy to use general-purpose tools (e.g. Luke) to examine your index. Those tools won't know which fields are tokenized either. In the contrib/analyzers area there's a KeywordTokenizer with an example KeywordAnalyzer for cases like this.
What is index optimization and when should I use it?
The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.
What are Segments?
The index database is composed of 'segments' each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see a separate question regarding index optimization).
Is Lucene index database platform independent?
Yes, you can copy a Lucene index directory from one platform to another and it will work just as well.
When I recreate an index from scratch, do I have to delete the old index files?
No, creating the IndexWriter with "true" should remove all old files in the old index (note that with Lucene < 1.9 it removes all files in the index directory, whether or not they belong to Lucene).
How can I index and search digits and other non-alphabetic characters?
The components responsible for this are various Analyzers. Make sure you use the appropriate analyzer. For example, StandardAnalyzer does not remove numbers, but it removes most punctuation.
Is the IndexWriter class, and especially the method addIndexes(Directory[]) thread safe?
Yes, IndexWriter.addIndexes(Directory[]) method is thread safe (it is a synchronized method). IndexWriter in general is thread safe, i.e. you should use the same IndexWriter object from all of your threads. Actually it's impossible to use more than one IndexWriter for the same index directory, as this will lead to an exception trying to create the lock file.
When is it possible for document IDs to change?
Documents are only re-numbered after there have been deletions. Once there have been deletions, renumbering may be triggered by any document addition or index optimization. Once an index is optimized, no renumbering will be performed until more deletions are made.
If you require a persistent document id that survives deletions, then add it as a field to your documents.
What is the purpose of write.lock file, when is it used, and by which classes?
The write.lock is used to keep processes from concurrently attempting to modify an index.
It is obtained by an IndexWriter while it is open, and by an IndexReader once documents have been deleted and until it is closed.
What is the purpose of the commit.lock file, when is it used, and by which classes?
The commit.lock file is used to coordinate the contents of the 'segments' file with the files in the index. It is obtained by an IndexReader before it reads the 'segments' file, which names all of the other files in the index, and until the IndexReader has opened all of these other files.
The commit.lock is also obtained by the IndexWriter when it is about to write the segments file and until it has finished trying to delete obsolete index files.
The commit.lock should thus never be held for long, since while it is obtained files are only opened or deleted, and one small file is read or written.
My program crashed and now I get a "Lock obtain timed out." error. Where is the lock and how can I delete it?
When using FSDirectory, Lock files are kept in the directory specified by the "org.apache.lucene.lockdir" system property if it is set, or by default in the directory specified by the "java.io.tmpdir" system property (on Unix boxes this is usually "/var/tmp" or "/tmp").
If for some strange reason "java.io.tmpdir" is not set, then the directory path you specified to create your index is used.
Lock files have names that start with "lucene-" followed by an MD5 hash of the index directory path.
If you are certain that a lock file is not in use, you can delete it manually. You should also look at the methods IndexReader.isLocked and IndexReader.unlock if you are interested in writing recovery code that can remove locks automatically.
Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file?
All segments in the index are listed in the segments file. There is no hard limit. For an un-optimized index it is proportional to the log of the number of documents in the index. An optimized index contains a single segment.
What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified?
All of the segments are merged into a single new segment file. If the index was empty to begin with, no segments will be created, only the segments file.
If I decide not to optimize the index, when will the deleted documents actually get deleted?
Documents that are deleted are marked as deleted. However, the space they consume in the index does not get reclaimed until the index is optimized. That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.
How do I update a document or a set of documents that are already indexed?
There is no direct update procedure in Lucene. To update an index incrementally you must first delete the documents that were updated, and then re-add them to the index.
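A minimal sketch of the delete-then-re-add pattern using the IndexModifier class added in Lucene 1.9 (the field name and variables are made up):

IndexModifier modifier = new IndexModifier("/path/to/index", analyzer, false);
modifier.deleteDocuments(new Term("id", "42")); // remove the stale version
modifier.addDocument(updatedDoc);               // re-add the updated version
modifier.close();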
How do I write my own Analyzer?
Here is an example:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer extends Analyzer
{
    private static final Analyzer STANDARD = new StandardAnalyzer();

    public TokenStream tokenStream(String field, final Reader reader)
    {
        // do not tokenize field called 'element'
        if ("element".equals(field)) {
            return new CharTokenizer(reader) {
                protected boolean isTokenChar(char c) {
                    return true; // treat the entire field value as one token
                }
            };
        } else {
            // use standard analyzer
            return STANDARD.tokenStream(field, reader);
        }
    }
}
All that being said, most of the heavy lifting in custom analyzers is done by calls to custom subclasses of TokenFilter.
If you want your custom token modification to come after the filters that lucene's StandardAnalyzer class would normally call, do the following:
return new NameFilter(
    new CaseNumberFilter(
        new StopFilter(
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(reader)
                )
            ),
            StopAnalyzer.ENGLISH_STOP_WORDS
        )
    )
);
How do I index non Latin characters?
Lucene only uses Java strings, so you normally do not need to care about this. Just remember that you may need to specify an encoding when you read in external strings from e.g. a file (otherwise the system's default encoding will be used), as shown in the sketch below. If a string has already been decoded with the wrong single-byte encoding, you can sometimes repair it with this hack:
String newStr = new String(someString.getBytes("ISO-8859-1"), "UTF-8");
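A minimal sketch of reading a file with an explicit encoding (the file name and charset are made up):

BufferedReader in = new BufferedReader(
    new InputStreamReader(new FileInputStream("doc.txt"), "UTF-8"));
String line;
while ((line = in.readLine()) != null)
{
    // ... collect the text for your Document fields ...
}
in.close();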
How can I index HTML documents?
In order to index HTML documents you need to first parse them to extract text that you want to index from them. Here are some HTML parsers that can help you with that:
An example that uses JavaCC to parse HTML into Lucene Document objects is provided in the Lucene web application demo that comes with the Lucene distribution.
The CyberNeko HTML Parser lets you parse HTML documents. It's relatively easy to remove most of the tags from an HTML document (or all if you want), and then use the ones you left in to help create metadata for your Lucene document. NekoHTML also provides a DOM model for navigating through the HTML.
JTidy cleans up HTML, and can provide a DOM interface to the HTML files through a Java API.
The author of FURL recommends TagSoup.
How can I index XML documents?
In order to index XML documents you need to first parse them to extract text that you want to index from them. Here are some XML parsers that can help you with that:
See article Parsing, indexing, and searching XML with Digester and Lucene.
How can I index OpenOffice.org files?
These files (.sxw, .sxc, etc) are ZIP archives that contain XML files. Uncompress the file using Java's ZIP support, then parse meta.xml to get title etc. and content.xml to get the document's content. Add these to the Lucene index, typically using one Lucene field per property.
Note that this applies to OpenOffice.org 1.x, things have changed a bit for OpenOffice.org 2.x, but the basic approach is still the same.
You can also use the LIUS framework for indexing OpenOffice documents ( http://www.bibl.ulaval.ca/lius/). LIUS allows metadata and full-text indexing, using XPath.
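A minimal sketch of the ZIP approach for OpenOffice.org 1.x files (the file name is made up):

ZipFile zip = new ZipFile("document.sxw");
ZipEntry entry = zip.getEntry("content.xml"); // the body; meta.xml holds the title etc.
InputStream in = zip.getInputStream(entry);
// ... parse the XML and add the extracted text to a Lucene Document ...
in.close();
zip.close();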
How can I index MS-Word documents?
In order to index Word documents you need to first parse them to extract text that you want to index from them. Here are some Word parsers that can help you with that:
Jakarta Apache POI has an early development level Microsoft Word parser for versions of Word from Office 97, 2000, and XP.
How can I index MS-Excel documents?
In order to index Excel documents you need to first parse them to extract text that you want to index from them. Here are some Excel parsers that can help you with that:
Jakarta Apache POI has an excellent Microsoft Excel parser for versions of Excel from Office 97, 2000, and XP. You can also modify Excel files with this tool.
How can I index MS-Powerpoint documents?
In order to index Powerpoint documents you need to first parse them to extract text that you want to index from them. You can use the Jakarta Apache POI, as it contains a parser for Powerpoint documents.
How can I index Email (from MS-Exchange or another IMAP server) ?
Take a look at:
How can I index RTF documents?
In order to index RTF documents you need to first parse them to extract text that you want to index from them. Lucene In Action contains an example of how to do this using the Swing RTFEditorKit class.
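A minimal sketch using the JDK's RTFEditorKit (the file name is made up; this is not the Lucene in Action code itself):

RTFEditorKit kit = new RTFEditorKit();
javax.swing.text.Document styledDoc = kit.createDefaultDocument();
kit.read(new FileInputStream("doc.rtf"), styledDoc, 0);
String text = styledDoc.getText(0, styledDoc.getLength()); // plain text, ready to index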
How can I index PDF documents?
In order to index PDF documents you need to first parse them to extract text that you want to index from them. Here are some PDF parsers that can help you with that:
PDFBox is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
XPDF is an open source tool that is licensed under the GPL. It's not a Java tool, but there is a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.
Based on xpdf, there is a utility called pdftohtml that can translate PDF files into HTML files. This is also not a Java application.
JPedal is a Java API for extracting text and images from PDF documents.
How can I index JSP files?
To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour (i.e. a web crawler). Once you have such an application, you should be able to point it to the desired JSP, retrieve the contents that the JSP generates, parse it, and feed it to Lucene. See list of Open Source Crawlers in Java.
How to parse the output of the JSP depends on the type of content that the JSP generates. In most cases the content is going to be in HTML format.
Most importantly, do not try to index JSPs by treating them as normal files in your file system. In order to index JSPs properly you need to access them via HTTP, acting like a Web client.
If I use a compound file-style index, can I still optimize my index?
Yes. Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.
What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments?
When merging lots of indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.
The primary advantage of the IndexReader-based method is that one can pass it IndexReaders that don't reside in a Directory.
Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets?
Yes, you can. Lucene is not limited to English, nor any other language. To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing. Lucene's default Analyzers work well for English. There are a number of other Analyzers in Lucene Sandbox, including those for Chinese, Japanese, and Korean.
Why do I have a deletable file (and old segment files remain) after running optimize?
This is normal behavior on Windows whenever you also have readers (IndexReaders or IndexSearchers) open against the index you are optimizing. Lucene tries to remove old segments files once they have been merged (optimized). However, because Windows does not allow removing files that are open for reading, Lucene catches an IOException when deleting these files and then records the pending deletable files into the "deletable" file. On the next segments merge, which happens with explicit optimize() or close() calls and also whenever the IndexWriter flushes its internal RAMDirectory to disk (every IndexWriter.DEFAULT_MAX_BUFFERED_DOCS (default 10) addDocuments), Lucene will try again to delete these files (and additional ones), and any that still fail will be rewritten to the deletable file.