
Lucene: Escaping Special Characters

From Javalobby, by R.J. Lorimer

When integrating Lucene into an application so that it can take user input directly, it is often valuable to use the QueryParser class. This class is a very handy converter from user-readable text to a functional query, perfect for taking user input without a lot of work on your part. If you don't properly handle special characters, however, it will fail with a nasty-gram exception:

 

Was expecting one of:
     "(" ...<QUOTED> ... <TERM> ... 
     <PREFIXTERM> ... <WILDTERM> ...  
     "[" ... "{" ... <NUMBER> ...
       at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1226)
       at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:1109)
       at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:759)
       at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:712)
       at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:122)
  [...]

 

Thankfully, the code needed to fix this isn't all that difficult. There are two scenarios at this point: 1.) you are using Lucene 1.9 or newer, or 2.) you are using Lucene 1.4 or earlier.

If you are using Lucene 1.9 or newer, the task of escaping user input for the query parser is very straightforward:

Lucene 1.9 Escaping

 

String userQuery = // ...
String escaped = QueryParser.escape(userQuery);
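// 'parser' is a QueryParser instance created with your default field and analyzer;
// the one-argument parse(String) is an instance method, not a static one.
Query query = parser.parse(escaped);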
// ...

 

If, however, you are using Lucene 1.4 or earlier, there is no escape convenience utility, so you must write your own. The characters that need to be escaped are: + - ! ( ) { } [ ] ^ " ~ * ? : \

Here is a regex-powered block of code that does this (you could also code this using a StringBuffer, indexOf, and all those goodies if you prefer):

Lucene 1.4 Escaping

 

String userInput = // ...
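// Character class covering every special character listed above, including '[' and '"'.
String escapeChars = "[\\\\+\\-\\!\\(\\)\\:\\^\\[\\]\\{\\}\\~\\*\\?\"]";
String escaped = userInput.replaceAll(escapeChars, "\\\\$0");
// 'parser' is a QueryParser instance created with your default field and analyzer.
Query query = parser.parse(escaped);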
// ...

 

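The 'escapeChars' string is a regex character class listing the characters that should be escaped. In the replaceAll call, $0 stands for whatever character was matched, so the replacement emits that character with a backslash prepended. The "\\\\" in the Java source becomes a single literal backslash once both the Java compiler and the regex replacement syntax have processed it.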
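To see the replacement in action without a Lucene dependency, here is a minimal standalone sketch. The class name RegexEscapeDemo and its escape method are hypothetical names made up for illustration, and the character class is written to cover the full list of special characters given above (including [ and "):

```java
// Hypothetical demo class; not part of Lucene.
final class RegexEscapeDemo {
    // Regex character class for: \ + - ! ( ) : ^ [ ] { } ~ * ? "
    private static final String ESCAPE_CHARS =
            "[\\\\+\\-\\!\\(\\)\\:\\^\\[\\]\\{\\}\\~\\*\\?\"]";

    // Prepends a backslash to every character matched by ESCAPE_CHARS.
    static String escape(String userInput) {
        // "\\\\$0" is: one literal backslash, followed by the matched text ($0).
        return userInput.replaceAll(ESCAPE_CHARS, "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2"));
    }
}
```

With this sketch, the input (1+1):2 comes back as \(1\+1\)\:2, while plain words pass through unchanged.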

Now, I always hate those articles on the web that do some hand-waving and over-simplification to explain how easy something is, but don't explain the consequences. In this case, using regular expressions the way I have here carries some (most likely unnecessary) overhead, and to compress the code into a digestible format I have used some less-than-best practices. If you are going to be escaping this text frequently, I'd recommend you compile the pattern ahead of time and use some constants:

Lucene 1.4 Escaping (More Complete)

 

// Some constants.
// Character class covering every special character listed earlier, including '[' and '"'.
private static final String LUCENE_ESCAPE_CHARS = "[\\\\+\\-\\!\\(\\)\\:\\^\\[\\]\\{\\}\\~\\*\\?\"]";
private static final Pattern LUCENE_PATTERN = Pattern.compile(LUCENE_ESCAPE_CHARS);
private static final String REPLACEMENT_STRING = "\\\\$0";
 
// ... Then, in your code somewhere...
String userInput = // ...
String escaped = LUCENE_PATTERN.matcher(userInput).replaceAll(REPLACEMENT_STRING);
// 'parser' is a QueryParser instance created with your default field and analyzer.
Query query = parser.parse(escaped);
// ...
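
If the regex overhead matters, the regex-free route mentioned earlier (a string buffer plus indexOf) might look roughly like the sketch below. ManualLuceneEscaper and SPECIAL_CHARS are hypothetical names, and StringBuilder assumes Java 5 or newer; on older JVMs StringBuffer can be swapped in unchanged.

```java
// Hypothetical helper; not part of Lucene.
final class ManualLuceneEscaper {
    // The special characters listed earlier: \ + - ! ( ) { } [ ] ^ " ~ * ? :
    private static final String SPECIAL_CHARS = "\\+-!(){}[]^\"~*?:";

    // Copies the input, inserting a backslash before each special character.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL_CHARS.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

This makes a single pass over the input and avoids the regex engine entirely, at the cost of a few more lines of code.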