[ Lucene Extensions ] How spellChecker Works - MR-fox - 博客园

spellChecker corrects the search text a user types in. For example, searching Baidu for "麻辣将" brings up a suggestion, as shown in the figure below:

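We will first use Lucene to build a simple version of this feature.

This article covers three things: a simple implementation, a brief look at the principle, and the current problems.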

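A brief look at spellchecker in Lucene

Lucene's extension packages include spellchecker, which makes it easy to implement spell checking. The quality of the results (how accurate the suggestions are), however, is left to the developer to tune and optimize.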

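Steps for implementing spell checking with Lucene

Step 1: Build the index files spellchecker needs

spellchecker is itself built on a Lucene index; it simply uses its own special tokenization and relevance scoring.

The index can be built from a plain text file with one phrase per line, much like a dictionary.

For example (dic.txt):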

    
麻辣烫
中文测试
麻辣酱
麻辣火锅
中国人
中华人民共和国
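
To make the "one phrase per line" format concrete, here is a minimal sketch of how the text of such a file could be turned into a list of entries. It is not PlainTextDictionary's actual implementation; the class and method names are made up for illustration, and stripping a byte order mark (BOM) is a precaution we assume is useful, since an invisible leading character in the file would otherwise end up inside the first word.

```java
import java.util.ArrayList;
import java.util.List;

public class DictionaryLinesSketch {

    /** Splits file content into entries: one per line, blank lines skipped. */
    static List<String> parseEntries(String fileContent) {
        List<String> entries = new ArrayList<>();
        for (String line : fileContent.split("\\r?\\n")) {
            // Assumption: drop a BOM so it cannot become part of a word.
            String entry = line.replace("\uFEFF", "").trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String dic = "\uFEFF麻辣烫\n中文测试\n\n麻辣酱\n";
        for (String entry : parseEntries(dic)) {
            System.out.println(entry);
        }
    }
}
```

Each entry returned here corresponds to one line of dic.txt, which is also the unit that spellchecker indexes.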

        

The key code for building the spellchecker index is as follows:

    
/**
 * Creates the index used by spellchecker from a dictionary file.
 * 
 * @param spellIndexPath
 *            path of the spellchecker index
 * @param idcFilePath
 *            path of the source dictionary file
 * @throws IOException
 */
public void createSpellIndex(String spellIndexPath, String idcFilePath)
        throws IOException {
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
            null);
    spellChecker.indexDictionary(new PlainTextDictionary(new File(
            idcFilePath)), config, false);
    // close the spell checker first, then the directory it reads from
    spellChecker.close();
    spellIndexDir.close();
}

        

This uses a PlainTextDictionary object, which implements the Dictionary interface. The class structure is shown in the figure below:

Besides PlainTextDictionary (1 word per line), we can also use:

FileDictionary (1 string per line, optionally with a tab-separated integer value; the parts of a line are separated by tabs)
LuceneDictionary (terms taken from the given field of a Lucene index; the spellchecker index is built from the terms of an existing index)
HighFrequencyDictionary (terms taken from the given field of a Lucene index, which appear in a number of documents above a given threshold; it adds a restriction on top of LuceneDictionary, so that spellchecker only uses a term when it appears in enough documents)
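
The threshold idea behind HighFrequencyDictionary can be sketched without Lucene. The code below is only an illustration under assumptions: document frequencies are passed in as a plain map, the threshold is treated as a fraction of the total number of documents, and all names are hypothetical. It does not claim to be how Lucene implements the class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FrequencyFilterSketch {

    /** Keeps terms that occur in at least threshold * numDocs documents. */
    static List<String> frequentTerms(Map<String, Integer> docFreq,
            int numDocs, float threshold) {
        int minDocs = (int) (threshold * numDocs);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDocs) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for an index of 10 documents.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("麻辣烫", 8);
        docFreq.put("麻辣酱", 1);
        docFreq.put("中国人", 5);
        System.out.println(frequentTerms(docFreq, 10, 0.5f));
    }
}
```

With a threshold of 0.5 and 10 documents, only terms found in 5 or more documents survive, so rare terms (which are more likely to be typos themselves) never become suggestions.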

For example, if we use LuceneDictionary, the main code looks like this:

    
/**
 * Creates the index used by spellchecker from the terms of an existing index.
 * 
 * @param oriIndexPath
 *            path of the original index
 * @param fieldName
 *            index field (the terms of this field form the dictionary)
 * @param spellIndexPath
 *            path of the spellchecker index
 * @throws IOException
 */
public void createSpellIndex(String oriIndexPath, String fieldName,
        String spellIndexPath) throws IOException {
    IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(
            oriIndexPath)));
    LuceneDictionary dict = new LuceneDictionary(oriIndex, fieldName);
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
            null);
    spellChecker.indexDictionary(dict, config, true);
    // close (the original listing left these resources open)
    spellChecker.close();
    spellIndexDir.close();
    oriIndex.close();
}

        

After indexing dic.txt, we can look inside the index at its documents and terms:

    
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣烫>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中文测试>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣酱>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣火锅>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中国人>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中华人民共和国>>
end1:人  end1:烫  end1:试  end1:酱  end1:锅
end2:国人  end2:测试  end2:火锅  end2:辣烫  end2:辣酱
end3:共和国
end4:民共和国
gram1:中  gram1:人  gram1:国  gram1:文  gram1:测  gram1:火  gram1:烫  gram1:试  gram1:辣  gram1:酱  gram1:锅  gram1:麻  gram1:
gram2:中国  gram2:中文  gram2:国人  gram2:文测  gram2:测试  gram2:火锅  gram2:辣火  gram2:辣烫  gram2:辣酱  gram2:麻辣  gram2:麻
gram3:中华人  gram3:人民共  gram3:共和国  gram3:华人民  gram3:民共和
gram4:中华人民  gram4:人民共和  gram4:华人民共  gram4:民共和国
start1:中  start1:麻  start1:
start2:中国  start2:中文  start2:麻辣  start2:麻
start3:中华人
start4:中华人民
word:中华人民共和国  word:中国人  word:中文测试  word:麻辣火锅  word:麻辣酱  word:麻辣烫

        

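As we can see, each phrase (each line of dic.txt) is treated as one document and is then split up using a special tokenization scheme. The field names look rather odd: end1, end2, gram1, gram2 and so on.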
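
The pattern in the dump can be reproduced with a few lines of plain Java. The sketch below is an assumption-laden illustration rather than Lucene's own code: the class and method names are hypothetical, and the rule for choosing n-gram sizes (1 to 2 for short words, 2 to 3 for five characters, 3 to 4 for longer ones) is simply the rule that matches the terms listed above.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramFieldsSketch {

    // Assumed size rule, chosen to match the index dump for dic.txt.
    static int minGram(int len) { return len > 5 ? 3 : (len == 5 ? 2 : 1); }
    static int maxGram(int len) { return len > 5 ? 4 : (len == 5 ? 3 : 2); }

    /** Lists the "field:term" pairs one dictionary word would produce. */
    static List<String> fields(String word) {
        List<String> out = new ArrayList<>();
        out.add("word:" + word);
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            String last = null;
            for (int i = 0; i + n <= len; i++) {
                String gram = word.substring(i, i + n);
                out.add("gram" + n + ":" + gram);      // every n-gram
                if (i == 0) {
                    out.add("start" + n + ":" + gram); // n-gram at the start
                }
                last = gram;
            }
            if (last != null) {
                out.add("end" + n + ":" + last);       // n-gram at the end
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String field : fields("麻辣烫")) {
            System.out.println(field);
        }
    }
}
```

For "麻辣烫" this yields the word itself, the 1-grams and 2-grams, plus start1, start2, end1 and end2 entries, which are the same kinds of terms that appear in the dump.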

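Why is it done this way, and what is the principle behind it? Let us leave that question open for now and come back to it after looking at the results!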

Step 2: spellchecker's suggestions

Using the index created in step 1, we call spellChecker.suggestSimilar to perform the spell check. The complete code is as follows:

package com.fox.lab;
 
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
 
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
 
/**
 * @author huangfox
 * @createDate 2012-2-16
 * @eMail [email protected]
 */
public class DidYouMeanSearcher {
    SpellChecker spellChecker = null;
    LuceneDictionary dict = null;
 
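    /**
     * 
     * @param spellCheckIndexPath
     *            location of the spellChecker index
     * @param oriIndexPath
     *            location of the original index
     * @param fieldName
     *            field of the original index whose terms form the dictionary
     */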
    public DidYouMeanSearcher(String spellCheckIndexPath, String oriIndexPath,
            String fieldName) {
        Directory directory;
        try {
            directory = FSDirectory.open(new File(spellCheckIndexPath));
            spellChecker = new SpellChecker(directory);
            IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(
                    oriIndexPath)));
            dict = new LuceneDictionary(oriIndex, fieldName);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    /**
     * Sets the accuracy; the default is 0.5.
     * 
     * @param v
     */
    public void setAccuracy(float v) {
        spellChecker.setAccuracy(v);
    }
 
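    /**
     * Runs a spell check on the query string.
     * 
     * @param queryString
     *            the query string
     * @param suggestionsNumber
     *            maximum number of suggestions
     * @return the suggestions, or null if the lookup failed
     */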
    public String[] search(String queryString, int suggestionsNumber) {
        String[] suggestions = null;
        try {
            // if (exist(queryString))
            // return null;
            suggestions = spellChecker.suggestSimilar(queryString,
                    suggestionsNumber);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return suggestions;
    }
 
    private boolean exist(String queryString) {
        Iterator<String> ite = dict.getWordsIterator();
        while (ite.hasNext()) {
            if (ite.next().equals(queryString))
                return true;
        }
        return false;
    }
}
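
To get a feel for what the accuracy setting does, here is a brute-force sketch in plain Java. It assumes that a suggestion is scored as 1 minus the edit distance divided by the longer word's length, and that candidates scoring below the accuracy are dropped, as are candidates identical to the query. The real spellChecker first narrows the candidates through its n-gram index, and its ranking may differ, so treat this only as an approximation; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SuggestSketch {

    /** Classic Levenshtein edit distance (insert, delete, substitute). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1 means the two strings are identical. */
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    }

    /** Returns dictionary words similar enough to the query, best first. */
    static List<String> suggest(String query, List<String> words, float accuracy) {
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (w.equals(query)) {
                continue; // an identical word is not a correction
            }
            if (similarity(query, w) >= accuracy) {
                hits.add(w);
            }
        }
        hits.sort((x, y) -> Float.compare(similarity(query, y), similarity(query, x)));
        return hits;
    }

    public static void main(String[] args) {
        List<String> dic = Arrays.asList("麻辣烫", "中文测试", "麻辣酱",
                "麻辣火锅", "中国人", "中华人民共和国");
        System.out.println(suggest("麻辣将", dic, 0.5f));
    }
}
```

Under these assumptions "麻辣火锅" scores exactly 0.5 against "麻辣将" (two edits over four characters), so raising the accuracy even slightly above 0.5 would drop it from the suggestions.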

Testing it:

    
package com.fox.lab;
 
import java.io.IOException;
 
public class DidYouMeanMainApp {
 
    /**
     * @param args
     */
    public static void main(String[] args) {
        // create the index
        DidYouMeanIndexer indexer = new DidYouMeanIndexer();
        String spellIndexPath = "D:\\spellchecker";
        String idcFilePath = "D:\\dic.txt";
        String oriIndexPath = "D:\\solrHome\\example\\solr\\data\\index";
        String fieldName = "ab";
        DidYouMeanSearcher searcher = new DidYouMeanSearcher(spellIndexPath,
                oriIndexPath, fieldName);
        searcher.setAccuracy(0.5f);
        int suggestionsNumber = 15;
        String queryString = "麻辣将";
        // try {
        //     indexer.createSpellIndex(spellIndexPath, idcFilePath);
        //     indexer.createSpellIndex(oriIndexPath, fieldName, spellIndexPath);
        // } catch (IOException e) {
        //     e.printStackTrace();
        // }
        String[] result = searcher.search(queryString, suggestionsNumber);
        if (result == null || result.length == 0) {
            // "I don't know what you want; maybe you are already right!"
            System.out.println("我不知道你要什么，或许你就是对的！");
        } else {
            // "Did you mean:"
            System.out.println("你是不是想找：");
            for (int i = 0; i < result.length; i++) {
                System.out.println(result[i]);
            }
        }
    }
 
}

        

Output (the first line means "Did you mean:"):

    
你是不是想找：
麻辣酱
麻辣火锅
麻辣烫

Changing queryString to "中文测式" gives:

    
你是不是想找：
中文测试

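When the input is already correct, for example "中文测试", the output is the fallback message ("I don't know what you want; maybe you are already right!"):

我不知道你要什么，或许你就是对的！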

Tags: java, lucene, search

Posted by IT瘾 on November 26, 2015 at 12:42:00 AM
