[Original] Indexing and Searching Across Multiple Lucene Indexes

Tags: | Published: 2011-12-13 21:17 | Author: shirdrn
Source: http://blog.csdn.net/shirdrn

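Lucene can create multiple index directories and store several indexes side by side. A natural concern is whether documents that were scattered across several index directories at indexing time can still be scored with globally computed relevance at search time. They can: Lucene's ParallelMultiSearcher and MultiSearcher both support global scoring. In other words, although the index is spread over several directories, a search aggregates the data from all of them for query matching and score computation.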


Handling Index Directories


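Below we distribute documents at random across 26 directories named a through z, and implement an indexing program and a search program to verify how Lucene computes scores.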

First, we implement a helper class that builds the index directories and prepares the searchers. The code is as follows:

package org.shirdrn.lucene;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.shirdrn.lucene.MultipleIndexing.IndexWriterObj;

/**
 * Indexing across multiple Lucene indexes.
 * 
 * @author shirdrn
 * @date   2011-12-12
 */
public class IndexHelper {
	
	private static WriterHelper writerHelper = null;
	private static SearcherHelper searcherHelper = null;
	
	public static WriterHelper newWriterHelper(String root, IndexWriterConfig indexConfig) {
		return WriterHelper.newInstance(root, indexConfig);
	}
	
	public static SearcherHelper newSearcherHelper(String root, IndexWriterConfig indexConfig) {
		return SearcherHelper.newInstance(root, indexConfig);
	}

	protected static class WriterHelper {
		private String alphabet = "abcdefghijklmnopqrstuvwxyz";
		private Lock locker = new ReentrantLock();
		private String indexRootDir = null;
		private IndexWriterConfig indexConfig;
		private Map<Character, IndexWriterObj> indexWriters = new HashMap<Character, IndexWriterObj>();
		private static Random random = new Random();
		private WriterHelper() {
			
		}
		private synchronized static WriterHelper newInstance(String root, IndexWriterConfig indexConfig) {
			if(writerHelper==null) {
				writerHelper = new WriterHelper();
				writerHelper.indexRootDir = root;
				writerHelper.indexConfig = indexConfig;
			}
			return writerHelper;
		}
		public IndexWriterObj selectIndexWriter() {
			int pos = random.nextInt(alphabet.length());
			char ch = alphabet.charAt(pos);
			String dir = String.valueOf(ch);
			locker.lock();
			try {
				File path = new File(indexRootDir, dir);
				if(!path.exists()) {
					// mkdirs() also creates the index root if it does not exist yet
					path.mkdirs();
				}
				if(!indexWriters.containsKey(ch)) {
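					// Deprecated constructor: only the analyzer is taken from indexConfig,
					// so its other settings (for example the open mode) are not applied here
					IndexWriter indexWriter = new IndexWriter(FSDirectory.open(path), indexConfig.getAnalyzer(), MaxFieldLength.UNLIMITED);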
					indexWriters.put(ch, new IndexWriterObj(indexWriter, dir));
				}
			} catch (CorruptIndexException e) {
				e.printStackTrace();
			} catch (LockObtainFailedException e) {
				e.printStackTrace();
			} catch (IOException e) {
				e.printStackTrace();
			} finally {
				locker.unlock();
			}
			return indexWriters.get(ch);
		}
		@SuppressWarnings("deprecation")
		public void closeAll(boolean autoOptimize) {
			Iterator<Map.Entry<Character, IndexWriterObj>> iter = indexWriters.entrySet().iterator();
			while(iter.hasNext()) {
				Map.Entry<Character, IndexWriterObj> entry = iter.next();
				try {
					if(autoOptimize) {
						entry.getValue().indexWriter.optimize();
					}
					entry.getValue().indexWriter.close();
				} catch (CorruptIndexException e) {
					e.printStackTrace();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}
	
	protected static class SearcherHelper {
		private List<IndexSearcher> searchers = new ArrayList<IndexSearcher>();
		private Similarity similarity = new DefaultSimilarity();
		private SearcherHelper() {
			
		}
		private synchronized static SearcherHelper newInstance(String root, IndexWriterConfig indexConfig) {
			if(searcherHelper==null) {
				searcherHelper = new SearcherHelper();
				if(indexConfig.getSimilarity()!=null) {
					searcherHelper.similarity = indexConfig.getSimilarity();
				}
				File indexRoot = new File(root);
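				File[] files = indexRoot.listFiles();
				if(files==null) {
					// listFiles() returns null when the root is missing or is not a directory
					files = new File[0];
				}
				for(File f : files) {
					if(!f.isDirectory()) {
						continue; // only sub-directories hold indexes
					}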
					IndexSearcher searcher = null;
					try {
						searcher = new IndexSearcher(FSDirectory.open(f));
					} catch (CorruptIndexException e) {
						e.printStackTrace();
					} catch (IOException e) {
						e.printStackTrace();
					}
					if(searcher!=null) {
						searcher.setSimilarity(searcherHelper.similarity);
						searcherHelper.searchers.add(searcher);
					}
				}
			}
			return searcherHelper;
		}
		public void closeAll() {
			Iterator<IndexSearcher> iter = searchers.iterator();
			while(iter.hasNext()) {
				try {
					iter.next().close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
		public Searchable[] getSearchers() {
			Searchable[] a = new Searchable[searchers.size()];
			return searchers.toArray(a);
		}
	}
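}

Because several Directory instances are open at the same time during indexing, and each Directory has its own IndexWriter, we use the 26 letters a through z as the names of the IndexWriters and wrap each IndexWriter together with its directory name in an IndexWriterObj object, which makes the actual data distribution easy to see in the logs. When building a Lucene Document, the name of the index directory (one of the letters a through z) is stored as a Field. When indexing, you only need to call selectIndexWriter() on IndexHelper.WriterHelper, which automatically picks an IndexWriter instance to index with.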
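
The selection step described above can be sketched without any Lucene dependency. This is a minimal illustration rather than the author's implementation: `ShardSelector` and `ShardWriter` are hypothetical names, and `ShardWriter` only stands in for an IndexWriter opened on the chosen directory. It assumes Java 8 or later.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

class ShardSelector {
    /** Hypothetical stand-in for an IndexWriter bound to one directory. */
    static final class ShardWriter {
        final String name;
        ShardWriter(String name) { this.name = name; }
    }

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private final Map<Character, ShardWriter> writers = new ConcurrentHashMap<>();
    private final Random random;

    ShardSelector(Random random) { this.random = random; }

    /** Picks a random letter and lazily creates the writer for it, at most once per letter. */
    ShardWriter select() {
        char ch = ALPHABET.charAt(random.nextInt(ALPHABET.length()));
        return writers.computeIfAbsent(ch, c -> new ShardWriter(String.valueOf(c)));
    }

    /** Number of writers created so far (never more than 26). */
    int openWriters() { return writers.size(); }
}
```

In this sketch `computeIfAbsent` plays the role of the explicit lock in selectIndexWriter(): it guarantees that concurrent callers asking for the same letter receive the same writer object. A caller would invoke `select()` once per document and index into the returned writer.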

When searching, IndexHelper.SearcherHelper supplies the array of Searchable instances: call getSearchers() to obtain it and pass it to MultiSearcher to build the search.


Indexing Implementation


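Our data source is MongoDB, read directly with a given query condition, so all the code that interacts with MongoDB is encapsulated inside the indexing code as inner classes. The implementation that performs the indexing is as follows: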

package org.shirdrn.lucene;

import java.io.IOException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.shirdrn.lucene.IndexHelper.WriterHelper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;
import com.mongodb.MongoException;

/**
 * Indexing across multiple Lucene indexes.
 * 
 * @author shirdrn
 * @date   2011-12-12
 */
public class MultipleIndexing {

	private static Logger LOG = LoggerFactory.getLogger(MultipleIndexing.class);
	private DBCollection pageColletion;
	private WriterHelper writerHelper;
	private Map<IndexWriter, IntCounter> docCountPerIndexWriter = new HashMap<IndexWriter, IntCounter>();
	private int maxIndexCommitCount = 100;
	private AtomicInteger docCounter = new AtomicInteger();

	public MultipleIndexing(String indexRoot, int maxIndexCommitCount, MongoConfig mongoConfig, IndexWriterConfig indexConfig) {
		super();
		if (maxIndexCommitCount != 0) {
			this.maxIndexCommitCount = maxIndexCommitCount;
		}
		pageColletion = MongoHelper.newHelper(mongoConfig).getCollection(mongoConfig.collectionName);
		writerHelper = IndexHelper.newWriterHelper(indexRoot, indexConfig);
	}

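	/**
	 * Reads the documents matching the given conditions from MongoDB and
	 * indexes each one into a randomly selected index directory.
	 * @param conditions MongoDB query conditions (field name to value)
	 */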
	public void index(Map<String, Object> conditions) {
		DBCursor cursor = pageColletion.find(new BasicDBObject(conditions));
		try {
			while (cursor.hasNext()) {
				try {
					IndexWriterObj obj = writerHelper.selectIndexWriter();
					Document document = encapsulate(cursor.next().toMap(), obj.name);
					obj.indexWriter.addDocument(document);
					int total = docCounter.incrementAndGet();
					LOG.info("Global docCounter: " + total);
					increment(obj.indexWriter);
					checkCommit(obj.indexWriter);
				} catch (MongoException e) {
					e.printStackTrace();
				} catch (CorruptIndexException e) {
					e.printStackTrace();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			finallyCommitAll();
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			cursor.close();
			writerHelper.closeAll(true);
			LOG.info("Close all indexWriters.");
		}
	}

	private void finallyCommitAll() throws Exception {
		Iterator<IndexWriter> iter = docCountPerIndexWriter.keySet().iterator();
		while(iter.hasNext()) {
			iter.next().commit();
		}
	}

	private void checkCommit(IndexWriter indexWriter) throws Exception {
		if(docCountPerIndexWriter.get(indexWriter).value%maxIndexCommitCount==0) {
			indexWriter.commit();
			LOG.info("Commit: " + indexWriter + ", " + docCountPerIndexWriter.get(indexWriter).value);
		}
	}

	private void increment(IndexWriter indexWriter) {
		IntCounter counter = docCountPerIndexWriter.get(indexWriter);
		if (counter == null) {
			counter = new IntCounter(1);
			docCountPerIndexWriter.put(indexWriter, counter);
		} else {
			++counter.value;
		}
	}

	@SuppressWarnings("unchecked")
	private Document encapsulate(Map map, String path) {
		String title = (String) map.get("title");
		String content = (String) map.get("content");
		String url = (String) map.get("url");
		Document doc = new Document();
		doc.add(new Field(FieldName.TITLE, title, Store.YES, Index.ANALYZED_NO_NORMS));
		doc.add(new Field(FieldName.CONTENT, content, Store.NO, Index.ANALYZED));
		doc.add(new Field(FieldName.URL, url, Store.YES, Index.NOT_ANALYZED_NO_NORMS));
		doc.add(new Field(FieldName.PATH, path, Store.YES, Index.NOT_ANALYZED_NO_NORMS));
		return doc;
	}

	protected interface FieldName {
		public static final String TITLE = "title";
		public static final String CONTENT = "content";
		public static final String URL = "url";
		public static final String PATH = "path";
	}

	protected class IntCounter {
		public IntCounter(int value) {
			super();
			this.value = value;
		}
		private int value;
	}
	
	protected static class IndexWriterObj {
		IndexWriter indexWriter;
		String name;
		public IndexWriterObj(IndexWriter indexWriter, String name) {
			super();
			this.indexWriter = indexWriter;
			this.name = name;
		}
		@Override
		public String toString() {
			return "[" + name + "]";
		}
	}
	
	public static class MongoConfig implements Serializable {
		private static final long serialVersionUID = -3028092758346115702L;
		private String host;
		private int port;
		private String dbname;
		private String collectionName;

		public MongoConfig(String host, int port, String dbname, String collectionName) {
			super();
			this.host = host;
			this.port = port;
			this.dbname = dbname;
			this.collectionName = collectionName;
		}

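		@Override
		public boolean equals(Object obj) {
			// Reject null and foreign types before casting
			if (!(obj instanceof MongoConfig)) {
				return false;
			}
			MongoConfig other = (MongoConfig) obj;
			return host.equals(other.host) && port == other.port && dbname.equals(other.dbname) && collectionName.equals(other.collectionName);
		}

		// Keep hashCode() consistent with equals()
		@Override
		public int hashCode() {
			int h = host.hashCode();
			h = 31 * h + port;
			h = 31 * h + dbname.hashCode();
			return 31 * h + collectionName.hashCode();
		}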
	}

	protected static class MongoHelper {
		private static Mongo mongo;
		private static MongoHelper helper;
		private MongoConfig mongoConfig;

		private MongoHelper(MongoConfig mongoConfig) {
			super();
			this.mongoConfig = mongoConfig;
		}

		public synchronized static MongoHelper newHelper(MongoConfig mongoConfig) {
			try {
				if (helper == null) {
					helper = new MongoHelper(mongoConfig);
					mongo = new Mongo(mongoConfig.host, mongoConfig.port);
					Runtime.getRuntime().addShutdownHook(new Thread() {
						@Override
						public void run() {
							if (mongo != null) {
								mongo.close();
							}
						}
					});
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
			return helper;
		}

		public DBCollection getCollection(String collectionName) {
			DBCollection c = null;
			try {
				c = mongo.getDB(mongoConfig.dbname).getCollection(collectionName);
			} catch (Exception e) {
				e.printStackTrace();
			}
			return c;
		}
	}
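}

The code above is single-threaded. If your application has a massive amount of data, this approach is bound to limit indexing throughput. It is easy to convert to multiple threads, though. The basic idea is to buffer a suitable amount of source data in memory and then, following the producer-consumer model, start several consumer threads that take data and push it to the IndexWriters concurrently. Note in particular that IndexWriter is already thread-safe, so do not synchronize on an IndexWriter instance yourself, since doing so can easily cause a deadlock.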
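
The producer-consumer idea can be sketched as follows. This is an illustration, not the author's code: `ParallelIndexingSketch` and `indexAll` are hypothetical names, documents are plain strings, and the `sink` parameter stands in for a call such as `indexWriter.addDocument(...)`, which is assumed to be thread-safe.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ParallelIndexingSketch {
    // Sentinel telling a consumer that no more documents will arrive;
    // new String(...) makes it a unique object that real data can never be
    private static final String POISON_PILL = new String("__END__");

    /** Feeds every document to the sink from several consumer threads. */
    static void indexAll(List<String> docs, int consumers, Consumer<String> sink) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounded in-memory buffer
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc == POISON_PILL) {   // identity comparison on purpose
                            return;
                        }
                        sink.accept(doc);           // no extra lock around the thread-safe sink
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (String doc : docs) {
                queue.put(doc);                     // the producer blocks while the buffer is full
            }
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON_PILL);             // one pill per consumer
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            pool.shutdownNow();                     // stop the consumers if the producer is interrupted
            Thread.currentThread().interrupt();
        }
    }
}
```

The bounded queue is the "appropriate buffering" mentioned above: it keeps memory use flat when the data source is faster than the indexers. Error handling inside the sink is deliberately left out of the sketch.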

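One more point concerns data distribution and updates. Choosing the index directory at random spreads the data fairly evenly over the directories, but it also means that an update, if handled carelessly, produces duplicates, because the new copy of a document may land in a different directory from the old one. The remedy is to add duplicate detection outside the indexer, so that duplicate (or very similar) documents are not indexed again.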
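
One simple form of such an external check is an exact-match filter on a unique key, for example the url field. The sketch below rests on several assumptions: `DuplicateFilter` and `markIfNew` are hypothetical names, the set lives only in memory, and it catches exact repeats only. Detecting very similar documents would need a similarity measure that is not shown here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class DuplicateFilter {
    // Keys (for example URLs) of documents already handed to an IndexWriter
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a key is offered and false on every repeat. */
    boolean markIfNew(String key) {
        return seen.add(key.trim());
    }
}
```

An indexing loop would call `markIfNew(url)` before selecting an IndexWriter and skip the document when the result is false.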

Now let's look at the test case for indexing. The code is as follows:

package org.shirdrn.lucene;

import java.util.HashMap;
import java.util.Map;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.util.Version;
import org.shirdrn.lucene.MultipleIndexing.MongoConfig;

public class TestMultipleIndexing extends TestCase {

	MultipleIndexing indexer;
	
	@Override
	protected void setUp() throws Exception {
		MongoConfig mongoConfig = new MongoConfig("192.168.0.184", 27017, "page", "Article");
		String indexRoot = "E:\\Store\\indexes";
		int maxIndexCommitCount = 200;
		Analyzer a = new SmartChineseAnalyzer(Version.LUCENE_35, true);
		IndexWriterConfig indexConfig = new IndexWriterConfig(Version.LUCENE_35, a);
		indexConfig.setOpenMode(OpenMode.CREATE);
		indexer = new MultipleIndexing(indexRoot, maxIndexCommitCount, mongoConfig, indexConfig);
	}
	
	@Override
	protected void tearDown() throws Exception {
		super.tearDown();
	}
	
	public void testIndexing() {
		Map<String, Object> conditions = new HashMap<String, Object>();
		conditions.put("spiderName", "sinaSpider");
		indexer.index(conditions);
	}
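}

In my run, a little over 90,000 documents were indexed, and the resulting indexes were distributed fairly evenly across the 26 directories named a through z.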
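
A quick simulation gives a feel for how even a uniform random assignment is. The sketch below is not part of the original program: `DistributionCheck` is a hypothetical name, the seed is arbitrary, and a natural input for `total` is 93245, the maxDocs value reported in the explain output later in this post.

```java
import java.util.Random;

class DistributionCheck {
    /** Assigns total documents to buckets uniformly at random and returns the per-bucket counts. */
    static int[] simulate(int total, int buckets, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[buckets];
        for (int i = 0; i < total; i++) {
            counts[random.nextInt(buckets)]++;
        }
        return counts;
    }
}
```

With 93,245 documents and 26 directories the mean is about 3,586 documents per directory, and uniform random assignment typically stays within a few percent of that mean.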

Search Implementation


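When searching, you can choose either ParallelMultiSearcher or MultiSearcher. MultiSearcher loops over the indexes one after another to collect the results, whereas ParallelMultiSearcher starts multiple threads to search them in parallel. Their relative efficiency differs from one machine configuration to another, so choose according to your needs. I simply used MultiSearcher to build the search. The implementation is as follows: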

package org.shirdrn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Version;
import org.shirdrn.lucene.IndexHelper.SearcherHelper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Searching across multiple Lucene indexes.
 * 
 * @author shirdrn
 * @date   2011-12-12
 */
public class MultipleSearching {

	private static Logger LOG = LoggerFactory.getLogger(MultipleSearching.class);
	private SearcherHelper searcherHelper;
	private Searcher searcher;
	private QueryParser queryParser;
	private IndexWriterConfig indexConfig;
	
	private Query query;
	private ScoreDoc[] scoreDocs;
	
	public MultipleSearching(String indexRoot, IndexWriterConfig indexConfig) {
		searcherHelper = IndexHelper.newSearcherHelper(indexRoot, indexConfig);
		this.indexConfig = indexConfig;
		try {
			searcher = new MultiSearcher(searcherHelper.getSearchers());
			searcher.setSimilarity(indexConfig.getSimilarity());
			queryParser = new QueryParser(Version.LUCENE_35, "content", indexConfig.getAnalyzer());
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	public void search(String queries) {
		try {
			query = queryParser.parse(queries);
			TopScoreDocCollector collector = TopScoreDocCollector.create(100000, true);
			searcher.search(query, collector);
			scoreDocs = collector.topDocs().scoreDocs;
		} catch (ParseException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	public void iterateDocs(int start, int end) {
		for (int i = start; i < Math.min(scoreDocs.length, end); i++) {
			try {
				LOG.info(searcher.doc(scoreDocs[i].doc).toString());
			} catch (CorruptIndexException e) {
				e.printStackTrace();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
	
	public void explain(int start, int end) {
		for (int i = start; i < Math.min(scoreDocs.length, end); i++) {
			try {
				System.out.println(searcher.explain(query, scoreDocs[i].doc));
			} catch (CorruptIndexException e) {
				e.printStackTrace();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
	
	public void close() {
		searcherHelper.closeAll();
		try {
			searcher.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
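}

One of our goals is to check that the search really runs over multiple index directories and that the final relevance ranking is computed from all of them. The implementation above is closer to a test harness: iterateDocs() iterates over the matching documents and prints them, and explain() shows the details of the score computation.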
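
The difference between the sequential MultiSearcher and the threaded ParallelMultiSearcher can be pictured with a small stand-alone sketch. It is not Lucene's implementation: `MultiSearchSketch`, `Hit`, and the `Supplier`-based shards are hypothetical, and the scores are assumed to be globally comparable already. The sequential method queries one shard after another; the parallel method runs one task per shard and then merges in the same way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

class MultiSearchSketch {
    /** One hit: the shard it came from and its score. */
    static final class Hit {
        final String shard;
        final double score;
        Hit(String shard, double score) { this.shard = shard; this.score = score; }
    }

    /** Sequential variant: query the shards one after another, then merge. */
    static List<Hit> searchSequential(List<Supplier<List<Hit>>> shards, int topN) {
        List<Hit> all = new ArrayList<>();
        for (Supplier<List<Hit>> shard : shards) {
            all.addAll(shard.get());
        }
        return top(all, topN);
    }

    /** Parallel variant: one task per shard, same merge step afterwards. */
    static List<Hit> searchParallel(List<Supplier<List<Hit>>> shards, int topN) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (Supplier<List<Hit>> shard : shards) {
                Callable<List<Hit>> task = shard::get;
                futures.add(pool.submit(task));
            }
            List<Hit> all = new ArrayList<>();
            for (Future<List<Hit>> future : futures) {
                all.addAll(future.get());           // waits for that shard to finish
            }
            return top(all, topN);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Sorts by descending score and keeps the best topN hits. */
    private static List<Hit> top(List<Hit> hits, int topN) {
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(topN, hits.size())));
    }
}
```

Both methods return the same ranking; only the elapsed time differs, which is why the better choice depends on the machine.
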
The test case for searching is as follows:

package org.shirdrn.lucene;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import junit.framework.TestCase;

public class TestMultipleSearching extends TestCase {

	MultipleSearching searcher;
	
	@Override
	protected void setUp() throws Exception {
		String indexRoot = "E:\\Store\\indexes";
		Analyzer a = new SmartChineseAnalyzer(Version.LUCENE_35, true);
		IndexWriterConfig indexConfig = new IndexWriterConfig(Version.LUCENE_35, a);
		searcher = new MultipleSearching(indexRoot, indexConfig);
	}
	
	@Override
	protected void tearDown() throws Exception {
		searcher.close();
	}
	
	public void testSearching() {
		searcher.search("+title:拉斯维加斯^1.25 (+content:美国^1.50 +content:拉斯维加斯)");
		searcher.iterateDocs(0, 10);
		searcher.explain(0, 5);
	}
}

The documents returned by iterating over the search results are shown below:

2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:全新体验 拉斯维加斯的完美24小时(组图)(4)_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2010-08-16/1400141443_4.shtml> stored,indexed,omitNorms<path:x>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:拉斯维加斯 触摸你的奢侈底线(组图)_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2009-05-21/095684952.shtml> stored,indexed,omitNorms<path:v>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:美国拉斯维加斯地图_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2008-08-20/113317460.shtml> stored,indexed,omitNorms<path:a>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:美国拉斯维加斯:潮野水上乐园_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2008-08-20/093217358.shtml> stored,indexed,omitNorms<path:e>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:美国拉斯维加斯:米高梅历险游乐园_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2008-08-20/095617381.shtml> stored,indexed,omitNorms<path:k>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:美国拉斯维加斯主要景点_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2008-08-20/114817479.shtml> stored,indexed,omitNorms<path:m>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:娱乐之都拉斯维加斯宣布在中国推旅游市场新战略_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/news/2008-11-19/094337435.shtml> stored,indexed,omitNorms<path:j>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:美国拉斯维加斯简介_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/world/2008-08-19/160017116.shtml> stored,indexed,omitNorms<path:v>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:拉斯维加斯“猫王模仿秀”亮相国际旅交会_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/news/2009-11-23/1004116788.shtml> stored,indexed,omitNorms<path:j>>
2011-12-13 12:12:55 org.shirdrn.lucene.MultipleSearching iterateDocs
信息: Document<stored,indexed,tokenized,omitNorms<title:10大美食家的饕餮名城:拉斯维加斯(图)(4)_新浪旅游_新浪网> stored,indexed,omitNorms<url:http://travel.sina.com.cn/food/2009-01-16/090855088.shtml> stored,indexed,omitNorms<path:s>>

The path field shows that the results were aggregated from multiple index directories.

The relevance scores of the search results are shown below:

7.3240967 = (MATCH) sum of:
  6.216492 = (MATCH) weight(title:拉斯维加斯^1.25 in 616), product of:
    0.7747233 = queryWeight(title:拉斯维加斯^1.25), product of:
      1.25 = boost
      8.024145 = idf(docFreq=82, maxDocs=93245)
      0.07723921 = queryNorm
    8.024145 = (MATCH) fieldWeight(title:拉斯维加斯 in 616), product of:
      1.0 = tf(termFreq(title:拉斯维加斯)=1)
      8.024145 = idf(docFreq=82, maxDocs=93245)
      1.0 = fieldNorm(field=title, doc=616)
  1.1076047 = (MATCH) sum of:
    0.24895692 = (MATCH) weight(content:美国^1.5 in 616), product of:
      0.39020002 = queryWeight(content:美国^1.5), product of:
        1.5 = boost
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.07723921 = queryNorm
      0.63802385 = (MATCH) fieldWeight(content:美国 in 616), product of:
        1.7320508 = tf(termFreq(content:美国)=3)
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.109375 = fieldNorm(field=content, doc=616)
    0.8586478 = (MATCH) weight(content:拉斯维加斯 in 616), product of:
      0.49754182 = queryWeight(content:拉斯维加斯), product of:
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.07723921 = queryNorm
      1.7257802 = (MATCH) fieldWeight(content:拉斯维加斯 in 616), product of:
        2.4494898 = tf(termFreq(content:拉斯维加斯)=6)
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.109375 = fieldNorm(field=content, doc=616)

7.2405667 = (MATCH) sum of:
  6.216492 = (MATCH) weight(title:拉斯维加斯^1.25 in 2850), product of:
    0.7747233 = queryWeight(title:拉斯维加斯^1.25), product of:
      1.25 = boost
      8.024145 = idf(docFreq=82, maxDocs=93245)
      0.07723921 = queryNorm
    8.024145 = (MATCH) fieldWeight(title:拉斯维加斯 in 2850), product of:
      1.0 = tf(termFreq(title:拉斯维加斯)=1)
      8.024145 = idf(docFreq=82, maxDocs=93245)
      1.0 = fieldNorm(field=title, doc=2850)
  1.0240744 = (MATCH) sum of:
    0.17423354 = (MATCH) weight(content:美国^1.5 in 2850), product of:
      0.39020002 = queryWeight(content:美国^1.5), product of:
        1.5 = boost
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.07723921 = queryNorm
      0.44652367 = (MATCH) fieldWeight(content:美国 in 2850), product of:
        1.4142135 = tf(termFreq(content:美国)=2)
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.09375 = fieldNorm(field=content, doc=2850)
    0.8498409 = (MATCH) weight(content:拉斯维加斯 in 2850), product of:
      0.49754182 = queryWeight(content:拉斯维加斯), product of:
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.07723921 = queryNorm
      1.7080793 = (MATCH) fieldWeight(content:拉斯维加斯 in 2850), product of:
        2.828427 = tf(termFreq(content:拉斯维加斯)=8)
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.09375 = fieldNorm(field=content, doc=2850)

7.1128473 = (MATCH) sum of:
  6.216492 = (MATCH) weight(title:拉斯维加斯^1.25 in 63), product of:
    0.7747233 = queryWeight(title:拉斯维加斯^1.25), product of:
      1.25 = boost
      8.024145 = idf(docFreq=82, maxDocs=93245)
      0.07723921 = queryNorm
    8.024145 = (MATCH) fieldWeight(title:拉斯维加斯 in 63), product of:
      1.0 = tf(termFreq(title:拉斯维加斯)=1)
      8.024145 = idf(docFreq=82, maxDocs=93245)
      1.0 = fieldNorm(field=title, doc=63)
  0.896355 = (MATCH) sum of:
    0.1451946 = (MATCH) weight(content:美国^1.5 in 63), product of:
      0.39020002 = queryWeight(content:美国^1.5), product of:
        1.5 = boost
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.07723921 = queryNorm
      0.37210304 = (MATCH) fieldWeight(content:美国 in 63), product of:
        1.4142135 = tf(termFreq(content:美国)=2)
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=63)
    0.7511604 = (MATCH) weight(content:拉斯维加斯 in 63), product of:
      0.49754182 = queryWeight(content:拉斯维加斯), product of:
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.07723921 = queryNorm
      1.5097432 = (MATCH) fieldWeight(content:拉斯维加斯 in 63), product of:
        3.0 = tf(termFreq(content:拉斯维加斯)=9)
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=63)

7.1128473 = (MATCH) sum of:
  6.216492 = (MATCH) weight(title:拉斯维加斯^1.25 in 2910), product of:
    0.7747233 = queryWeight(title:拉斯维加斯^1.25), product of:
      1.25 = boost
      8.024145 = idf(docFreq=82, maxDocs=93245)
      0.07723921 = queryNorm
    8.024145 = (MATCH) fieldWeight(title:拉斯维加斯 in 2910), product of:
      1.0 = tf(termFreq(title:拉斯维加斯)=1)
      8.024145 = idf(docFreq=82, maxDocs=93245)
      1.0 = fieldNorm(field=title, doc=2910)
  0.896355 = (MATCH) sum of:
    0.1451946 = (MATCH) weight(content:美国^1.5 in 2910), product of:
      0.39020002 = queryWeight(content:美国^1.5), product of:
        1.5 = boost
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.07723921 = queryNorm
      0.37210304 = (MATCH) fieldWeight(content:美国 in 2910), product of:
        1.4142135 = tf(termFreq(content:美国)=2)
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=2910)
    0.7511604 = (MATCH) weight(content:拉斯维加斯 in 2910), product of:
      0.49754182 = queryWeight(content:拉斯维加斯), product of:
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.07723921 = queryNorm
      1.5097432 = (MATCH) fieldWeight(content:拉斯维加斯 in 2910), product of:
        3.0 = tf(termFreq(content:拉斯维加斯)=9)
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=2910)

7.1128473 = (MATCH) sum of:
  6.216492 = (MATCH) weight(title:拉斯维加斯^1.25 in 2920), product of:
    0.7747233 = queryWeight(title:拉斯维加斯^1.25), product of:
      1.25 = boost
      8.024145 = idf(docFreq=82, maxDocs=93245)
      0.07723921 = queryNorm
    8.024145 = (MATCH) fieldWeight(title:拉斯维加斯 in 2920), product of:
      1.0 = tf(termFreq(title:拉斯维加斯)=1)
      8.024145 = idf(docFreq=82, maxDocs=93245)
      1.0 = fieldNorm(field=title, doc=2920)
  0.896355 = (MATCH) sum of:
    0.1451946 = (MATCH) weight(content:美国^1.5 in 2920), product of:
      0.39020002 = queryWeight(content:美国^1.5), product of:
        1.5 = boost
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.07723921 = queryNorm
      0.37210304 = (MATCH) fieldWeight(content:美国 in 2920), product of:
        1.4142135 = tf(termFreq(content:美国)=2)
        3.3678925 = idf(docFreq=8734, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=2920)
    0.7511604 = (MATCH) weight(content:拉斯维加斯 in 2920), product of:
      0.49754182 = queryWeight(content:拉斯维加斯), product of:
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.07723921 = queryNorm
      1.5097432 = (MATCH) fieldWeight(content:拉斯维加斯 in 2920), product of:
        3.0 = tf(termFreq(content:拉斯维加斯)=9)
        6.4415708 = idf(docFreq=403, maxDocs=93245)
        0.078125 = fieldNorm(field=content, doc=2920)

As you can see, the relevance scores are computed over all of the indexes together (maxDocs=93245).
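
The numbers in the explain output can be checked by hand. The sketch below assumes the formulas of Lucene 3.x DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)) and tf = sqrt(termFreq); `ScoreCheck` and its method names are hypothetical, and the inputs are meant to be copied from the explain output above.

```java
class ScoreCheck {
    /** idf under the assumed DefaultSimilarity formula: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** tf under the assumed DefaultSimilarity formula: the square root of the term frequency. */
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    /** Score contribution of one term clause: queryWeight * fieldWeight. */
    static double termScore(double boost, int termFreq, int docFreq, int numDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, numDocs);
        double queryWeight = boost * idf * queryNorm;
        double fieldWeight = tf(termFreq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }
}
```

For the first hit, `termScore(1.25, 1, 82, 93245, 0.07723921, 1.0)` comes to about 6.2165, matching the title clause, and adding the two content clauses gives about 7.3241. Small differences in the last digits are expected because Lucene computes in float while this sketch uses double. Had idf been computed per directory, maxDocs would presumably be close to 93245 / 26, roughly 3,586, rather than 93245.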

Author: shirdrn. Posted 2011/12/13 13:17:26.