PageRank and relevance
My searches keep turning up articles that I collected myself long ago
--------------------------------------------------------
http://www.douban.com/note/74801405/
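I have long been puzzled about how PageRank and relevance are combined.
Tonight I started tinkering with search, did some reading, and excerpted a few passages.
Although every search engine keeps its exact search algorithm strictly secret, search engine analysts believe that search results (the ranked list) are the product of "Page Relevance" and "PageRank":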
Ranking = (Page Relevance) x (PageRank)
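As a toy illustration of the formula above (the page names and scores are made up, since real engines do not publish these values), multiplying the two factors lets a well-linked page outrank a slightly more relevant one:

```python
# Hypothetical scores, chosen only to illustrate Ranking = Relevance x PageRank.
pages = {
    "page_a": {"relevance": 0.9, "pagerank": 2.0},
    "page_b": {"relevance": 0.6, "pagerank": 5.0},
    "page_c": {"relevance": 0.1, "pagerank": 9.0},
}


def ranking(scores):
    """Combine the two factors by multiplication, as in the formula above."""
    return scores["relevance"] * scores["pagerank"]


# Sort page names from highest to lowest combined score.
ordered = sorted(pages, key=lambda name: ranking(pages[name]), reverse=True)
print(ordered)
```

A page with a very high PageRank but almost no relevance (page_c here) still ends up last, because the product is dominated by its small relevance factor.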
..........
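If you run a broad search on Google, there appear to be thousands of results, but at most the first 1,000 are actually displayed. For example, for "car rental" the reported result count is 5,110,000, yet only 826 results are actually shown, and the query takes only 0.81 seconds. Think about it: is a fraction of a second really enough to compute every ranking-factor score for each of those five million results and then produce the final ranking we see?
The answer is that the search engine speeds up the search by selecting the pages most relevant to the query to form a subset. For example, suppose the subset holds 2,000 entries. The search engine queries the whole database using only two or three of the ranking factors and finds the 2,000 pages that score highest on those factors. (Remember that although there may be more than five million matches, the 1,000 results finally displayed are drawn from this 2,000-page subset.) The search engine then applies all of the ranking factors to the 2,000 results in the subset and ranks them accordingly. Because the subset is sorted by relevance, results further down the subset (results, not web pages) are less relevant (lower quality), so the search engine shows the user only the 1,000 results most relevant to the query.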
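The two-stage idea described above can be sketched as follows. This is only a minimal model under my own assumptions: the function name, the tuple layout of the corpus, and the choice of relevance as the cheap first-stage score are all invented for illustration, and the subset sizes are shrunk so the example runs quickly.

```python
import heapq


def two_stage_rank(docs, cheap_score, full_score, subset_size=2000, shown=1000):
    """Pick a candidate subset with a cheap score, then fully rank only that subset."""
    # Stage 1: one cheap pass over every document keeps the best `subset_size`.
    candidates = heapq.nlargest(subset_size, docs, key=cheap_score)
    # Stage 2: the expensive score is computed only for the candidates.
    reranked = sorted(candidates, key=full_score, reverse=True)
    return reranked[:shown]


# Hypothetical corpus of (doc_id, relevance, pagerank) tuples.
docs = [(i, (i * 37 % 100) / 100.0, (i * 53 % 10) + 1) for i in range(10000)]

top = two_stage_rank(
    docs,
    cheap_score=lambda d: d[1],        # stage 1 looks at relevance only
    full_score=lambda d: d[1] * d[2],  # stage 2 uses Relevance x PageRank
    subset_size=200,
    shown=10,
)
print(top)
```

The expensive score is evaluated 200 times instead of 10,000 times, at the cost of never showing a document that the cheap score left out of the subset.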
--------------------------------------------------------
So how can this be done in Xapian?
See:
http://lists.xapian.org/pipermail/xapian-discuss/2008-December/006258.html
[Xapian-discuss] Xapian's scoring/sorting compared to Google's
The thread mentions that this can be done with a PostingSource:
http://xapian.org/docs/postingsource.html
Examples
Here is an example of a Python PostingSource which contributes additional weight from some external source:
import xapian

class ExternalWeightPostingSource(xapian.PostingSource):
    """
    A Xapian posting source returning weights from an external source.
    """
    def __init__(self, db, wtsource):
        xapian.PostingSource.__init__(self)
        self.db = db
        self.wtsource = wtsource

    def init(self, db):
        # The empty term '' iterates over every document in the database.
        self.alldocs = db.postlist('')

    def get_termfreq_min(self): return 0
    def get_termfreq_est(self): return self.db.get_doccount()
    def get_termfreq_max(self): return self.db.get_doccount()

    def next(self, minweight):
        try:
            # The next() builtin works on both Python 2.6+ and Python 3 iterators.
            self.current = next(self.alldocs)
        except StopIteration:
            self.current = None

    def skip_to(self, docid, minweight):
        try:
            self.current = self.alldocs.skip_to(docid)
        except StopIteration:
            self.current = None

    def at_end(self):
        return self.current is None

    def get_docid(self):
        return self.current.docid

    def get_maxweight(self):
        return self.wtsource.get_maxweight()

    def get_weight(self):
        # Look up the current document and ask the external source for its weight.
        doc = self.db.get_document(self.current.docid)
        return self.wtsource.get_weight(doc)

ExternalWeightPostingSource doesn't restrict which documents match - it's intended to be combined with an existing query using OP_AND_MAYBE like so:
# ExternalWeightPostingSource is the class defined above, not part of the xapian module.
extwtps = ExternalWeightPostingSource(db, wtsource)
query = xapian.Query(xapian.Query.OP_AND_MAYBE, query, xapian.Query(extwtps))
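To make the effect of OP_AND_MAYBE concrete, here is a small pure-Python model of its behaviour. It is not Xapian code: the function name and the weight dictionaries are invented for illustration. Every document matched by the left-hand query is returned, and the right-hand side only adds weight where it has an entry. Note that the combination is a sum, whereas the Ranking formula quoted earlier is a product.

```python
def and_maybe(left_weights, right_weights):
    """Model of OP_AND_MAYBE: the left side decides which documents match,
    the right side only contributes extra weight."""
    combined = {}
    for docid, weight in left_weights.items():
        combined[docid] = weight + right_weights.get(docid, 0.0)
    return combined


# Hypothetical weights keyed by docid.
text_weights = {1: 2.0, 2: 1.5, 3: 0.5}       # from the ordinary text query
external_weights = {2: 3.0, 3: 0.25, 4: 9.0}  # e.g. a PageRank-like score

result = and_maybe(text_weights, external_weights)
ranked = sorted(result, key=result.get, reverse=True)
print(ranked)
```

Document 4 has the largest external weight but does not match the text query, so it never appears in the result.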
The wtsource would be a class like this one:
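class WeightSource:
    def get_maxweight(self):
        # Upper bound on any weight that get_weight() may return.
        return 12.34

    def get_weight(self, doc):
        # some_func is a placeholder for your own docid-to-weight lookup.
        return some_func(doc.get_docid())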
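As a more concrete variant, here is a sketch of a weight source backed by a plain dictionary of precomputed scores. The class names DictWeightSource and FakeDoc are my own; FakeDoc merely stands in for xapian.Document so the sketch runs without Xapian installed. Since ExternalWeightPostingSource forwards get_maxweight() to the weight source, the value returned there should never be lower than any weight get_weight() can return.

```python
class DictWeightSource:
    """Weight source backed by a {docid: score} dict
    (a hypothetical stand-in for precomputed PageRank data)."""

    def __init__(self, scores):
        self.scores = dict(scores)
        # The upper bound must not be lower than any weight we hand out.
        self.maxweight = max(self.scores.values()) if self.scores else 0.0

    def get_maxweight(self):
        return self.maxweight

    def get_weight(self, doc):
        # Documents without a stored score contribute no extra weight.
        return self.scores.get(doc.get_docid(), 0.0)


class FakeDoc:
    """Minimal stand-in for xapian.Document, exposing only get_docid()."""

    def __init__(self, docid):
        self._docid = docid

    def get_docid(self):
        return self._docid


source = DictWeightSource({1: 0.5, 2: 3.25})
print(source.get_weight(FakeDoc(2)), source.get_maxweight())
```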
The Xappy source code contains a perfect example of a weight-only (non-filtering) PostingSource written in Python. This would be a good addition to the postingsource docs. I have slightly edited the original.
http://code.google.com/p/xappy/source/browse/trunk/xappy/searchconnection.py
http://trac.xapian.org/ticket/503
And a demonstration of its usage:
http://xappy.googlecode.com/svn/trunk/xappy/unittests/weight_external.py