performance - High load average, low CPU usage - why? - Server Fault
We're seeing huge performance problems on a web application and we're trying to find the bottleneck. I am not a sysadmin so there is some stuff I don't quite get. Some basic investigation shows the CPU to be idle, lots of memory to be available, no swapping, no I/O, but a high load average.
The software stack on this server looks like this:
- Solaris 10
- Java 1.6
- WebLogic 10.3.5 (8 domains)
The applications running on this server talk with an Oracle database on a different server.
This server has 32GB of RAM and 10 CPUs (I think).
Running prstat -Z gives something like this:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
3836 ducm0101 2119M 2074M cpu348 58 0 8:41:56 0.5% java/225
24196 ducm0101 1974M 1910M sleep 59 0 4:04:33 0.4% java/209
6765 ducm0102 1580M 1513M cpu330 1 0 1:21:48 0.1% java/291
16922 ducm0102 2115M 1961M sleep 58 0 6:37:08 0.0% java/193
18048 root 3048K 2440K sleep 59 0 0:06:02 0.0% sa_comm/4
26619 ducm0101 2588M 2368M sleep 59 0 8:21:17 0.0% java/231
19904 ducm0104 1713M 1390M sleep 59 0 1:15:29 0.0% java/151
27809 ducm0102 1547M 1426M sleep 59 0 0:38:19 0.0% java/186
2409 root 15M 11M sleep 59 0 0:00:00 0.0% pkgserv/3
27204 root 58M 54M sleep 59 0 9:11:38 0.0% stat_daemon/1
27256 root 12M 8312K sleep 59 0 7:16:40 0.0% kux_vmstat/1
29367 root 297M 286M sleep 59 0 11:02:13 0.0% dsmc/2
22128 root 13M 6768K sleep 59 0 0:10:51 0.0% sendmail/1
22133 smmsp 13M 1144K sleep 59 0 0:01:22 0.0% sendmail/1
22003 root 5896K 240K sleep 59 0 0:00:01 0.0% automountd/2
22074 root 4776K 1992K sleep 59 0 0:00:19 0.0% sshd/1
22005 root 6184K 2728K sleep 59 0 0:00:31 0.0% automountd/2
27201 root 6248K 344K sleep 59 0 0:00:01 0.0% mount_stat/1
20964 root 2912K 160K sleep 59 0 0:00:01 0.0% ttymon/1
20947 root 1784K 864K sleep 59 0 0:02:22 0.0% utmpd/1
20900 root 3048K 608K sleep 59 0 0:00:03 0.0% ttymon/1
20979 root 77M 18M sleep 59 0 0:14:13 0.0% inetd/4
20849 daemon 2856K 864K sleep 59 0 0:00:03 0.0% lockd/2
17794 root 80M 1232K sleep 59 0 0:06:19 0.0% svc.startd/12
17645 root 3080K 728K sleep 59 0 0:00:12 0.0% init/1
17849 root 13M 6800K sleep 59 0 0:13:04 0.0% svc.configd/15
20213 root 84M 81M sleep 59 0 0:47:17 0.0% nscd/46
20871 root 2568K 600K sleep 59 0 0:00:04 0.0% sac/1
3683 ducm0101 1904K 1640K sleep 56 0 0:00:00 0.0% startWebLogic.s/1
23937 ducm0101 1904K 1640K sleep 59 0 0:00:00 0.0% startWebLogic.s/1
20766 daemon 5328K 1536K sleep 59 0 0:00:36 0.0% nfsmapid/3
20141 daemon 5968K 3520K sleep 59 0 0:01:14 0.0% kcfd/4
20093 ducm0101 2000K 376K sleep 59 0 0:00:01 0.0% pfksh/1
20797 daemon 3256K 240K sleep 59 0 0:00:01 0.0% statd/1
6181 root 4864K 2872K sleep 59 0 0:01:34 0.0% syslogd/17
7220 ducm0104 1268M 1101M sleep 59 0 0:36:35 0.0% java/138
27597 ducm0102 1904K 1640K sleep 59 0 0:00:00 0.0% startWebLogic.s/1
27867 root 37M 4568K sleep 59 0 0:13:56 0.0% kcawd/7
12685 ducm0101 4080K 208K sleep 59 0 0:00:01 0.0% vncconfig/1
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
42 135 22G 19G 59% 87:27:59 1.2% dsuniucm01
Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11
I understand that the CPU is mostly idle, but the load average is high, which is quite strange to me. Memory doesn't seem to be a problem.
Running vmstat 15 gives something like this:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s4 sd in sy cs us sy id
0 0 0 32531400 105702272 317 1052 126 0 0 0 0 13 13 -0 8 9602 107680 10964 1 1 98
0 0 0 15053368 95930224 411 2323 0 0 0 0 0 0 0 0 0 23207 47679 29958 3 2 95
0 0 0 14498568 95801960 3072 3583 0 2 2 0 0 3 3 0 21 22648 66367 28587 4 4 92
0 0 0 14343008 95656752 3080 2857 0 0 0 0 0 3 3 0 18 22338 44374 29085 3 4 94
0 0 0 14646016 95485472 1726 3306 0 0 0 0 0 0 0 0 0 24702 47499 33034 3 3 94
I understand that the CPU is mostly idle, no processes are waiting in the queue to be executed, little swapping is happening.
Running iostat 15 gives this:
tty sd0 sd1 sd4 ssd0 cpu
tin tout kps tps serv kps tps serv kps tps serv kps tps serv us sy wt id
0 676 324 13 8 322 13 8 0 0 0 159 8 0 1 1 0 98
1 1385 0 0 0 0 0 0 0 0 0 0 0 0 3 4 0 94
0 584 89 6 24 89 6 25 0 0 0 332 19 0 2 1 0 97
0 296 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 97
1 1290 43 5 24 43 5 22 0 0 0 297 20 1 3 3 0 94
Running netstat -i 15 gives the following:
When you say 'high load average' I assume you mean that prstat shows 'load average' figures at the bottom of its output:
Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11
These numbers look similar to the ones that top provides and probably mean the average run-queue size, i.e. how many 'things' are asking the CPU for time to run rather than the percentage of processor time being used. Admittedly they do look quite high, but this all depends on the app you are running; the processes may not actually be doing much once they get their slot. See here for a nice explanation regarding top.
I'm not familiar with WebLogic, but I have noticed that with Apache Tomcat many Java threads can be spawned simultaneously for what appears to be relatively few requests. It could be this that is causing those high load average numbers. Make sure that you are using connection pooling where appropriate to connect to the backend, and consider upping the number of idle threads that are available to your app to handle connections (not sure how you do this on WebLogic; Tomcat has a per-connector thread pool or a general executor thread pool). If you don't, brand new threads may be spawned to process each request.
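As a rough illustration of why a bounded pool helps (a sketch with plain java.util.concurrent, not WebLogic-specific configuration; the pool size of 50 is just an example value):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSketch {
    public static void main(String[] args) {
        // Unbounded: every burst of requests can spawn brand new threads,
        // which can push the LWP count (and potentially the load) up.
        ExecutorService unbounded = Executors.newCachedThreadPool();

        // Bounded: at most 50 worker threads (example value); extra work
        // queues up instead of creating more threads.
        ExecutorService bounded = Executors.newFixedThreadPool(50);

        unbounded.shutdown();
        bounded.shutdown();
    }
}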
As to performance, you need to nail down which part of your app is suffering. Is it the processing happening on the WebLogic/Java side of things, the database access, DNS lookups (if they're being done for some reason...), network issues, or something at the OS level?
99% of the time it will be your code and how it talks to the database that is holding things up. Then it will be configuration of the web app. Past this point you will be working on squeezing the last milliseconds out of your app or looking at providing higher concurrency with the same hardware. For this finer grained performance tuning you need metrics.
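Before reaching for a full profiler, even crude timing around the suspect calls will tell you where the milliseconds are going. A minimal sketch; the label and the Callable body are placeholders for your own database or remote calls:

import java.util.concurrent.Callable;

public class CrudeTimer {
    // Run a unit of work and print how long it took.
    static <T> T time(String label, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            System.out.println(label + " took "
                    + (System.nanoTime() - start) / 1000000L + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder for e.g. a JDBC query or a remote call.
        time("example work", new Callable<Object>() {
            public Object call() throws Exception {
                Thread.sleep(120);
                return null;
            }
        });
    }
}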
For Java I'd suggest installing Java Melody. It can provide a lot of info about what your program is doing and help narrow down where it is spending its time. I've only used it with Tomcat, but it should work fine with any Java EE container/servlet engine.
There are a number of ways you can tune Java, so take a look at their performance guidelines (I'm sure you probably have) and make sure you're setting a heap size etc. suitable for your program. Java Melody can help you track down how much of the Java heap you're consuming, as well as how hard the garbage collector is working and how often it interrupts your program to clear objects.
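If you can't install anything straight away, the JVM's own MXBeans give a quick, cruder view of heap use and GC activity. A small sketch you could run inside the application (e.g. from a debug page); the MB formatting is just for readability:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapPeek {
    public static void main(String[] args) {
        // Current heap usage of this JVM.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used: " + heap.getUsed() / (1024 * 1024) + " MB"
                + " / max: " + heap.getMax() / (1024 * 1024) + " MB");

        // One line per collector: how often it ran and how long it spent in total.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}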
I hope that has been helpful. If you provide any more information, I may be able to update this answer and hone it more towards your needs.
With some further investigation, it appears that the performance problem is mostly due to a high number of network calls between two systems (Oracle SSXA and UCM). The calls are quick but numerous and serialized, hence the low CPU usage (mostly waiting for I/O), the high load average (many calls waiting to be processed) and, above all, the long response times (the many small response times add up).
(2) How should one understand the load average on the first line ("load average: 14.04, 14.02, 14.00")?
This article on load average uses a nice traffic analogy and is the best one I've found so far: Understanding Linux CPU Load - when should you be worried? In your case, as people pointed out:
On a multi-processor system, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.
So, with a load average of 14.00 and 24 cores, your server is far from being overloaded.
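If you want that per-core figure programmatically from a JVM running on the box, a small sketch (getSystemLoadAverage() returns the 1-minute average, or -1 where the platform doesn't expose it):

import java.lang.management.ManagementFactory;

public class LoadPerCore {
    public static void main(String[] args) {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();
        // Roughly: values around 1.0 per core mean the run queue matches the core count.
        System.out.println("load " + load + " over " + cores + " cores = "
                + (load / cores) + " per core");
    }
}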
References:
http://www.linuxjournal.com/article/9001
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
[ lucene extension ] How spellChecker works - MR-fox - 博客园
A brief overview of the spellchecker in Lucene
Lucene's extension (contrib) packages include a spellchecker. With it we can easily implement a spell-checking feature, but the quality of the results (how accurate the suggestions are) needs to be tuned and optimized by the developer.
Steps to implement "spell checking" with Lucene
Step 1: build the index required by the spellchecker
The spellchecker is itself implemented on top of a Lucene index; it just uses a special tokenization scheme and a special relevance calculation.
The content for the spellchecker index can be supplied by a plain text file, one phrase per line, much like a dictionary.
For example (dic.txt):
麻辣烫
中文测试
麻辣酱
麻辣火锅
中国人
中华人民共和国
The key code for building the spellchecker index is as follows:
/**
 * Create the index used by the spellchecker from a dictionary file.
 *
 * @param spellIndexPath path of the spellchecker index
 * @param idcFilePath    path of the original dictionary file
 * @throws IOException
 */
public void createSpellIndex(String spellIndexPath, String idcFilePath)
        throws IOException {
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, null);
    spellChecker.indexDictionary(new PlainTextDictionary(new File(idcFilePath)),
            config, false);
    // close
    spellIndexDir.close();
    spellChecker.close();
}
Here we use a PlainTextDictionary object, which is one of several implementations of the Dictionary interface.
Besides PlainTextDictionary (one word per line), we can also use:
- FileDictionary (one entry per line, optionally followed by a tab-separated integer value)
- LuceneDictionary (terms taken from a given field of an existing Lucene index)
- HighFrequencyDictionary (like LuceneDictionary, but a term is only accepted if it appears in more than a given threshold number of documents)
For example, to use a LuceneDictionary, the main code is as follows:
/**
 * Create the spellchecker index from the dictionary of an existing index.
 *
 * @param oriIndexPath   path of the original index
 * @param fieldName      the index field to take the dictionary from
 * @param spellIndexPath path of the spellchecker index
 * @throws IOException
 */
public void createSpellIndex(String oriIndexPath, String fieldName,
        String spellIndexPath) throws IOException {
    IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(oriIndexPath)));
    LuceneDictionary dict = new LuceneDictionary(oriIndex, fieldName);
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, null);
    spellChecker.indexDictionary(dict, config, true);
}
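If you want to keep only frequent terms, a HighFrequencyDictionary can be dropped in instead. The sketch below mirrors the method above; the method name and the 0.01f threshold (a term must appear in at least 1% of the documents) are illustrative values, not from the original post:

/**
 * Sketch only: same structure as the method above, but terms are taken from a
 * HighFrequencyDictionary so that rare terms are filtered out.
 */
public void createSpellIndexFromFrequentTerms(String oriIndexPath, String fieldName,
        String spellIndexPath) throws IOException {
    IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(oriIndexPath)));
    // 0.01f is an example threshold: the fraction of documents a term must occur in.
    HighFrequencyDictionary dict = new HighFrequencyDictionary(oriIndex, fieldName, 0.01f);
    Directory spellIndexDir = FSDirectory.open(new File(spellIndexPath));
    SpellChecker spellChecker = new SpellChecker(spellIndexDir);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, null);
    spellChecker.indexDictionary(dict, config, true);
}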
After building the index from dic.txt, we can take a closer look at its internal documents and terms:
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣烫>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中文测试>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣酱>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:麻辣火锅>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中国人>>
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLY<word:中华人民共和国>>

end1:人 end1:烫 end1:试 end1:酱 end1:锅
end2:国人 end2:测试 end2:火锅 end2:辣烫 end2:辣酱
end3:共和国 end4:民共和国
gram1:中 gram1:人 gram1:国 gram1:文 gram1:测 gram1:火 gram1:烫 gram1:试 gram1:辣 gram1:酱 gram1:锅 gram1:麻
gram2:中国 gram2:中文 gram2:国人 gram2:文测 gram2:测试 gram2:火锅 gram2:辣火 gram2:辣烫 gram2:辣酱 gram2:麻辣
gram3:中华人 gram3:人民共 gram3:共和国 gram3:华人民 gram3:民共和
gram4:中华人民 gram4:人民共和 gram4:华人民共 gram4:民共和国
start1:中 start1:麻
start2:中国 start2:中文 start2:麻辣
start3:中华人 start4:中华人民
word:中华人民共和国 word:中国人 word:中文测试 word:麻辣火锅 word:麻辣酱 word:麻辣烫
As you can see, each phrase (each line of dic.txt) becomes a document, which is then tokenized in a special way; the field names look a little odd, e.g. end1, end2, gram1, gram2 and so on.
Why is it done this way, and on what principle? Let's leave that question aside for now and come back to it after seeing the results.
Step 2: the spellchecker's suggestions
Using the index created in step 1, we perform the spell check with the spellChecker.suggestSimilar method. The full code is as follows:
package com.fox.lab;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * @author huangfox
 * @createDate 2012-2-16
 * @eMail [email protected]
 */
public class DidYouMeanSearcher {

    SpellChecker spellChecker = null;
    LuceneDictionary dict = null;

    /**
     * @param spellCheckIndexPath location of the spellChecker index
     */
    public DidYouMeanSearcher(String spellCheckIndexPath, String oriIndexPath,
            String fieldName) {
        Directory directory;
        try {
            directory = FSDirectory.open(new File(spellCheckIndexPath));
            spellChecker = new SpellChecker(directory);
            IndexReader oriIndex = IndexReader.open(FSDirectory.open(new File(oriIndexPath)));
            dict = new LuceneDictionary(oriIndex, fieldName);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Set the accuracy; the default is 0.5.
     *
     * @param v
     */
    public void setAccuracy(float v) {
        spellChecker.setAccuracy(v);
    }

    /**
     * Spell-check the given query string.
     *
     * @param queryString       the query string
     * @param suggestionsNumber the maximum number of suggestions
     * @return
     */
    public String[] search(String queryString, int suggestionsNumber) {
        String[] suggestions = null;
        try {
            // if (exist(queryString))
            //     return null;
            suggestions = spellChecker.suggestSimilar(queryString, suggestionsNumber);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return suggestions;
    }

    private boolean exist(String queryString) {
        Iterator<String> ite = dict.getWordsIterator();
        while (ite.hasNext()) {
            if (ite.next().equals(queryString))
                return true;
        }
        return false;
    }
}
Testing the result:
package com.fox.lab;

import java.io.IOException;

public class DidYouMeanMainApp {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // create the index
        DidYouMeanIndexer indexer = new DidYouMeanIndexer();
        String spellIndexPath = "D:\\spellchecker";
        String idcFilePath = "D:\\dic.txt";
        String oriIndexPath = "D:\\solrHome\\example\\solr\\data\\index";
        String fieldName = "ab";
        DidYouMeanSearcher searcher = new DidYouMeanSearcher(spellIndexPath,
                oriIndexPath, fieldName);
        searcher.setAccuracy(0.5f);
        int suggestionsNumber = 15;
        String queryString = "麻辣将";
        // try {
        //     indexer.createSpellIndex(spellIndexPath, idcFilePath);
        //     indexer.createSpellIndex(oriIndexPath, fieldName, spellIndexPath);
        // } catch (IOException e) {
        //     e.printStackTrace();
        // }
        String[] result = searcher.search(queryString, suggestionsNumber);
        if (result == null || result.length == 0) {
            System.out.println("我不知道你要什么,或许你就是对的!");
        } else {
            System.out.println("你是不是想找:");
            for (int i = 0; i < result.length; i++) {
                System.out.println(result[i]);
            }
        }
    }
}
Output (the "你是不是想找:" line means "Did you mean:"):
你是不是想找:
麻辣酱
麻辣火锅
麻辣烫
Changing queryString to "中文测式" produces:
你是不是想找:
中文测试
When the input is already correct, e.g. "中文测试", the output is the "I don't know what you want, perhaps you're right" message:
我不知道你要什么,或许你就是对的!
The basic spell-checking functionality now works, although there are still plenty of issues to improve and tune. Let's first look at two of the underlying principles.
Principle 1: N-grams
Put simply, implementing a spellchecker means comparing the phrase the user typed (a word in English, a phrase in Chinese) against the "standard" phrases in the dictionary and returning the ones with the highest similarity.
So the key question for a spellchecker is how to measure the similarity of two strings.
An N-gram of a string P is any substring of P of length N. For example, the word waist has four bigrams: wa, ai, is and st. For two strings P and W, the N-gram similarity gram-count(P, W) is defined as the number of N-grams that occur in both P and W. Lucene's spellchecker extends this idea and also indexes the whole word and the word's beginning and end. For example, 麻辣烤翅 is decomposed into:
start2:麻辣 start3:麻辣烤 end2:烤翅 end3:辣烤翅 gram2:烤翅 gram2:辣烤 gram2:麻辣 gram3:辣烤翅 gram3:麻辣烤 word:麻辣烤翅
When the user types "麻辣靠翅", it is decomposed into:
end2:靠翅 end3:辣靠翅 gram2:靠翅 gram2:辣靠 gram2:麻辣 gram3:辣靠翅 gram3:麻辣靠 start2:麻辣 start3:麻辣靠 word:麻辣靠翅
These terms are combined into a query joined by OR (different terms may be given different weights) and run against the spellchecker index, which matches the document "麻辣烤翅". But should that document actually be suggested? That also depends on whether the similarity between the two strings meets the required threshold. In Lucene's spellchecker the default accuracy is 0.5.
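To make the raw n-gram overlap concrete, here is a small standalone sketch (not Lucene code; the class and method names are made up for illustration) that counts how many n-grams two strings share:

import java.util.HashSet;
import java.util.Set;

// Illustrative only: a tiny gram-count(P, W) implementation, i.e. the number
// of n-grams that occur in both strings.
public class GramOverlap {

    // Collect all substrings of length n.
    static Set<String> grams(String s, int n) {
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    static int gramCount(String p, String w, int n) {
        Set<String> shared = grams(p, n);
        shared.retainAll(grams(w, n));
        return shared.size();
    }

    public static void main(String[] args) {
        // 麻辣烤翅 and 麻辣靠翅 share only the bigram 麻辣, so this prints 1;
        // the whole-word, start and end fields indexed by Lucene add extra
        // signals on top of this raw overlap.
        System.out.println(gramCount("麻辣烤翅", "麻辣靠翅", 2));
    }
}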
The n-gram tokenization algorithm in Lucene's spellchecker is as follows:
private static void addGram(String text, Document doc, int ng1, int ng2) {
    int len = text.length();
    for (int ng = ng1; ng <= ng2; ng++) {
        String key = "gram" + ng;
        String end = null;
        for (int i = 0; i < len - ng + 1; i++) {
            String gram = text.substring(i, i + ng);
            Field ngramField = new Field(key, gram, Field.Store.NO,
                    Field.Index.NOT_ANALYZED);
            // spellchecker does not use positional queries, but we want freqs
            // for scoring these multivalued n-gram fields.
            ngramField.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
            doc.add(ngramField);
            if (i == 0) {
                // only one term possible in the startXXField, TF/pos and norms aren't needed.
                Field startField = new Field("start" + ng, gram, Field.Store.NO,
                        Field.Index.NOT_ANALYZED);
                startField.setIndexOptions(IndexOptions.DOCS_ONLY);
                startField.setOmitNorms(true);
                doc.add(startField);
            }
            end = gram;
        }
        if (end != null) { // may not be present if len==ng1
            // only one term possible in the endXXField, TF/pos and norms aren't needed.
            Field endField = new Field("end" + ng, end, Field.Store.NO,
                    Field.Index.NOT_ANALYZED);
            endField.setIndexOptions(IndexOptions.DOCS_ONLY);
            endField.setOmitNorms(true);
            doc.add(endField);
        }
    }
}
Principle 2: similarity calculation (StringDistance)
In Lucene's spellchecker, StringDistance is an interface with three implementations:
- JaroWinklerDistance
- LevensteinDistance
- NGramDistance
Here we use LevensteinDistance for the string similarity calculation. LevensteinDistance is simply the edit distance.
The edit distance, also called the Levenshtein distance, is the minimum number of edit operations needed to turn one string into the other. The allowed operations are replacing one character with another, inserting a character and deleting a character.
For example, turning kitten into sitting takes three edits:
kitten → sitten (k→s)
sitten → sittin (e→i)
sittin → sitting (insert g)
The concept was introduced by the Russian scientist Vladimir Levenshtein in 1965.
The algorithm as implemented in Lucene is as follows:
public float getDistance(String target, String other) {
    char[] sa;
    int n;
    int p[];  // 'previous' cost array, horizontally
    int d[];  // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    /*
       The difference between this impl. and the previous is that, rather
       than creating and retaining a matrix of size s.length()+1 by
       t.length()+1, we maintain two single-dimensional arrays of length
       s.length()+1.  The first, d, is the 'current working' distance array
       that maintains the newest distance cost counts as we iterate through
       the characters of String s.  Each time we increment the index of
       String t we are comparing, d is copied to p, the second int[].  Doing
       so allows us to retain the previous cost counts as required by the
       algorithm (taking the minimum of the cost count to the left, up one,
       and diagonally up and to the left of the current cost count being
       calculated).  (Note that the arrays aren't really copied anymore,
       just switched...this is clearly much better than cloning an array or
       doing a System.arraycopy() each time through the outer loop.)

       Effectively, the difference between the two implementations is this
       one does not cause an out of memory condition when calculating the LD
       over two very large strings.
    */

    sa = target.toCharArray();
    n = sa.length;
    p = new int[n + 1];
    d = new int[n + 1];

    final int m = other.length();
    if (n == 0 || m == 0) {
        if (n == m) {
            return 1;
        } else {
            return 0;
        }
    }

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    for (i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (j = 1; j <= m; j++) {
        t_j = other.charAt(j - 1);
        d[0] = j;

        for (i = 1; i <= n; i++) {
            cost = sa[i - 1] == t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return 1.0f - ((float) p[n] / Math.max(other.length(), sa.length));
}
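As a quick sanity check of how this normalized score interacts with the accuracy threshold, here is a small sketch (assuming the same Lucene 3.x spell module used throughout this post; the class name is made up):

import org.apache.lucene.search.spell.LevensteinDistance;

// Illustrative only: relate the normalized Levenshtein score to the
// SpellChecker accuracy threshold.
public class DistanceCheck {
    public static void main(String[] args) {
        LevensteinDistance ld = new LevensteinDistance();
        // 麻辣将 vs 麻辣酱: one substitution over length 3,
        // so the score is 1 - 1/3 ≈ 0.667.
        float sim = ld.getDistance("麻辣将", "麻辣酱");
        System.out.println(sim);
        // SpellChecker only keeps suggestions whose score is >= the value set
        // via setAccuracy(...) (0.5 by default), which is why 麻辣酱 was
        // suggested for 麻辣将 in the test above.
    }
}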
Areas that need improvement
1. The precision is not high, especially for two-character phrases; the distance (similarity) calculation could be adjusted.
2. There is no pinyin support; for example, 麻辣kao翅 cannot be corrected.
3. Errors inside a longer string cannot be corrected, e.g. "常州哪里有卖变态麻辣靠翅".
Using Lucene's MoreLikeThisQuery to implement related-content recommendations
Before analyzing MoreLikeThisQuery, let's first introduce MoreLikeThis.
When building search applications we often run into requirements such as "more similar articles" or "more related questions", i.e. finding similar documents in the index based on the text of the current document.
We can implement this with MoreLikeThis:
IndexReader reader = IndexReader.open(……);
IndexSearcher searcher = new IndexSearcher(reader);
MoreLikeThis mlt = new MoreLikeThis(reader);
Reader target = ...                        // an io Reader pointing to the text of the current document
Query query = mlt.like(target);            // build a query object from the current text
Hits hits = searcher.search(query);        // search for similar documents
MoreLikeThis's Query like(Reader r) method looks like this:
public Query like(Reader r) throws IOException {
    // First extract terms from the text of the current document,
    // then build a query object from those terms.
    return createQuery(retrieveTerms(r));
}
public PriorityQueue<Object[]> retrieveTerms(Reader r) throws IOException {
    Map<String, Int> words = new HashMap<String, Int>();
    // Extract terms from the configured fields; which fields are used can be
    // set with void setFieldNames(String[] fieldNames).
    for (int i = 0; i < fieldNames.length; i++) {
        String fieldName = fieldNames[i];
        addTermFrequencies(r, words, fieldName);
    }
    // Put the extracted terms into a priority queue.
    return createQueue(words);
}
private void addTermFrequencies(Reader r, Map<String, Int> termFreqMap, String fieldName)
        throws IOException {
    // First tokenize the current text; the analyzer can be set with void setAnalyzer(Analyzer analyzer).
    TokenStream ts = analyzer.tokenStream(fieldName, r);
    int tokenCount = 0;
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    // Iterate over each token.
    while (ts.incrementToken()) {
        String word = termAtt.term();
        tokenCount++;
        // Stop if the number of tokens exceeds the configured limit,
        // set with void setMaxNumTokensParsed(int i).
        if (tokenCount > maxNumTokensParsed) {
            break;
        }
        // A token counts as noise if it is shorter than the minimum length,
        // longer than the maximum length, or a stop word.
        // Minimum length: void setMinWordLen(int minWordLen).
        // Maximum length: void setMaxWordLen(int maxWordLen).
        // Stop words: void setStopWords(Set<?> stopWords).
        if (isNoiseWord(word)) {
            continue;
        }
        // Count the term frequency (tf).
        Int cnt = termFreqMap.get(word);
        if (cnt == null) {
            termFreqMap.put(word, new Int());
        } else {
            cnt.x++;
        }
    }
}
private PriorityQueue createQueue(Map<String, Int> words) throws IOException {
    // Build a priority queue from the collected terms and their frequencies.
    int numDocs = ir.numDocs();
    FreqQ res = new FreqQ(words.size()); // priority queue, ordered by tf*idf
    Iterator<String> it = words.keySet().iterator();
    // Iterate over each term.
    while (it.hasNext()) {
        String word = it.next();
        int tf = words.get(word).x;
        // Skip the term if its frequency is below the minimum term frequency,
        // set with void setMinTermFreq(int minTermFreq).
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue;
        }
        // Over all fields, find the field that contains this term with the
        // largest document frequency.
        String topField = fieldNames[0];
        int docFreq = 0;
        for (int i = 0; i < fieldNames.length; i++) {
            int freq = ir.docFreq(new Term(fieldNames[i], word));
            topField = (freq > docFreq) ? fieldNames[i] : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        // Skip the term if its document frequency is below the minimum,
        // set with void setMinDocFreq(int minDocFreq).
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue;
        }
        // Skip the term if its document frequency is above the maximum,
        // set with void setMaxDocFreq(int maxFreq).
        if (docFreq > maxDocFreq) {
            continue;
        }
        if (docFreq == 0) {
            continue;
        }
        // Compute the score tf*idf.
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        // Put an Object[] into the priority queue; only the first three entries
        // are used later, and the queue is ordered by the third entry (score).
        res.insertWithOverflow(new Object[] {
                word,                     // the term
                topField,                 // the field
                Float.valueOf(score),     // the score
                Float.valueOf(idf),       // idf
                Integer.valueOf(docFreq), // document frequency
                Integer.valueOf(tf)       // term frequency
        });
    }
    return res;
}
private Query createQuery(PriorityQueue q) {
    // The final query is a BooleanQuery.
    BooleanQuery query = new BooleanQuery();
    Object cur;
    int qterms = 0;
    float bestScore = 0;
    // Keep popping the highest-scoring term off the queue.
    while (((cur = q.pop()) != null)) {
        Object[] ar = (Object[]) cur;
        TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
        if (boost) {
            if (qterms == 0) {
                // The first term has the highest score; keep it as bestScore.
                bestScore = ((Float) ar[2]).floatValue();
            }
            float myScore = ((Float) ar[2]).floatValue();
            // Each term's score is divided by the best score and multiplied by
            // boostFactor to obtain the boost of its term query, so terms that
            // score higher in the current document also get a higher boost in
            // the generated query and play a more important role.
            tq.setBoost(boostFactor * myScore / bestScore);
        }
        try {
            query.add(tq, BooleanClause.Occur.SHOULD);
        } catch (BooleanQuery.TooManyClauses ignore) {
            break;
        }
        qterms++;
        // Stop once the configured maximum number of query terms is reached,
        // set with void setMaxQueryTerms(int maxQueryTerms).
        if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
            break;
        }
    }
    return query;
}
MoreLikeThisQuery is just a wrapper around MoreLikeThis: it holds the parameters MoreLikeThis needs and, when the query is rewritten, lets MoreLikeThis.like generate the actual query object. Its main fields are:
- String likeText; the text of the current document
- String[] moreLikeFields; the fields from which query terms are extracted
- Analyzer analyzer; the analyzer used for tokenization
- float percentTermsToMatch=0.3f; the generated BooleanQuery clauses are all SHOULD, and this is the minimum fraction of them that must match
- int minTermFrequency=1; the minimum term frequency
- int maxQueryTerms=5; the maximum number of query terms
- Set<?> stopWords=null; the stop-word set
- int minDocFreq=-1; the minimum document frequency
public Query rewrite(IndexReader reader) throws IOException {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(moreLikeFields);
    mlt.setAnalyzer(analyzer);
    mlt.setMinTermFreq(minTermFrequency);
    if (minDocFreq >= 0) {
        mlt.setMinDocFreq(minDocFreq);
    }
    mlt.setMaxQueryTerms(maxQueryTerms);
    mlt.setStopWords(stopWords);
    BooleanQuery bq = (BooleanQuery) mlt.like(new ByteArrayInputStream(likeText.getBytes()));
    BooleanClause[] clauses = bq.getClauses();
    bq.setMinimumNumberShouldMatch((int) (clauses.length * percentTermsToMatch));
    return bq;
}
As an example, take the thread at http://topic.csdn.net/u/20100501/09/64e41f24-e69a-40e3-9058-17487e4f311b.html?1469.
For the test, we index the posts from its related questions together with some others, 20 documents in total.
File indexDir = new File("TestMoreLikeThisQuery/index");
IndexReader reader = IndexReader.open(indexDir);
IndexSearcher searcher = new IndexSearcher(reader);
// Use 《IT外企那点儿事》 as the likeText, read from a file.
StringBuffer contentBuffer = new StringBuffer();
BufferedReader input = new BufferedReader(new InputStreamReader(
        new FileInputStream("TestMoreLikeThisQuery/IT外企那点儿事.txt"), "utf-8"));
String line = null;
while ((line = input.readLine()) != null) {
    contentBuffer.append(line);
}
String content = contentBuffer.toString();
// Tokenize with the Chinese Academy of Sciences (ICTCLAS) analyzer.
MoreLikeThisQuery query = new MoreLikeThisQuery(content, new String[]{"contents"},
        new MyAnalyzer(new ChineseAnalyzer()));
// Treat words contained in 70% or more of the documents as stop words;
// in a real application other stop-word strategies can be used.
query.setStopWords(getStopWords(reader));
// Only words that occur at least 5 times are considered important.
query.setMinTermFrequency(5);
// Use only one of them.
query.setMaxQueryTerms(1);
TopDocs docs = searcher.search(query, 50);
for (ScoreDoc doc : docs.scoreDocs) {
    Document ldoc = reader.document(doc.doc);
    String title = ldoc.get("title");
    System.out.println(title);
}
static Set<String> getStopWords(IndexReader reader) throws IOException {
    HashSet<String> stop = new HashSet<String>();
    int numOfDocs = reader.numDocs();
    int stopThreshhold = (int) (numOfDocs * 0.7f);
    TermEnum te = reader.terms();
    while (te.next()) {
        String text = te.term().text();
        if (te.docFreq() >= stopThreshhold) {
            stop.add(text);
        }
    }
    return stop;
}
The results are:
揭开外企的底儿(连载六)——外企招聘也有潜规则.txt
去央企还是外企,帮忙分析下.txt
哪种英语教材比较适合英语基础差的人.txt
有在达内外企软件工程师就业班培训过的吗.txt
两个月的“骑驴找马”,面试无数家公司的深圳体验.txt
一个看了可能改变你一生的小说《做单》,外企销售经理做单技巧大揭密.txt
HR的至高机密:20个公司绝对不会告诉你的潜规则.txt