Solr Platform Search in Practice: Must-Know Scenarios

Tags: Uncategorized | Published: 2012-09-21 10:06 | Author: yingyuan
Source: http://rdc.taobao.com/team/jm


[Note]


This page is a personal digest of the many requirements I came across on mailing lists and in my own work on making search platform-based and general-purpose. To steer clear of confidentiality and non-compete obligations, the representative requirements were deliberately collected and summarized from public sources on the web. It is meant as a reference for Solr search "strategies" and for query solutions in search applications; however, for performance issues, especially with advanced usage on large data volumes, always run load tests so you know where you stand.


The methods given here are almost all based on Solr's interfaces and configuration; they are not detailed explanations of deep customization. For experience with deep customization you will not find answers here; if you are interested, feel free to discuss privately.


This digest is only meant to start the discussion; the individual points have not been argued systematically or exhaustively. The content comes mostly from the web; the overall direction and main points are sound, but if you find mistakes in the details, please point them out. Thank you!


Contents

1. 3.4.0 scoring issues
2. Configuration
3. Problems and requirements
4. Payload issues
5. Custom sort (score + custom value)
6. BoostQParserPlugin
7. how can I limit by score before sorting in a solr query
8. Score filter
9. Boost score for early matches
10. Solr: How can I get all documents ordered by score with a list of keywords?
11. Solr changes document's score when its random field value altered
12. Relevance Customization
13. Modify SOLR scoring
14. Change order before returning data
15. limiting the total number of documents matched
17

 

3.4.0 scoring issues

(7) The scoring factors can be tuned, but adding new scoring factors or extending the scoring formula cannot be plugged in directly through Solr configuration. However, you can extend the Lucene code or parameters (e.g. SpanQuery), write a new Query, and plug it into Solr; that takes somewhat more work. In addition, the community provides patches for rankings such as BM25 and PageRank; once you have some understanding of Lucene, you can use them directly.

 

(16) For sorting, there is no ready-made support for deduplication or for time-based dynamics. Deduplication here means that among the top few results some field value, or several field values, may be completely identical, so the leading results show a kind of "clustering" on related fields, which is not ideal for some applications.

Time-based dynamics is not directly supported either and can only be approximated indirectly by sorting on time. Strictly speaking this is probably not something Lucene or Solr needs to address; it comes from the particularities of the application.


Configuration


Global configuration in schema.xml

Similarity

A (global) <similarity> declaration can be used to specify a custom Similarity implementation that you want Solr to use when dealing with your index. A Similarity can be specified either by referring directly to the name of a class with a no-arg constructor…

 

 

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>

…or by referencing a
SimilarityFactory implementation, which may take
optional init params….

<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>

Beginning with Solr 4.0, Similarity factories such as SchemaSimilarityFactory can also support specifying specific Similarity implementations on individual field types…

 

<types>
  <fieldType name="text_dfr" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">I(F)</str>
      <str name="afterEffect">B</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>

  <fieldType name="text_ib" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.IBSimilarityFactory">
      <str name="distribution">SPL</str>
      <str name="lambda">DF</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>
  ...
</types>

<similarity class="solr.SchemaSimilarityFactory"/>

If no (global) <similarity> is configured in the schema.xml file, an implicit instance of DefaultSimilarityFactory is used.

 


Problems and requirements

Ranking options:

- By DefaultComputerValue
- By CustomScore, then by DefaultComputerValue
- By CustomScore*fa + DefaultComputerValue*fb

Example with fa = 0.8, fb = 0.2 (values given as CustomScore\DefaultComputerValue):

Doc1  10\100  10*0.8 + 100*0.2 = 28
Doc2   1\99    1*0.8 +  99*0.2 = 20.6
Doc3   3\98    3*0.8 +  98*0.2 = 22
Doc4  20\50   20*0.8 +  50*0.2 = 26
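
As a hedged sketch of how such a weighted combination might be expressed without custom code (my illustration, not from the original notes; the field name custom_score and the parameter qq are placeholders), a function query can mix a precomputed per-document value with the relevance score of the user query using the weights from the table above:

  q={!func}sum(product(0.8, custom_score), product(0.2, query($qq)))
  qq=the original user query
  fl=*,score

Note that a pure function query like this matches every document for which the function can be evaluated, so in practice you would usually add an fq restricting the result set to documents that actually match the user query.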

 

Solr 3.4.0 scoring code analysis

abstract class
SimilarityFactory


Key method: public abstract Similarity getSimilarity();

 

Payload issues

http://wiki.apache.org/lucene-java/Payloads

Scoring payloads involves
overriding the Similarity.scorePayload() method. For example, if
one has implemented storing a Float payload, it could be used for
scoring in the following way:

  public float scorePayload(byte [] payload, int offset, int length) {
    assert length == 4;
    int accum = ((payload[0+offset]&0xff)) |
                ((payload[1+offset]&0xff)<<8) |
                ((payload[2+offset]&0xff)<<16)  |
                ((payload[3+offset]&0xff)<<24);
    return Float.intBitsToFloat(accum);
  }

Don't forget to activate your Similarity implementation using IndexSearcher.setSimilarity(). Also, note that even then not all queries will actually make use of your method. For example, you will need to use BoostingTermQuery instead of TermQuery. QueryParser currently (Lucene 2.3.2) always uses TermQuery, so you will need to extend QueryParser and override getFieldQuery().
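
A minimal sketch of that QueryParser extension, assuming the Lucene 2.x-era API the wiki refers to (the class name PayloadAwareQueryParser is made up for illustration, and the exact package of BoostingTermQuery can differ between Lucene versions):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.payloads.BoostingTermQuery;

  public class PayloadAwareQueryParser extends QueryParser {
    public PayloadAwareQueryParser(String field, Analyzer analyzer) {
      super(field, analyzer);
    }

    @Override
    protected Query getFieldQuery(String field, String queryText) throws ParseException {
      Query q = super.getFieldQuery(field, queryText);
      // Only plain single-term queries are rewritten; phrase queries etc. pass through unchanged.
      if (q instanceof TermQuery) {
        return new BoostingTermQuery(((TermQuery) q).getTerm());
      }
      return q;
    }
  }

At search time, remember to also set the custom Similarity on the IndexSearcher as noted above.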

Note that this is just one possible way of scoring a payload. Payloads are application specific. For examples of payload TokenFilters, see the payload package in the contrib/analysis module.

Custom sort (score + custom
value)

http://grokbase.com/t/lucene/solr-user/08b25j6ked/custom-sort-score-custom-value

Hi,

I want to implement a custom sort in
Solr based on a combination of relevance (Solr gives me it yet
=> score) and a custom value I’ve calculated
previously for each document. I see two options:

1. Use a function query (I’m using a
DisMaxRequestHandler).
2. Create a component that set SortSpec with a sort that has a
custom
ComparatorSource (similar to QueryElevationComponent).

The first option has the problem:
While the relevance value changes for
every query, my custom value is constant for each doc. It implies
queries
with documents that have high relevance are less affected with my
custom
value. On the other hand, queries with low relevance are affected a
lot with my custom value. Can it be proportional with a function
query? (i.e. docs with low relevance are less affected by my custom
value).

 

The second option has the problem:
Solr score isn’t normalized. I need it normalized in order to apply
my custom value in the sortValue function in
ScoreDocComparator.What do you think? What’s the best option in
that case? Another option?

Thank you in advance,

George

BoostQParserPlugin

http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/search/BoostQParserPlugin.html

org.apache.solr.search

Class
BoostQParserPlugin

 

http://stackoverflow.com/questions/3035831/solr-lucene-scorer

Scorers are part of Lucene Queries, obtained via the Query's 'weight' method.

In short, the framework
calls Query.weight(..).scorer(..) . Have a look at

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Weight.html

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Scorer.html

To use your own Query class
in Solr, you’ll need to implement your own solr QueryParserPlugin
that uses your own QParser that generates your previously
implemented lucene Query. You then can use it in Solr specified
here:

http://wiki.apache.org/solr/SolrPlugins#QParserPlugin

This part of the implementation should stay simple, as it is just some gluing code.

Enjoy hacking
Solr!


answered Jun 14 ’10 at
10:33
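
As a rough illustration of that glue code (not from the original answer; the class name is hypothetical and the API shown assumes the Solr 4.x signatures, where parse() throws SyntaxError), such a QParserPlugin might look like this, with a plain TermQuery standing in for your own Lucene Query:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class MyQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {
      // no init params needed for this sketch
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          // Build and return your previously implemented Lucene Query here;
          // a plain TermQuery is used only to keep the sketch self-contained.
          return new TermQuery(new Term("text", qstr));
        }
      };
    }
  }

It would then be registered in solrconfig.xml with something like <queryParser name="myparser" class="com.example.MyQParserPlugin"/> and invoked as q={!myparser}your query.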

 

 

You can override the logic
solr scorer uses. Solr uses DefaultSimilarity class for scoring. 1)
make a class extending DefaultSimilarity. 2) override the functions
tf(), idf() etc according to your need.

  public class CustomSimilarity extends DefaultSimilarity {

    public CustomSimilarity() {
      super();
    }

    public float tf(int freq) {
      // your code
      return (float) 1.0;
    }

    public float idf(int docFreq, int numDocs) {
      // your code
      return (float) 1.0;
    }
  }

3) After creating the class, compile it and make a jar. 4) Put the jar in the lib folder of the corresponding index or core. 5) Change the schema.xml of the corresponding index to reference it, e.g. <similarity class="CustomSimilarity"/> (use the fully qualified class name).

You can check out various
factors affecting score here

For your requirement you can
create buckets if your score is in specific range. Also read about
field boosting, document boosting etc. That might be helpful in
your case.

 

http://stackoverflow.com/questions/11748487/how-can-i-filter-solr-results-by-custom-score


How can I filter SOLR results by custom score

I’m using solr function
queries to generate my own custom score. I achieve this using
something along these lines:

   q=_val_:"my_custom_function()"

This populates the score
field as expected, but it also includes documents that score 0. I
need a way to filter the results so that scores below zero are not
included.

I realize that I’m using
score in a non-standard way and that normally the score that
lucene/solr produce is not absolute. However, producing my own
score works really well for my needs.

I’ve tried using {!frange l=0} but this causes the score for all documents to be "1.0".

I suspect pseudo-fields
could be used, but since solr 4 is still alpha, I’m looking for a
way to do it using Solr 3.1.
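
One workaround, offered here as a suggestion rather than as part of the original question, and consistent with the frange advice later on this page: leave the function query in q so the score is kept, and do the cutoff in a separate fq, since filter queries do not alter scores. The lower bound is an arbitrarily small positive value so that zero-scoring documents are dropped:

  q=_val_:"my_custom_function()"
  fq={!frange l=0.000001}my_custom_function()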


how can I limit by score
before sorting in a solr query

I am searching “product
documents”. In other words, my solr documents are product records.
I want to get say the top 50 matching products for a query. Then I
want to be able to sort the top 50 scoring documents by name or
price. I’m not seeing much on how to do this, since sorting by
score, then by name or price won’t really help, since scores are
floats.

I wouldn’t mind if I could
do something like map the scores to ranges (like a score of
8.0-8.99 would go in the 8 bucket score), then sort by range, then
by names, but since there is basically no normalization to scoring,
this would still make things a bit harder.

Tl;dr How do I exclude low scoring documents from the solr result set before sorting?


asked Dec 7 ’10 at
22:21

 

3
Answers

You can use frange to
achieve this, as long as you don’t want to sort on score (in which
case I guess you could just do the filtering on the client side).
Your query would be something along the lines of:

q={!frange
l=5}query($qq)&qq=[awesome
product]&sort=price asc

Set the l argument in the
q-frange-parameter to the lower bound you want to filter score on,
and replace the qq parameter with your user query.

answered Dec 8 ’10 at
10:23

Karl Johansson


 

thanks, since I can get a
reasonable frange from the first time the results are displayed
sorted by score alone, this works great! – Zak Dec 9 ’10 at
18:40

I don’t think you can simply
exclude low scoring documents from the solr result set before
sorting

because the relevance score
is only meaningful for a given combination of search query and
resulting document list. I.e. scores are only meaningful within a
given search and you cannot set some threshold for all
searches.

If you were using Java (or
PHP) you could get the top 50 documents and then re-sort this list
in your programming language but I don’t think you can do it with
just SOLR.

Anyway, I would recommend
you don’t go down this route of re-sorting the results from SOLR,
as it will simply confuse the user. People expect search results to
be like Google (and most other search engines), where results come
back in some form of TFIDF ranking.

Having said that, you could
use some other criteria to separate documents with the same
relevance scores by adding an index-time boost factor based on a
price range scale.

I’d suggest you use SOLR to
its strengths and use facets. Provide a price range facet on the
left (like Ebay, Amazon, et al.) and/or a product category facet,
etc. Also provide a “sort” widget to allow the results to be sorted
by product name, if the user wants it.

[EDIT] this question might
also be useful:

Digg-like search result
ranking with Lucene / Solr ?

As observed by Karl
Johansson, you could do the filtering on the client side: load the
first 50 rows of the response (sorted by score desc) and then
manipulate them in JS for example.

The jQuery DataTables plugin
works fantastically for that kind of thing: sorting, sorting on
multiple columns, dynamic filtering, etc. — and with only 50 rows
it would be very fast too, so that users can “play” with the
sorting and filtering until they find what they want.

Score filter

http://lucene.472066.n3.nabble.com/score-filter-td493438.html

Hello, Is there a way to set a score filter? I tried
“+score:[1.2 TO *]” but it did not work.
Many thanks,

What’s the motivation for
wanting to do this?  The reason I ask, is
score is a relative thing determined by Lucene based on your index
statistics. 
It is only meaningful for comparing the results of a specific query
with a specific instance of the index.  In other words, it
isn’t useful to filter on b/c there is no way of knowing what a
good cutoff value would be.  So, you won’t be able
to do score:[1.2 TO *] because score is not an actual Field.

 

That being said, you
probably could implement a HitCollector at the Lucene level and
somehow hook it into Solr to do what you want.  Or, of course, just
stop processing the results in your app after you see a score below
a certain value.  Naturally, this still
means you have to retrieve the results.

 

Re: score filter

In my case, for example searching for a book, some of the returned documents have high relevance (score > 3), but some documents with low score (<0.01) are useless.

 

Without a “score filter”, I
have to go through each document to find out the number of
documents I’m interested in (score > nnn). This causes
some problem for pagination.  For example if I only
need to display the first 10 records I need to retrieve all 1000
documents to figure out the number of meaningful documents which
have score > nnn.

Thx,

Kevin

 


Re: score filter

At what point do you draw
the line? 
0.01 is too low, but what about 0.5 or 0.3?  In fact, there may be
queries where 0.01 is relevant.

 

Relevance is a tricky thing
and putting in arbitrary cutoffs is usually not a good thing. An
alternative might be to instead look at the difference between
scores and see if the gap is larger than some delta, but even that
is subject to the vagaries of scoring.

 

What kind of relevance
testing have you done so far to come up with  

those values?  See also

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/

 

Re: score filter

Just did some research. It
seems that it’s doable with additional code added to Solr but not
out of box. Thank you, Grant.

 

 

Re: score filter

Don’t bother doing this. It
doesn’t work.

This seems like a good idea,
something that would be useful for almost every Lucene
installation, but it isn’t in Lucene because it does not work in
the real world.

 

A few problems:

* Some users want every
match and don’t care how many pages of results they look
at.

 

* Some users are very bad at
creating queries that match their information needs. Others are
merely bad, not very bad. The good matches for their query are on
top, but the good matches for

their information need are
on the third page.

 

* Misspellings can put the right match (partial match) at the bottom. I did this yesterday at my library site, typing “Katherine Kerr” instead of the correct “Katharine Kerr”.

Their search engine showed
no matches (grrr), so I had to search again with “Kerr”.

 

* Most users do not know how
to repair their queries, like I did with “Katherine Kerr”, changing
it to “Kerr”. Even if they do, you shouldn’t make them. Just show
the weakly relevant results.

 

* Documents have errors,
just like queries. I find bad data on our site about once a month,
and we have professional editors. We still haven’t fixed our entry
for “Betty Page” to read “Bettie Page”.

 

* People may use non-title
words in the query, like searching for “batman” when they want “The
Dark Knight”.

 

So, don’t do this. If you
are forced to do it, make sure that you measure your search quality
before and after it is implemented, because it will get worse. Then
you can stop doing it.

wunder

 

Re: score filter

+1.  Of course it is doable, but that doesn’t mean you should, which is what I was trying to say before (but I was typing on my iPod so it wasn’t fast) and which Walter has now done. It is entirely conceivable to me that someone could search for a very common word such that the scores of all relevant (and thus, “good”) documents are below your predefined threshold.

 

At any rate, proceed at your
own peril. 
To implement it, look into the SearchComponent
functionality.

 

Re: score filter

Hello Grant,

I need to frame a query that
is a combination of two query parts and I use a ‘function’ query to
prepare the same. Something like:

q={!type=func q.op=AND
df=text}product(query($uq,0.0),query($cq,0.1))

 

where $uq and $cq are two
queries.

 

Now, I want a search result returned only if I get a hit on $uq. So, I specify the default value of the $uq query as 0.0 so that the final score is zero in cases where $uq doesn’t record a hit. Even though the scoring works as expected (i.e., documents that don’t match $uq have a score of zero), all the documents are returned as search results. Is there a way to filter out search results that have a score of zero?

 

Thanks for your
help,

Debdoot

 

Re: score filter

: I need to frame a query
that is a combination of two query parts and I use a ‘function’
query to prepare the same. Something like:

: q={!type=func q.op=AND
df=text}product(query($uq,0.0),query($cq,0.1))

: where $uq and $cq are two
queries.

:

: Now, I want a search
result returned only if I get a hit on $uq. So, I specify default
value of $uq query as 0.0 in order for the final score to be zero
in cases where $uq doesn’t record a hit. Even though, the scoring
works as expected (i.e, document that don’t match $uq have a score
of zero), all  the documents are
returned as search results. Is there a way to filter search results
that have a score of zero?

 

a) you could wrap your query in {!frange} .. but that will make everything that does have a value > 0.0 get the same final score

 

b) you could use an
fq={!frange} that refers back to your original $q

 

c) you could just use an fq that refers directly to your $uq since that’s what you say you actually want to filter on in the first place..

 

uq=…

cq=…

q={!type=func q.op=AND
df=text}product(query($uq,0.0),query($cq,0.1))

fq={!v=uq}

Boost score for early matches


Solr – How to boost score for early matches?


How can I boost the score
for documents in which my query matches a particular field earlier.
For example, searching for “super man” should give “super man
returns” a higher score than “there is my super man”. Is this
possible?

 

Uh, store the first few
words explicitly in another field, and boost matches on this field.
– aitchnyu Aug 22 at 9:45

 

The problem there is that
the size of the query can vary from say 3 characters to say 100
characters, and so determining how many words/chars to index
separately can be difficult. – techfoobar Aug 22 at 9:49

 

Secondly, suppose i index
the first 25 characters, and one record has “my super man blah..”
and another record has “super man returns blah..” – both will match
the query “super man” and both will be boosted when i boost this
secondary field. – techfoobar Aug 22 at 9:50

 

2 Answers

 

Thank you for the answer.
But i solved it today by using the approach i’ve outlined in my
answer. – techfoobar Aug 22 at 18:33

 

But this is not going to
work if the words do not occur at the very start. May want to check
out payloads as well where you can add index-time suggestions as laid
down in the second option. – Jayendra Aug 22 at 18:35

 

Will check that out as well.
However, the current solution can be made to work to a large extent
by fine tuning the ps parameter to make it more lenient. I
currently use 2 (dist between 2 terms in the pf) and it seems to be
working quite well for my medium sized data set (1000s of records,
greatly varying in content). Will check out your point and let you
know if it helped. – techfoobar Aug 22 at 18:38

Solved it myself after reading a LOT about this online. What
specifically helped me was a reply on nabble which goes like (I
used dismax, so explaining that here):


•       Create a separate field named say ‘nameString’ which stores the value as “_START_ ” followed by the original name value

•       Change the search query to “_START_ ” followed by the user’s query


•      
Add the new field nameString as one of the fields to look in in the
query fields param (qf)


•      
While searching use the parameter pf (phrase field) as the new
field nameString with a phrase slop of 1 or 2 (lower values would
mean stricter searching)

Your final query params will
be something like:

q=_START_

defType=dismax

qf=name
nameString

pf=nameString

ps=2
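
To make the trick concrete, here is a hedged sketch of the schema side (my illustration; the field type text_general and the exact analyzer chain are assumptions, only the nameString field name comes from the answer above):

  <!-- schema.xml: extra field that holds the name prefixed with the _START_ marker -->
  <field name="nameString" type="text_general" indexed="true" stored="false"/>

  <!-- at index time, a product named "super man returns" would be fed into
       nameString as the value "_START_ super man returns" -->

At query time, pf=nameString with a small ps rewards documents whose field value begins with the query terms, because only there do the terms sit right next to the _START_ marker.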

 


Solr: How can I get all
documents ordered by score with a list of
keywords?

I have a Solr 3.1 database
containing Emails with two fields:


•      
datetime


•      
text

For the query I have two
parameters:


•      
date of today


•      
keyword array(“important thing”, “important too”, “not so
important, but more than average”)

Is it possible to create a
query to

1.     
get ALL documents of this day AND

2.     
sort them by relevancy by ordering them so that the email with
contains most of my keywords(important things) scores
best?

The part with the date is
not very complicated:

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

I know that you can boost
the keywords this way:

q=text:”first keyword”^5 OR
text:”second one”^2 OR text:”minus scoring”^0.5 OR
text:”*”

But how do I only use the keywords to sort this list and get ALL entries, instead of doing a real query and getting only a few entries back?

Thanks for help!

 

2 Answers

You need to specify your
terms in the main query and then change your date query to be a
filter query on these results by adding the following.

fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

So you should have something
like this:

q=<your search terms>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

Edit: A little more about
filter queries (as suggested by rfreak).

From Solr Wiki – FilterQuery
Guidance – “Now, what is a filter query? It is simply a part of a
query that is factored out for special treatment. This is achieved
in Solr by specifying it using the fq (filter query) parameter
instead of the q (main query) parameter. The same result could be
achieved leaving that query part in the main query. The difference
will be in query efficiency. That’s because the result of a filter
query is cached and then used to filter a primary query result
using set intersection.”

These should be sorted by
relevancy score already, that is just the default behavior of Solr.
You can see the score by adding that field.

fl=*,score

If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation (admin/form.jsp), you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.

I hope that this helps
you.

 

+1 The filter query is an excellent suggestion. You may consider adding a bit about the advantage of using the filter query there. – rfeak May 27 ’11 at 14:55

 

Thank you! The filter query is working as expected. But unfortunately I still don’t know how to handle the keywords, because they filter the emails instead of only sorting them. – Daniel May 27 ’11 at 16:06

Sorting by relevance is the default behavior of Solr/Lucene. If your results are unsatisfactory, try putting the keywords in quotes.

//Edit: Following the answer from Paige Cook, use something like this:

q=”important thing”&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]

//2nd update: Thinking about this answer again, quotes are not a good idea, because in that case you will only receive “important thing” mails, but no “important too” mails.

The point is which keywords you are using. Searching for important thing gives the highest scores to “important thing” mails, but Lucene does not know how to score “important too” or “not so important, but more than average” in relation to your keywords. Another idea would be to search only for “important”. But then the field values “important thing” and “important too” give nearly the same score, because 50% of the searched keywords (here: “important”) are part of the field value. So you probably have to change your keywords. It could work after changing “important too” into “also an important mail”, to get the best ratio of the search word “important” to the field value, so that the shortest mail description scores highest.

 

Thanks for your answer! You point exactly to my problem, because the keywords filter the documents instead of only sorting them all and influencing the relevancy score. I do not know how to handle this. – Daniel May 27 ’11 at 16:13
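
A hedged suggestion for the problem Daniel describes (mine, not from the thread; the date and keywords are placeholders): OR a match-all clause into q so that every document of the day is returned, while the boosted keyword clauses still drive the score:

  q=(*:* OR text:"important thing"^5 OR text:"important too"^2)
  &fq=datetime:[2011-05-27T00:00:00.000Z TO 2011-05-27T23:59:59.999Z]
  &fl=*,score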



Solr changes document’s
score when its random field value altered

http://stackoverflow.com/questions/6254587/solr-changes-documents-score-when-its-random-field-value-altered


I need to navigate forth and
back in Solr results set ordered by score viewing documents one by
one. To visualise that, first a list of document titles is
presented to user, then he or she can click one of the title to see
more details and then needs to have an opportunity to move to the
next document in the original list without getting back and
clicking another title.

During viewing, documents get changed: their dynamic field is modified (or created if it does not exist yet) to mark that the document has already been viewed (used in another search).

The problem I face is that when the document is altered and re-indexed to keep those changes, sometimes (and not always, which is very disturbing) its place in the result set for the same query changes (in other words, its score changes, which doesn’t happen when browsing results sorted by one of the documents’ fields). So, “Previous” / “Next” navigation doesn’t work properly.

I’m not using any custom
weighting or boosters on fields for score calculation. Also, that
dynamic field changed during browsing doesn’t participate in the
query used to get the record set browsed.

So, the questions are: can
the modification of the document’s field not included in the query
change its relevance score? And if it can, then how can I control
that?

UPDATE

I did some tests and can add
the following:

1.     
Document changes its place in the result set even if no field is
amended – just requesting the document and re-indexing it without
any changes to its fields makes it take another place next time the
same query over the same index is executed.

2.     
That happens even if the result set is sorted explicitly
(“first_name DESC”), so score (which depends on the update date) is
not involved. The document stays the same, the field the result set is sorted by is the same, yet its position changes.

Still have no idea how to
avoid that.

 

2 Answers

In Solr, if your field is
“indexed”, it will have an effect on the relevancy ranking
(“stored” fields show up in search results but are not necessarily
searchable). If the fields in question aren’t marked as indexed
then you are good to go. Note that “indexed” and “stored” are not
necessarily the same, hence your confusion about results lists
changing even though not all fields are shown (a field can be
“indexed” and not “stored” as well).

In this case I think you
want your “viewed” field to be “stored” but not “indexed”. If you
really want to control the query, you can use copyField to copy the
relevant results into a single searchable field. You can also boost
terms or documents so that certain fields are “less important” to
the search query.

If you want to see how the
relevancy rankings are calculated, you can add “debugQuery=on” to
the end of your Solr Query (see the Relevancy FAQ for more
info).

However, all that being
said, I would recommend you cache your search result query (at
least for the first page for your results), since you will always
have results changing (documents added, removed by other users,
etc). Your best bet is to design a UI that anticipates this, or at
least batches a user’s query.

 

Thanks, for some reason I
was sure changes to fields not participating in the query don’t
affect the calculated score. In my case it is necessary to have
this field indexed as there is another query where I need to filter
documents searching only viewed or only not viewed before. Caching
is also not suitable as users are supposed to navigate through the whole result set, not only through the page (well, caching is still
possible and to be honest bearable in terms of resources but just
not elegant). I’ll try to boost the field being searched and tell
if that works. – Yuriy Jun 7 ’11 at 7:45

 

Just noticed that it also
happens when the results are sorted by other field than score. How
that’s possible? I thought if ordering is specified and score is
not in the clause explicitly (say, ordering is like “first_name
DESC”), it doesn’t influence the ordering. However, it seems it
does. How can I get rid of that? – Yuriy Jun 8 ’11 at
14:11

 

Okay, looks like boosting
works, but has no effect. If I boost the field I am searching in,
all the matches are boosted equally and still the recently
re-indexed documents get some delta in their relevance which makes
difference. There should be a way to exclude the date of last
update from the ordering completely but I can’t find it yet… –
Yuriy Jun 8 ’11 at 14:50

 


I’ve
found the solution which doesn’t eliminate the problem completely
but makes it much less likely to happen.

So the problem happens when
the documents are sorted by some field and there is a number of
them with the same value in this field (e.g. result set is sorted
by first name, and there are 100 entries for “John”).

This is when the indexed
time gets involved – apparently Solr uses it to sort the documents
when their main sorting fields are identical. To make this case
much less probable, you need to add more sorting fields, e.g.
“first_name desc” should become “first_name desc, last_name desc,
register_date asc”.

Also, adding document’s
unique id as the last sorting field should remove the problem
completely (the set of sorting fields will never be identical for
any two documents in the index).
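
A minimal sketch of such a sort clause, assuming the uniqueKey field is called id:

  sort=first_name desc, last_name desc, register_date asc, id asc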


 

Relevance Customization

http://lucene.472066.n3.nabble.com/Relevance-Customization-td501310.html

Hi all.

I want to know if it’s possible to customize the Solr relevance, something like this:

1 – I create a static score
for each document and index it.

2 – I change the relevance
to Score(Solr) + Score(Static) where the solr score is equal to 30%
of the total score. Mixing the two scores into only one.

 

This is different from sorting by my static score and then by Solr score, because I don’t want to kill the Solr score, just give it a little less importance.

There is a way to do
this?

Thanks

 

Re: Relevance Customization

It can be done with
something like q=yourQuery _val_:yourStaticScoreField

http://wiki.apache.org/solr/FunctionQuery#fieldvalue

 

But this adds solr score
with static score. I am not sure how to get 30% of solr score. May
be something like?

q=yourQuery^0.3 _val_:yourStaticScoreField^0.7

Modify
SOLR scoring

Hi everybody,

I’m using SOLR with a schema
(for example) like this:  parutiondate, date,
indexed, not stored

fulltext, stemmed, indexed,
not stored

 

I know it’s possible to order by a field or more, but I want to order by score and modify the “score” formula. I want to keep the SOLR score but add a new parameter to the formula to boost the score of the most recent documents.

What is the best way to do
this ?

Thanks.

Excuse for my
english.

 

RE: modify SOLR scoring

I believe you can use a
function query to do this:

http://wiki.apache.org/solr/FunctionQuery

if you embed the following
in your query, you should get a boost for more recent date
values:

_val_:"ord(dateField)"

Where “dateField” is the
field name of the date you want to use.
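
As a hedged alternative (my addition, following the example on the FunctionQuery wiki page linked above, and assuming parutiondate is a trie-based date field so that ms() can be used), a reciprocal of the document age usually gives a smoother recency boost than ord():

  q=yourQuery _val_:"recip(ms(NOW,parutiondate),3.16e-11,1,1)"

The constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so a document one year old contributes about half the boost of a brand-new one.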

 

Re: modify SOLR scoring

http://lucene.472066.n3.nabble.com/modify-SOLR-scoring-td497348.html

I am interested in a very similar topic. I want to modify the field named “score” and the document boost but not reindex all the fields, since that would take too much power.

Please let me know if you
find a solution to this.

Kindly


Change order before
returning data

http://stackoverflow.com/questions/4965172/change-order-before-returning-data

 

Is there any way to change the order of results in SOLR? E.g. when I query SOLR I will get the 1000 records with the highest score; then on those 1000 records I will use my own function to change the order again and just get 10 of those records. I can get 1000 records and process them with PHP or Java, but then I have to transfer 1000 records from the SOLR server to the webserver and I don’t want that; I just want to get 10 records after changing the order, and use paging. Does SOLR support this kind of custom function?

 

 Answers

If your function can be applied when the records are initially indexed, you can do it there and add the result as a value on the record, then sort the result set by the precalculated value. If not, I haven’t worked with it directly, but this thread seems to have the answer you’re looking for.

 

Hi, my case is very special: I already have a pre-indexed score in the database. Let me give one example. I have a shopping site; when I search for TV LCD 32 inch, I get many results from different brands like LG, Toshiba … and many results for LG appear consecutively. I want to separate them, e.g. I don’t want 3 results for LG sitting next to each other. Currently I get the 1000 best records (based on score) and change the order again using PHP; now I want to move this job to SOLR (I don’t want to transfer too much data between SOLR and the webserver, I just need 10 records to display) – user612433 Feb 11 ’11 at 3:45

 

Yes, you can create a column with the info you want taken into account in the score.

For ex, for a “popularity”
column, your query would be:

your query && _val_:"popularity"^0.7

0.7 being the boost factor in the final score. You can also filter the result set to get fewer results:

your query &&
fq=popularity:[10 TO *]

 

 

limiting the total number of documents
matched

http://search-lucene.com/m/4AHNF17wIJW1/

 

Re: limiting the total
number of documents matched

Yonik Seeley 2010-07-17,
00:55

On Wed, Jul 14, 2010 at 5:46
PM, Paul <[EMAIL PROTECTED]>
wrote:

I thought of another way to do it, but I still have one thing I don’t know how to do. I could do the search without sorting for the 50th page, then look at the relevancy score of the first item on that page, then repeat the search, but add score > that relevancy as a parameter. Is it possible to do a search with “score:[5 to *]”? It didn’t work in my first attempt.

 

frange could possibly help (range query on an arbitrary function).

http://www.lucidimagination.com/blog/tag/frange/

 

So perhaps something
like

q={!frange
l=0.85}query($qq)

qq=

 

where 0.85 is the lower
bound you want for scores and qq is the normal relevancy
query

-Yonik

http://www.lucidimagination.com

 

On Wed, Jul 14, 2010 at 5:34
PM, Paul <[EMAIL PROTECTED]>
wrote:

 I was hoping for a way
to do this purely by configuration and making the correct GET
requests, but if there is a way to do it by creating a custom
Request Handler, I suppose I could plunge into that. Would that
yield the best results, and would that be particularly
difficult?

 

>> On Wed, Jul 14, 2010 at
4:37 PM, Nagelberg, Kallin

So you want to take the top
1000 sorted by score, then sort those by another field. It’s a
strange case, and I can’t think of a clean way to accomplish it.
You could do it in two queries, where the first is by score and you
only request your IDs to keep it snappy, then do a second query
against the IDs and sort by your other field. 1000 seems like a lot
for that approach, but who knows until you try it on your
data.

>>> -Kallin
Nagelberg

 

>>> Subject:
limiting the total number of documents matched

I’d like to limit the total
number of documents that are returned for a search, particularly
when the sort order is not based on relevancy. In other words, if
the user searches for a very common term, they might get tens of
thousands of hits, and if they sort by “title”, then very high
relevancy documents will be interspersed with very low relevancy
documents. I’d like to set a limit to the 1000 most relevant
documents, then sort those by title. Is there a way to do
this?

 

I guess I could always
retrieve the top 1000 documents and sort them in the client, but
that seems particularly inefficient. I can’t find any other way to
do this, though.
