Code Samples - Zoie
Zoie is a real-time search and indexing system built on Apache Lucene.
It was donated by LinkedIn.com on July 19, 2008, and has been deployed on a large-scale, real-time consumer website: LinkedIn.com, which handles millions of searches as well as millions of updates daily.
Configuration
Zoie can be configured via Spring:
Basic Search
This example shows how to set up basic indexing and search
thread 1: (indexing thread)
thread 2: (search thread)
Apache Solr vs ElasticSearch - the Feature Smackdown!
API
Feature | Solr 4.7.0 | ElasticSearch 1.0 |
---|---|---|
Format | XML,CSV,JSON | JSON |
HTTP REST API | ||
Binary API | SolrJ | TransportClient, Thrift (through a plugin) |
JMX support | | ES specific stats are exposed through the REST API |
Client libraries | PHP, Ruby, Perl, Scala, Python, .NET, Javascript | PHP, Ruby, Perl, Scala, Python, .NET, Javascript, Erlang, Clojure |
3rd-party product integration (open-source) | Drupal, Magento, Django, ColdFusion, Wordpress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak (via Yokozuna) | Drupal, Django, Symfony2, Wordpress, CouchBase |
3rd-party product integration (commercial) | DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR | SearchBlox, Hortonworks Data Platform, MapR |
Output | JSON, XML, PHP, Python, Ruby, CSV, Velocity, XSLT, native Java | JSON, XML/HTML (via plugin) |
Indexing
Searching
Feature | Solr 4.7.0 | ElasticSearch 1.0 |
---|---|---|
Lucene Query parsing | ||
Structured Query DSL | Need to programmatically create queries if going beyond Lucene query syntax. | |
Span queries | via SOLR-2703 | |
Spatial search | ||
Multi-point spatial search | ||
Faceting | The way top N facets work now is by getting the top N from each shard and merging the results. This can give incorrect counts when the number of shards is greater than 1. | |
Advanced Faceting | blog post | |
Pivot Facets | ||
More Like This | ||
Boosting by functions | ||
Boosting using scripting languages | ||
Push Queries | JIRA issue | Percolation. Distributed percolation supported in 1.0 |
Field collapsing/Results grouping | possibly 1.0+ link | |
Spellcheck | Suggest API | |
Autocomplete | Added in 0.90.3 here | |
Query elevation | workaround | |
Joins | Not supported in distributed search. See LUCENE-3759. | via has_child and top_children queries |
Resultset Scrolling | New to 4.7.0 | via scan search type |
Filter queries | also supports filtering by native scripts | |
Filter execution order | local params and cache property | _cache and _cache_key property |
Alternative QueryParsers | DisMax, eDisMax | query_string, dis_max, match, multi_match etc |
Negative boosting | but awkward. Involves positively boosting the inverse set of negatively-boosted documents. | |
Search across multiple indexes | it can search across multiple compatible collections | |
Result highlighting | ||
Custom Similarity | ||
Searcher warming on index reload | | Warmers API |
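The faceting caveat in the table above (take the top N from each shard, then merge) can be illustrated with a small simulation. This is only a sketch: the shard contents, the class name `ShardFacetDemo`, and its methods are made up for illustration and are not code from Solr or ElasticSearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardFacetDemo {

    /** Returns the n highest-count entries of one shard (ties broken by term name). */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> shard, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(shard.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    /** Merges only the per-shard top-n lists, as the table note describes. */
    static Map<String, Integer> mergeTopN(List<Map<String, Integer>> shards, int n) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, Integer> e : topN(shard, n)) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    /** Sums every count from every shard: the exact answer. */
    static Map<String, Integer> exactCounts(List<Map<String, Integer>> shards) {
        Map<String, Integer> exact = new TreeMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, Integer> e : shard.entrySet()) {
                exact.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return exact;
    }

    /** Two hypothetical shards holding facet counts for the same field. */
    static List<Map<String, Integer>> sampleShards() {
        Map<String, Integer> shard1 = new HashMap<>();
        shard1.put("java", 10);
        shard1.put("scala", 6);
        shard1.put("ruby", 5);
        Map<String, Integer> shard2 = new HashMap<>();
        shard2.put("java", 9);
        shard2.put("ruby", 7);
        shard2.put("scala", 2);
        return Arrays.asList(shard1, shard2);
    }

    public static void main(String[] args) {
        // With n = 2, shard1 drops "ruby" and shard2 drops "scala" before merging.
        System.out.println("merged top-2: " + mergeTopN(sampleShards(), 2)); // {java=19, ruby=7, scala=6}
        System.out.println("exact counts: " + exactCounts(sampleShards()));  // {java=19, ruby=12, scala=8}
    }
}
```

In this made-up data, "ruby" is reported with 7 hits although the exact count is 12, because one shard's contribution fell outside that shard's local top 2.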
Customizability
Distributed
Feature | Solr 4.7.0 | ElasticSearch 1.0 |
---|---|---|
Self-contained cluster | Depends on separate ZooKeeper server | Only ElasticSearch nodes |
Automatic node discovery | ZooKeeper | internal Zen Discovery or ZooKeeper |
Partition tolerance | The partition without a ZooKeeper quorum will stop accepting indexing requests or cluster state changes, while the partition with a quorum continues to function. | Partitioned clusters can diverge unless discovery.zen.minimum_master_nodes is set to at least N/2+1, where N is the size of the cluster. If configured correctly, the partition without a quorum will stop operating, while the other continues to work. |
Automatic failover | If all nodes storing a shard and its replicas fail, client requests will fail, unless requests are made with the shards.tolerant=true parameter, in which case partial results are returned from the available shards. | |
Automatic leader election | ||
Shard replication | ||
Sharding | ||
Automatic shard rebalancing | It can be machine, rack, availability zone, and/or data center aware. Arbitrary tags can be assigned to nodes, and it can be configured not to place a shard and its replicas on nodes with the same tags. | |
Change # of shards | Shards can be added (when using implicit routing) or split (when using compositeId). The number cannot be lowered. Replicas can be increased anytime. | Each index has 5 shards by default. The number of primary shards cannot be changed once the index is created. Replicas can be increased anytime. |
Relocate shards and replicas | Can be done by creating a shard replica on the desired node and then removing the shard from the source node | Can move shards and replicas to any node in the cluster on demand |
Control shard routing | shards or _route_ parameter | routing parameter |
Consistency | Indexing requests are synchronous with replication. An indexing request won't return until all replicas respond. There is no check for downed replicas; they will catch up when they recover. When new replicas are added, they won't start accepting and responding to requests until they have finished replicating the index. | Replication between nodes is synchronous by default, so ES is consistent by default, but it can be set to asynchronous on a per-document indexing basis. Index writes can be configured to fail if there are not sufficient active shard replicas. The default is quorum, but all or one are also available. |
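The N/2+1 rule quoted in the partition tolerance row can be worked through in a few lines. This is a minimal sketch of the arithmetic only; the class `ZenQuorum` and its method names are hypothetical and are not part of ElasticSearch.

```java
final class ZenQuorum {

    /** Smallest majority of a cluster of the given size: N/2 + 1, using integer division. */
    static int minimumMasterNodes(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("cluster size must be positive");
        }
        return clusterSize / 2 + 1;
    }

    /** True if a partition with this many nodes still holds a majority of the cluster. */
    static boolean hasQuorum(int partitionSize, int clusterSize) {
        return partitionSize >= minimumMasterNodes(clusterSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            System.out.println("cluster of " + n + " nodes -> minimum_master_nodes >= "
                    + minimumMasterNodes(n));
        }
        // A 5-node cluster split 3/2: only the 3-node side keeps working.
        System.out.println(hasQuorum(3, 5)); // true
        System.out.println(hasQuorum(2, 5)); // false
    }
}
```

Because two disjoint partitions cannot both contain more than half of the nodes, at most one side of a split can satisfy this threshold, which is why the setting prevents divergence.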
Misc
Feature | Solr 4.7.0 | ElasticSearch 1.0 |
---|---|---|
Web Admin interface | bundled with Solr | via site plugins: elasticsearch-head, bigdesk, kopf, elasticsearch-HQ, Hammer |
Hosting providers | WebSolr, Searchify, Hosted-Solr, IndexDepot, OpenSolr, gotosolr | bonsai.io, Indexisto, qbox.io, IndexDepot |
Thoughts...
As a number of folks point out in the discussion below, feature comparisons are inherently shallow and only go so far. I think they serve a purpose, but shouldn't be taken to be the last word on these 2 fantastic search products.
If you're running a smallish site and need search features without fancy bells-and-whistles, I think you'll be very happy with either Solr or ElasticSearch.
I've found ElasticSearch to be friendlier to teams which are used to REST APIs, JSON etc and don't have a Java background. If you're planning a large installation that requires running distributed search instances, I suspect you're also going to be happier with ElasticSearch.
As Matt Weber points out below, ElasticSearch was built to be distributed from the ground up, not tacked on as an 'afterthought' like it was with Solr. This is totally evident when examining the design and architecture of the 2 products, and also when browsing the source code.
Resources
- My other sites may be of interest if you're new to Lucene, Solr and ElasticSearch:
- The Solr wiki and the ElasticSearch Guide are your friends.
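Full-Text Search with Geographic Location Using Lucene-Spatial - haiker - ITeye
Lucene's Spatial package supports location-aware full-text search. The most typical use case is "find hot-pot restaurants within 1 km of Zhongguancun, sorted by distance." Adding location support with Lucene-Spatial differs from ordinary text search in two main ways:
1. Convert the coordinates into Cartesian tiers and index them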
    private void indexLocation(Document document, JSONObject jo)
            throws Exception {
        double longitude = jo.getDouble("longitude");
        double latitude = jo.getDouble("latitude");
        document.add(new Field("lat", NumericUtils
                .doubleToPrefixCoded(latitude), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        document.add(new Field("lng", NumericUtils
                .doubleToPrefixCoded(longitude), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        for (int tier = startTier; tier <= endTier; tier++) {
            ctp = new CartesianTierPlotter(tier, projector,
                    CartesianTierPlotter.DEFALT_FIELD_PREFIX);
            final double boxId = ctp.getTierBoxId(latitude, longitude);
            document.add(new Field(ctp.getTierFieldName(), NumericUtils
                    .doubleToPrefixCoded(boxId), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
        }
    }
2. At search time, use a DistanceQueryFilter
    DistanceQueryBuilder dq = new DistanceQueryBuilder(latitude,
            longitude, miles, "lat", "lng",
            CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, startTier,
            endTier);
    DistanceFieldComparatorSource dsort = new DistanceFieldComparatorSource(
            dq.getDistanceFilter());
    Sort sort = new Sort(new SortField("geo_distance", dsort));
Below is the complete code, based on Lucene 3.2.0 and JUnit 4.8.2. The Maven dependencies are:
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.8.2</version>
            <type>jar</type>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.2.0</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-spatial</artifactId>
            <version>3.2.0</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20100903</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
    </dependencies>
First, prepare the test data (one JSON object per line; the titles are kept in Chinese because the test searches them by keyword):
    {"id":12,"title":"时尚码头美容美发热烫特价","longitude":116.3838183,"latitude":39.9629015}
    {"id":17,"title":"审美个人美容美发套餐","longitude":116.386564,"latitude":39.966102}
    {"id":23,"title":"海底捞吃300送300","longitude":116.38629,"latitude":39.9629573}
    {"id":26,"title":"仅98元!享原价335元李老爹","longitude":116.3846175,"latitude":39.9629125}
    {"id":29,"title":"都美造型烫染美发护理套餐","longitude":116.38629,"latitude":39.9629573}
    {"id":30,"title":"仅售55元!原价80元的老舍茶馆相声下午场","longitude":116.0799914,"latitude":39.9655391}
    {"id":33,"title":"仅售55元!原价80元的新笑声客栈早场","longitude":116.0799914,"latitude":39.9655391}
    {"id":34,"title":"仅售39元(红色礼盒)!原价80元的平谷桃","longitude":116.0799914,"latitude":39.9655391}
    {"id":46,"title":"仅售38元!原价180元地质礼堂白雪公主","longitude":116.0799914,"latitude":39.9655391}
    {"id":49,"title":"仅99元!享原价342.7元自助餐","longitude":116.0799914,"latitude":39.9655391}
    {"id":58,"title":"桑海教育暑期学生报名培训九折优惠券","longitude":116.0799914,"latitude":39.9655391}
    {"id":59,"title":"全国发货:仅29元!贝玲妃超模粉红高光光","longitude":116.0799914,"latitude":39.9655391}
    {"id":65,"title":"海之屿生态水族用品店抵用券","longitude":116.0799914,"latitude":39.9655391}
    {"id":67,"title":"小区东门时尚烫染个人护理美发套餐","longitude":116.3799914,"latitude":39.9655391}
    {"id":74,"title":"《郭德纲相声专辑》CD套装","longitude":116.0799914,"latitude":39.9655391}
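Based on the test data above, write test cases that search within 3 km of the coordinate (116.3838183, 39.9629015), first for the keyword “美发” (hairdressing) and then for all content. The expected results are 4 hits and 6 hits respectively.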
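The expected counts of 4 and 6 can be cross-checked without Lucene by brute force. The sketch below is not part of the original article: the class `BruteForceGeoCheck` is hypothetical, it uses the haversine formula with an assumed mean Earth radius (Lucene's own distance calculation may differ by a few meters), and instead of tokenizing titles it uses a hand-made flag marking the documents whose title contains the keyword (ids 12, 17, 29 and 67 in the data above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BruteForceGeoCheck {
    // Assumed mean Earth radius in km; not the constant Lucene uses.
    static final double EARTH_RADIUS_KM = 6371.0088;

    // Test data copied from the JSON lines above, one column per array.
    static final int[] IDS = {12, 17, 23, 26, 29, 30, 33, 34, 46, 49, 58, 59, 65, 67, 74};
    static final double[] LON = {
        116.3838183, 116.386564, 116.38629, 116.3846175, 116.38629,
        116.0799914, 116.0799914, 116.0799914, 116.0799914, 116.0799914,
        116.0799914, 116.0799914, 116.0799914, 116.3799914, 116.0799914};
    static final double[] LAT = {
        39.9629015, 39.966102, 39.9629573, 39.9629125, 39.9629573,
        39.9655391, 39.9655391, 39.9655391, 39.9655391, 39.9655391,
        39.9655391, 39.9655391, 39.9655391, 39.9655391, 39.9655391};
    // Hand-made flag: true where the title contains the hairdressing keyword.
    static final boolean[] HAS_KEYWORD = {
        true, true, false, false, true,
        false, false, false, false, false,
        false, false, false, true, false};

    /** Great-circle distance in km between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Ids of documents within rangeKm of the point, nearest first. */
    static List<Integer> search(double lon, double lat, double rangeKm, boolean keywordOnly) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < IDS.length; i++) {
            boolean keywordOk = !keywordOnly || HAS_KEYWORD[i];
            if (keywordOk && haversineKm(lat, lon, LAT[i], LON[i]) <= rangeKm) {
                matches.add(i);
            }
        }
        // List.sort is stable, so documents at identical coordinates keep input order.
        matches.sort(Comparator.comparingDouble(k -> haversineKm(lat, lon, LAT[k], LON[k])));
        List<Integer> ids = new ArrayList<>();
        for (int k : matches) {
            ids.add(IDS[k]);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(search(116.3838183, 39.9629015, 3.0, true));  // [12, 29, 17, 67]
        System.out.println(search(116.3838183, 39.9629015, 3.0, false)); // [12, 26, 23, 29, 17, 67]
    }
}
```

All documents at longitude 116.0799914 lie roughly 26 km west of the query point, so only the six documents near longitude 116.38 fall inside the 3 km radius.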
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.fail;

    import java.util.List;

    import org.junit.Test;

    public class LuceneSpatialTest {
        private static LuceneSpatial spatialSearcher = new LuceneSpatial();

        @Test
        public void testSearch() {
            try {
                long start = System.currentTimeMillis();
                List<String> results = spatialSearcher.search("美发", 116.3838183, 39.9629015, 3.0);
                // Prints "<n> matches, took <t> ms."
                System.out.println(results.size()
                        + "个匹配结果,共耗时 "
                        + (System.currentTimeMillis() - start) + "毫秒。\n");
                assertEquals(4, results.size());
            } catch (Exception e) {
                // Print the stack trace first: fail() throws, so nothing after it would run.
                e.printStackTrace();
                fail("Exception occurs...");
            }
        }

        @Test
        public void testSearchWithoutKeyword() {
            try {
                long start = System.currentTimeMillis();
                List<String> results = spatialSearcher.search(null, 116.3838183, 39.9629015, 3.0);
                System.out.println(results.size()
                        + "个匹配结果,共耗时 "
                        + (System.currentTimeMillis() - start) + "毫秒.\n");
                assertEquals(6, results.size());
            } catch (Exception e) {
                e.printStackTrace();
                fail("Exception occurs...");
            }
        }
    }
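Below is the LuceneSpatial class; its constructor initializes the fields and builds the index:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.queryParser.QueryParser.Operator;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.spatial.tier.DistanceFieldComparatorSource;
    import org.apache.lucene.spatial.tier.DistanceQueryBuilder;
    import org.apache.lucene.spatial.tier.projections.CartesianTierPlotter;
    import org.apache.lucene.spatial.tier.projections.IProjector;
    import org.apache.lucene.spatial.tier.projections.SinusoidalProjector;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.NumericUtils;
    import org.apache.lucene.util.Version;
    import org.json.JSONObject;

    public class LuceneSpatial {
        private Analyzer analyzer;
        private IndexWriter writer;
        private FSDirectory indexDirectory;
        private IndexSearcher indexSearcher;
        private IndexReader indexReader;
        private String indexPath = "c:/lucene-spatial";

        // Spatial
        private IProjector projector;
        private CartesianTierPlotter ctp;
        public static final double RATE_MILE_TO_KM = 1.609344; // kilometers per mile
        public static final String LAT_FIELD = "lat";
        public static final String LON_FIELD = "lng";
        private static final double MAX_RANGE = 15.0; // largest search radius the index supports, in km
        private static final double MIN_RANGE = 3.0; // smallest search radius the index supports, in km
        private int startTier;
        private int endTier;

        public LuceneSpatial() {
            try {
                init();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private void init() throws Exception {
            initializeSpatialOptions();
            analyzer = new StandardAnalyzer(Version.LUCENE_32);
            File path = new File(indexPath);
            boolean isNeedCreateIndex = false;
            if (path.exists() && !path.isDirectory())
                throw new Exception("Specified path is not a directory");
            if (!path.exists()) {
                path.mkdirs();
                isNeedCreateIndex = true;
            }
            indexDirectory = FSDirectory.open(new File(indexPath));
            // Build the index only when the directory did not exist yet
            if (isNeedCreateIndex) {
                IndexWriterConfig indexWriterConfig = new IndexWriterConfig(
                        Version.LUCENE_32, analyzer);
                indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
                writer = new IndexWriter(indexDirectory, indexWriterConfig);
                buildIndex();
                // Release the write lock once the index has been built
                writer.close();
            }
            indexReader = IndexReader.open(indexDirectory, true);
            indexSearcher = new IndexSearcher(indexReader);
        }

        @SuppressWarnings("deprecation")
        private void initializeSpatialOptions() {
            projector = new SinusoidalProjector();
            ctp = new CartesianTierPlotter(0, projector,
                    CartesianTierPlotter.DEFALT_FIELD_PREFIX);
            // bestFit expects miles, so convert the km ranges first
            startTier = ctp.bestFit(MAX_RANGE / RATE_MILE_TO_KM);
            endTier = ctp.bestFit(MIN_RANGE / RATE_MILE_TO_KM);
        }

        private int mile2Meter(double miles) {
            double dMeter = miles * RATE_MILE_TO_KM * 1000;
            return (int) dMeter; // truncates to whole meters
        }

        private double km2Mile(double km) {
            return km / RATE_MILE_TO_KM;
        }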
The index-building implementation:
        private void buildIndex() {
            BufferedReader br = null;
            try {
                // Add the test data to the index line by line; the data file
                // sits in the same directory as the source file
                br = new BufferedReader(new InputStreamReader(
                        LuceneSpatial.class.getResourceAsStream("data")));
                String line = null;
                while ((line = br.readLine()) != null) {
                    index(new JSONObject(line));
                }
                writer.commit();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (br != null) {
                    try {
                        br.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        private void index(JSONObject jo) throws Exception {
            Document doc = new Document();
            // "id" is a JSON number, so read it as an int and convert it explicitly
            doc.add(new Field("id", String.valueOf(jo.getInt("id")), Field.Store.YES,
                    Field.Index.ANALYZED));
            doc.add(new Field("title", jo.getString("title"), Field.Store.YES,
                    Field.Index.ANALYZED));
            // Add the location information to the index
            indexLocation(doc, jo);
            writer.addDocument(doc);
        }

        private void indexLocation(Document document, JSONObject jo)
                throws Exception {
            double longitude = jo.getDouble("longitude");
            double latitude = jo.getDouble("latitude");
            document.add(new Field("lat", NumericUtils
                    .doubleToPrefixCoded(latitude), Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            document.add(new Field("lng", NumericUtils
                    .doubleToPrefixCoded(longitude), Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            // Index one box id per tier, from the coarsest tier to the finest
            for (int tier = startTier; tier <= endTier; tier++) {
                ctp = new CartesianTierPlotter(tier, projector,
                        CartesianTierPlotter.DEFALT_FIELD_PREFIX);
                final double boxId = ctp.getTierBoxId(latitude, longitude);
                document.add(new Field(ctp.getTierFieldName(), NumericUtils
                        .doubleToPrefixCoded(boxId), Field.Store.YES,
                        Field.Index.NOT_ANALYZED_NO_NORMS));
            }
        }
The search implementation:
        public List<String> search(String keyword, double longitude,
                double latitude, double range) throws Exception {
            List<String> result = new ArrayList<String>();
            // The spatial API works in miles; range is given in km
            double miles = km2Mile(range);
            DistanceQueryBuilder dq = new DistanceQueryBuilder(latitude,
                    longitude, miles, "lat", "lng",
                    CartesianTierPlotter.DEFALT_FIELD_PREFIX, true, startTier,
                    endTier);
            // Sort by distance
            DistanceFieldComparatorSource dsort = new DistanceFieldComparatorSource(
                    dq.getDistanceFilter());
            Sort sort = new Sort(new SortField("geo_distance", dsort));
            Query query = buildQuery(keyword);
            // Run the search
            TopDocs hits = indexSearcher.search(query, dq.getFilter(),
                    Integer.MAX_VALUE, sort);
            // Distance (in miles) of each hit, keyed by document id
            Map<Integer, Double> distances = dq.getDistanceFilter()
                    .getDistances();
            // Iterate over the returned hits rather than totalHits
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                final int docID = hits.scoreDocs[i].doc;
                final Document doc = indexSearcher.doc(docID);
                // Builds "Found: <title>, distance: <n> meters."
                final StringBuilder builder = new StringBuilder();
                builder.append("找到了: ")
                        .append(doc.get("title"))
                        .append(", 距离: ")
                        .append(mile2Meter(distances.get(docID)))
                        .append("米。");
                System.out.println(builder.toString());
                result.add(builder.toString());
            }
            return result;
        }

        private Query buildQuery(String keyword) throws Exception {
            // Without a keyword, return every document in range
            if (keyword == null || keyword.isEmpty()) {
                return new MatchAllDocsQuery();
            }
            QueryParser parser = new QueryParser(Version.LUCENE_32, "title",
                    analyzer);
            parser.setDefaultOperator(Operator.AND);
            return parser.parse(keyword);
        }
    }
Running the test cases produces the output below (找到了 means "found", 距离 "distance", 米 "meters"; the summary lines read "N matches, took T ms"):
    找到了: 时尚码头美容美发热烫特价, 距离: 0米。
    找到了: 都美造型烫染美发护理套餐, 距离: 210米。
    找到了: 审美个人美容美发套餐, 距离: 426米。
    找到了: 小区东门时尚烫染个人护理美发套餐, 距离: 439米。
    4个匹配结果,共耗时 119毫秒。

    找到了: 时尚码头美容美发热烫特价, 距离: 0米。
    找到了: 仅98元!享原价335元李老爹, 距离: 68米。
    找到了: 海底捞吃300送300, 距离: 210米。
    找到了: 都美造型烫染美发护理套餐, 距离: 210米。
    找到了: 审美个人美容美发套餐, 距离: 426米。
    找到了: 小区东门时尚烫染个人护理美发套餐, 距离: 439米。
    6个匹配结果,共耗时 3毫秒.
References:
An introduction to how Lucene-Spatial works: http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm
GeoHash: http://en.wikipedia.org/wiki/Geohash
Two examples (most of the code here comes from them):
Lucene Spatial Example