Elasticsearch Performance Tuning Practice at eBay

Source: https://www.ebayinc.com | Published: 2018-01-09

Elasticsearch is an open source search and analytics engine based on Apache Lucene that allows users to store, search, and analyze data in near real time. Pronto, the platform that hosts Elasticsearch clusters at eBay, makes it easy for eBay internal customers to deploy, operate, and scale Elasticsearch for full-text search, real-time analysis, and log/event monitoring. Today there are 60+ Elasticsearch clusters and 2,000+ nodes managed by Pronto. Daily ingestion reaches 18 billion documents, and daily search requests reach 3.5 billion. The platform offers a full spectrum of value, from provisioning, remediation, and security to monitoring, alerting, and diagnostics.

While Elasticsearch is designed for fast queries, its performance depends largely on the scenarios that apply to your application, the volume of data you are indexing, and the rate at which applications and users query your data. This document summarizes the challenges as well as the process and tools the Pronto team has built to address those challenges strategically. It also shows selected benchmarking results for various configurations as illustrations.

Challenges

The challenges for the Pronto/Elasticsearch use cases observed so far include:

  1. High throughput: Some clusters have up to 5TB of data ingested per day, and some clusters take more than 400 million search requests per day. Requests accumulate upstream if Elasticsearch cannot handle them in time.
  2. Low search latency: For performance-critical clusters, especially for site-facing systems, a low search latency is mandatory, otherwise user experience would be impacted.
  3. Optimal settings always change: Data and queries are mutable, so there is no single optimal setting for all scenarios. For example, splitting an index into more shards helps time-consuming queries but may hurt the performance of other queries.

Solutions

To help our customers address these challenges, the Pronto team has built a strategic approach to performance testing, tuning, and monitoring, starting from use case onboarding and continuing throughout the cluster life cycle.

  1. Sizing: Before a new use case comes onboard, collect customer-provided information such as throughput, document size, document count, and search types to estimate the initial Elasticsearch cluster size.
  2. Optimize index design: Review the index design with the customer.
  3. Tune indexing performance: Tune indexing performance based on the user scenario.
  4. Tune search performance: Tune search performance based on the user scenario.
  5. Run performance tests: Run performance tests with real user data/queries, then compare and analyze the results across combinations of Elasticsearch configuration parameters. After the use case is onboarded, the cluster is monitored, and users are free to re-run performance tests whenever data changes, queries change, or traffic increases.

Sizing

The Pronto team runs benchmarks for every type of machine and every supported Elasticsearch version to gather performance data, and then uses it, together with customer-provided information, to estimate the initial cluster size. The customer-provided information includes:

  • Indexing throughput
  • Document size
  • Search throughput
  • Query type
  • Hot index document count
  • Retention policy
  • Response time requirement
  • SLA level

Optimize index design

Let’s think twice before starting to ingest data and run queries. What does an index stand for? The Elastic official answer is “a collection of documents that have somewhat similar characteristics.” So the next question is, “Which characteristics should I use to group my data? Should I put all documents into one index or multiple indices?” The answer is that it depends on the queries you use. Below are some suggestions on how to organize indices according to the queries used most frequently.

  • Split your data into multiple indices if your query has a filter field and its value is enumerable. For example, you have a lot of global product information ingested into Elasticsearch, most of your queries have a filter clause on “region,” and there is rarely a need to run cross-region queries. A query body like the one below can be optimized:
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "${title}"
                }
            },
            "filter": {
                "term": {
                    "region": "US"
                }
            }
        }
    }
}

Under this scenario, we can get better performance if the index is split into several smaller indices based on region, like US, Euro, and others. Then the filter clause can be removed from the query. If we need to run a cross-region query, we can just pass multiple indices or wildcards to Elasticsearch.
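As an illustrative sketch (index names such as products_us and products_eu are hypothetical), a per-region query no longer needs the region filter, and a cross-region query can simply target several indices or a wildcard:

GET products_us/_search
{
    "query": {
        "match": {
            "title": "${title}"
        }
    }
}

GET products_*/_search
{
    "query": {
        "match": {
            "title": "${title}"
        }
    }
}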

  • Use routing if your query has a filter field and its value is not enumerable. We can split the index into multiple shards by using the filter field value as a routing key and removing the filter.

For example, there are millions of orders ingested to Elasticsearch, and most queries need to query orders by buyer ID. It’s impossible to create an index for every buyer, so we cannot split data into multiple indices by buyer ID. A proper solution is to use routing to put all orders with the same buyer ID into the same shard. Then nearly all of your queries can be completed within the shard matching the routing key. 
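A minimal sketch of this approach, assuming a hypothetical orders index with an order type and a buyer_id field: documents are indexed with the buyer ID as the routing key, and searches pass the same routing value so that only the matching shard is queried.

PUT orders/order/1?routing=u1234
{
    "buyer_id": "u1234",
    "item": "camera",
    "price": 99.99
}

GET orders/_search?routing=u1234
{
    "query": {
        "term": {
            "buyer_id": "u1234"
        }
    }
}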

  • Organize data by date if your query has a date range filter. This works for most logging or monitoring scenarios. We can organize indices daily, weekly, or monthly, and then get an index list for a specified date range. Elasticsearch only needs to query a smaller data set instead of the whole data set. Plus, it is easy to shrink/delete old indices when the data has expired.
  • Set mapping explicitly. Elasticsearch can create mappings dynamically, but that might not be suitable for all scenarios. For example, the default dynamic mapping for string fields in Elasticsearch 5.x creates both "keyword" and "text" versions of the field, which is unnecessary in a lot of scenarios (see the mapping sketch after this list).
  • Avoid imbalanced sharding if documents are indexed with a user-defined ID or routing. Elasticsearch uses a random ID generator and a hash algorithm to make sure documents are allocated to shards evenly. When you use a user-defined ID or routing, the ID or routing key might not be random enough, and some shards may end up noticeably bigger than others. Read/write operations on those shards would then be much slower than on others. We can optimize the ID/routing key or use index.routing_partition_size (available in 5.3 and later).
  • Make shards distributed evenly across nodes. If one node has more shards than the other nodes, it will take more load than the others and may become the bottleneck of the whole system.
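The mapping sketch referenced above: assuming a hypothetical products index where region is only ever used for exact filtering and title is the only full-text field, an explicit mapping avoids the duplicate text/keyword versions that dynamic mapping would create (Elasticsearch 5.x syntax; index, type, and field names are illustrative):

PUT products
{
    "mappings": {
        "product": {
            "properties": {
                "region": { "type": "keyword" },
                "title": { "type": "text" }
            }
        }
    }
}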

Tune indexing performance

For indexing heavy scenarios like logging and monitoring, indexing performance is the key metric. Here are some suggestions.

  • Use bulk requests (see the _bulk sketch at the end of this section).
  • Use multiple threads/workers to send requests.
  • Increase the refresh interval. Every time a refresh happens, Elasticsearch creates a new Lucene segment and merges segments later. Increasing the refresh interval reduces the cost of segment creation and merging. Note that documents only become searchable after a refresh happens.
Relationship between performance and refresh interval

From the above diagram, we can see that the throughput increased and response time decreased as the refresh interval increased. We can use the request below to check how many segments we have and how much time is spent on refresh and merge.

GET index_name/_stats?filter_path=indices.**.refresh,indices.**.segments,indices.**.merges
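For reference, the refresh interval itself can be adjusted dynamically through the index settings API; this is a sketch with an illustrative index name and value:

PUT index_name/_settings
{
    "index": {
        "refresh_interval": "30s"
    }
}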
  • Reduce the replica number. Elasticsearch needs to write documents to the primary shard and all replica shards for every indexing request. Obviously, a high replica number slows down indexing speed, but on the other hand, it improves search performance. We will talk about this later in this article.
Relationship between performance and replica number

From the above diagram, we can see that throughput decreased and response time increased as the replica number increased.

  • Use auto-generated IDs if possible. An Elasticsearch auto-generated ID is guaranteed to be unique, which avoids a version lookup. If a customer really needs to use a self-defined ID, our suggestion is to pick an ID that is friendly to Lucene, such as zero-padded sequential IDs, UUID-1, or nanosecond timestamps. These IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offers poor compression and slows Lucene down.
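The _bulk sketch referenced at the top of this list: several documents are sent in one request, and because no _id is supplied, Elasticsearch auto-generates the IDs. The index, type, and field names are hypothetical.

POST _bulk
{ "index": { "_index": "logs-2018.01.09", "_type": "log" } }
{ "level": "INFO", "message": "first event", "@timestamp": "2018-01-09T14:15:00Z" }
{ "index": { "_index": "logs-2018.01.09", "_type": "log" } }
{ "level": "WARN", "message": "second event", "@timestamp": "2018-01-09T14:16:00Z" }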

Tune search performance

A primary reason for using Elasticsearch is to support searches through data. Users should be able to quickly locate the information they are looking for. Search performance depends on quite a few factors.

  • Use filter context instead of query context if possible. A query clause answers “How well does this document match this clause?” A filter clause answers “Does this document match this clause?” Elasticsearch only needs to answer “yes” or “no”; it does not need to calculate a relevancy score for a filter clause, and filter results can be cached. See Query and filter context for details.
Comparison between query and filter

  • Increase the refresh interval. As we mentioned in the tune indexing performance section, Elasticsearch creates a new segment every time a refresh happens. Increasing the refresh interval helps reduce the segment count and the I/O cost of searching. Also, the cache is invalidated once a refresh happens and data has changed, so increasing the refresh interval lets Elasticsearch utilize the cache more efficiently.
  • Increase the replica number. Elasticsearch can perform a search on either a primary or a replica shard. The more replicas you have, the more nodes can be involved in your search.
Relationship between performance and replica number

From the above diagram, you can see that search throughput scales nearly linearly with the replica number. Note that in this test the cluster had enough data nodes to ensure every shard had an exclusive node. If this condition cannot be satisfied, search throughput will not scale as well.
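The replica count can be changed on a live index, so it is easy to experiment; a minimal sketch with an illustrative index name and value:

PUT index_name/_settings
{
    "index": {
        "number_of_replicas": 2
    }
}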

  • Try different shard numbers. “How many shards should I set for my index?” This may be the most frequently asked question we’ve seen. Unfortunately, there is no number that is right for all scenarios; it fully depends on your situation.

Too small a shard number can make the search unable to scale out. For example, if the shard number is set to 1, all documents in your index are stored in one shard, so only one node can be involved in every search. That is time consuming if you have a lot of documents. On the other hand, creating an index with too many shards also hurts performance, because Elasticsearch needs to run the query on all shards (unless a routing key is specified in the request) and then fetch and merge all of the returned results.

From our experience, if the index is smaller than 1GB, it is fine to set the shard number to 1. For most scenarios, we can leave the shard number at the default value of 5, but if the shard size exceeds 30GB, we should increase the shard number to split the index into more shards. The shard number cannot be changed once an index is created, but we can create a new index and use the reindex API to move the data, as sketched below.
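The sketch referenced above: create a new index with the desired shard count and move the data with the reindex API (the index names and shard count are illustrative):

PUT new_index
{
    "settings": {
        "index": {
            "number_of_shards": 11
        }
    }
}

POST _reindex
{
    "source": { "index": "old_index" },
    "dest": { "index": "new_index" }
}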

We tested an index that has 100 million documents and is about 150GB. We used 100 threads to send search requests.

Relationship between performance and shard number

From the above diagram, we can see that the optimal shard number is 11. Search throughput increased (and response time decreased) at the beginning, but throughput then decreased (and response time increased) as the shard number kept increasing.

Note that in this test, just like in the replica number test, every shard has an exclusive node. If this condition cannot be satisfied, search throughput would not be as good as this diagram.

In this case, we recommend trying a shard number lower than the optimal value, since using a large shard number while keeping an exclusive data node for every shard requires a lot of nodes.

  • Node query cache. The node query cache only caches queries that are being used in a filter context. Unlike a query clause, a filter clause is a "yes" or "no" question. Elasticsearch uses a bit set mechanism to cache filter results, so that later queries with the same filter are accelerated. Note that only segments that hold more than 10,000 documents (or 3% of the total documents, whichever is larger) enable the query cache. For more details, see All about caching.

We can use the following request to check whether a node query cache is having an effect.

GET index_name/_stats?filter_path=indices.**.query_cache
{
  "indices": {
    "index_name": {
      "primaries": {
        "query_cache": {
          "memory_size_in_bytes": 46004616,
          "total_count": 1588886,
          "hit_count": 515001,
          "miss_count": 1073885,
          "cache_size": 630,
          "cache_count": 630,
          "evictions": 0
        }
      },
      "total": {
        "query_cache": {
          "memory_size_in_bytes": 46004616,
          "total_count": 1588886,
          "hit_count": 515001,
          "miss_count": 1073885,
          "cache_size": 630,
          "cache_count": 630,
          "evictions": 0
        }
      }
    }
  }
}
  • Shard query cache. If most of your queries are aggregation queries, we should look at the shard query cache, which can cache aggregation results so that Elasticsearch serves the request directly with little cost. There are several things to keep in mind:
    • Set "size":0. The shard query cache only caches aggregation results and suggestions. It doesn't cache hits, so if you set size to a non-zero value, you cannot benefit from caching.
    • The payload JSON must be the same. The shard query cache uses the JSON body as the cache key, so you need to ensure the JSON body doesn't change and that the keys in the JSON body are always in the same order.
    • Round your date/time values. Do not use a variable like Date.now in your query directly; round it. Otherwise, you will have a different payload body for every request, which makes the cache always invalid. We suggest rounding your date/time values to the hour or day to utilize the cache more efficiently.

We can use the request below to check whether the shard query cache has an effect.

GET index_name/_stats?filter_path=indices.**.request_cache
{
  "indices": {
    "index_name": {
      "primaries": {
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 541,
          "miss_count": 514098
        }
      },
      "total": {
        "request_cache": {
          "memory_size_in_bytes": 0,
          "evictions": 0,
          "hit_count": 982,
          "miss_count": 947321
        }
      }
    }
  }
}
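Putting the three points above together, a sketch of a cacheable aggregation request might look like the following. The index and field names are hypothetical; note "size": 0, the request_cache parameter, and the date math rounded to the day (now/d):

GET orders/_search?request_cache=true
{
    "size": 0,
    "query": {
        "range": {
            "@timestamp": {
                "gte": "now-7d/d",
                "lt": "now/d"
            }
        }
    },
    "aggs": {
        "orders_per_day": {
            "date_histogram": {
                "field": "@timestamp",
                "interval": "day"
            }
        }
    }
}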
  • Retrieve only necessary fields. If your documents are large and you need only a few fields, use stored_fields to retrieve only the fields you need instead of all fields.
  • Avoid searching stop words. Stop words like “a" and "the” may cause the query hit count to explode. Imagine you have a million documents. Searching for “fox” may return tens of hits, but searching for “the fox” may return all documents in your index, since “the” appears in nearly all documents. Elasticsearch needs to score and sort all hit results, so a query like “the fox” slows down the whole system. You can use the stop token filter to remove stop words, or use the “and” operator to change the query from “the fox” to “the AND fox” to get a more precise hit result.

If some words are frequently used in your index but are not in the default stop words list, you can use cutoff_frequency to handle them dynamically, as sketched below.
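A sketch of both options on a match query (the index and field names are hypothetical; cutoff_frequency is a match-query option in Elasticsearch 5.x that treats terms above the given document frequency as high-frequency terms so they do not drive the main query):

GET documents/_search
{
    "query": {
        "match": {
            "body": {
                "query": "the fox",
                "operator": "and",
                "cutoff_frequency": 0.01
            }
        }
    }
}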

  • Sort by _doc if you don’t care about the order in which documents are returned. Elasticsearch uses the “_score” field to sort by score by default. If you don’t care about the order, you can use “sort”: “_doc” to let Elasticsearch return hits in index order.
  • Avoid using a script query to calculate hits in flight; store the calculated field at indexing time instead. For example, we have an index with a lot of user information, and we need to query all users whose number starts with "1234." You might want to run a script query like "source": "doc['num'].value.startsWith('1234')". This query is really resource-consuming and slows down the whole system. Instead, consider adding a field named "num_prefix" when indexing. Then we can just query "num_prefix": "1234" (see the sketch after this list).
  • Avoid wildcard queries.
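The sketch referenced above: with a num_prefix field (assumed to be populated at indexing time and mapped as a keyword), the expensive script query can be replaced by a simple term query:

GET users/_search
{
    "query": {
        "term": {
            "num_prefix": "1234"
        }
    }
}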

Run performance tests

For every change, it’s necessary to run performance tests to verify that the change is applicable. Because Elasticsearch is a RESTful service, you can use tools like Rally, Apache JMeter, and Gatling to run performance tests. However, because the Pronto team needs to run a lot of benchmark tests on every machine type and Elasticsearch version, and we need to run performance tests for combinations of Elasticsearch configuration parameters on many Elasticsearch clusters, these tools could not satisfy our requirements.

The Pronto team built an online performance analysis service based on Gatling to help customers and ourselves run performance tests and regressions. The features provided by the service allow us to:

  1. Easily add/edit tests. Users can generate tests according to their input queries or document structure, without Gatling or Scala knowledge.
  2. Run multiple tests in sequence without human involvement. It can check status and change Elasticsearch settings before/after every test.
  3. Help users compare and analyze test results. Test results and cluster statistics gathered during testing are persisted and can be analyzed with predefined Kibana visualizations.
  4. Run tests from the command line or a web UI. A REST API is also provided for easy integration with other systems.

Here is the architecture.

Performance test service architecture

Users can view Gatling reports for every test and view Kibana predefined visualizations for further analysis and comparison, as shown below.

Gatling report

Test service report

Summary

This article summarizes the index/shard/replica design considerations, as well as some other configurations, that you should take into account when designing an Elasticsearch cluster to meet high ingestion and search performance expectations. It also illustrates how Pronto strategically helps customers with initial sizing, index design and tuning, and performance testing. As of today, the Pronto team has helped a number of customers, including Order Management System (OMS) and Search Engine Optimization (SEO), achieve their demanding performance goals and thus contribute to eBay's key business.

Elasticsearch performance depends on a lot of factors, including document structure, document size, index settings/mappings, request rate, dataset size, query hit count, and so on. A recommendation for one scenario does not necessarily work for another one. It’s important to test performance thoroughly, gather telemetry, tune the configuration based on your workloads, and optimize the clusters to meet your performance requirement.
