Hive Join Optimization (translated)
Join Optimization

- Improvements to the Hive Optimizer
- Star Join Optimization
  - Star Schema Example
  - Prior Support for MAPJOIN
  - Limitations of Prior Implementation
  - Enhancements for Star Joins
- Optimize Chains of Map Joins
  - Current and Future Optimizations
- Optimize Auto Join Conversion
  - Current Optimization
- Auto Conversion to SMB Map Join
- Generate Hash Tables on the Task Side
  - Pros and Cons of Client-Side Hash Tables
  - Task-Side Generation of Hash Tables
  - Further Options for Optimization
Improvements to the Hive Optimizer
Hive automatically recognizes various use cases and optimizes for them. Hive 0.11 improves the optimizer for these cases:
- Joins where one side fits in memory. In the new optimization:
  - that side is loaded into memory as a hash table
  - only the larger table needs to be scanned
  - fact tables have a smaller footprint in memory
- Star-schema joins:
  - Hints are no longer needed for many cases.
  - Map joins are automatically picked up by the optimizer.
Star Join Optimization
Star Schema Example
select count(*) cnt
from store_sales ss
     join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
where t.t_hour = 8
     and t.t_minute >= 30
     and hd.hd_dep_count = 2
order by cnt;
(Translator's note: this is the standard star schema used in data warehousing. It is a basic BI concept, so it is not explained further here.)
Prior Support for MAPJOIN
The default value for hive.auto.convert.join was false in Hive 0.10.0. Hive 0.11.0 changed the default to true (HIVE-3297).
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:
- Local work:
  - read records via standard table scan (including filters and projections) from source on local machine
  - build hashtable in memory
  - write hashtable to local disk
  - upload hashtable to dfs
  - add hashtable to distributed cache
- Map task:
  - read hashtable from local disk (distributed cache) into memory
  - match records' keys against the hashtable keys
  - combine matches and write to output
- No reduce task
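The division of labor above can be sketched in a few lines. The following is a minimal Python illustration of the map-side hash join idea only (it is not Hive's implementation, and all table and column names are made up for the example): the small table becomes an in-memory hash table, and each streamed record of the large table probes it, with no reduce phase.

```python
# Minimal sketch of a map-side hash join (illustration only, not Hive code).
def map_join(small_table, large_table, small_key, large_key):
    # "Local work": build an in-memory hash table from the small table.
    hashtable = {}
    for row in small_table:
        hashtable.setdefault(row[small_key], []).append(row)
    # "Map task": stream the large table and probe the hash table.
    for row in large_table:
        for match in hashtable.get(row[large_key], []):
            yield {**match, **row}

time_dim = [{"t_time_sk": 1, "t_hour": 8}, {"t_time_sk": 2, "t_hour": 9}]
store_sales = [{"ss_sold_time_sk": 1, "ss_item": "a"},
               {"ss_sold_time_sk": 1, "ss_item": "b"},
               {"ss_sold_time_sk": 3, "ss_item": "c"}]

rows = list(map_join(time_dim, store_sales, "t_time_sk", "ss_sold_time_sk"))
# Two store_sales rows match t_time_sk = 1; the third finds no match.
```

Only the large table is scanned; the dimension table is read once while building the hash table.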
Limitations of Prior Implementation
- The MAPJOIN operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
- Hints are cumbersome for users to apply correctly, and auto conversion doesn't have enough logic to consistently predict whether a MAPJOIN will fit into memory or not. (Translator's note: even with a hint, a map join is not guaranteed.)
- A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery....). Auto conversion never produces a single map-only job.
- The hashtable for the MAPJOIN operator has to be generated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.
Enhancements for Star Joins
The join optimizations can be grouped into three parts:
- Execute chains of mapjoins in the operator tree in a single map-only job, when map-join hints are used.
- Extend the optimization to the auto-conversion case (generating an appropriate backup plan when optimizing).
- Generate the in-memory hashtable completely on the task side. (Future work. Currently the hashtable is built locally on the client, uploaded to HDFS, and distributed to each map task via the distributed cache.)
Optimize Chains of Map Joins
The following query will produce two separate map-only jobs when executed:
select /*+ MAPJOIN(time_dim, date_dim) */ count(*)
from store_sales
     join time_dim on (ss_sold_time_sk = t_time_sk)
     join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002
It is likely, though, that for small dimension tables the parts of both tables needed would fit into memory at the same time. This reduces the time needed to execute this query dramatically, as the fact table is only read once, instead of being read twice and written to HDFS to communicate between the jobs.
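The single-pass idea can be sketched as follows. This is a conceptual Python illustration of chained map joins, not Hive code: both dimension hash tables are held in memory at once, so the fact table is scanned exactly once while both joins and both filters are applied.

```python
# Sketch: two chained map joins executed in a single pass over the fact table,
# so store_sales is scanned once instead of once per join (illustration only).
time_dim = {1: {"t_hour": 8}, 2: {"t_hour": 9}}          # hashtable for dim 1
date_dim = {10: {"d_year": 2002}, 11: {"d_year": 2003}}  # hashtable for dim 2
store_sales = [
    {"ss_sold_time_sk": 1, "ss_sold_date_sk": 10},
    {"ss_sold_time_sk": 1, "ss_sold_date_sk": 11},
    {"ss_sold_time_sk": 2, "ss_sold_date_sk": 10},
]

count = 0
for row in store_sales:  # single scan of the fact table
    t = time_dim.get(row["ss_sold_time_sk"])
    d = date_dim.get(row["ss_sold_date_sk"])
    if t and t["t_hour"] == 8 and d and d["d_year"] == 2002:
        count += 1
# Only the first row survives both joins and both filters.
```

With separate jobs, the intermediate result of the first join would have to be written to HDFS and re-read, which is exactly the cost this optimization removes.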
Current and Future Optimizations
- Merge M*-MR patterns into a single MR job.
- Merge MJ->MJ into a single MJ when possible.
- Merge MJ* patterns into a single map stage as a chain of MJ operators. (Not yet implemented. Translator's note: this is why the example above still runs as two separate map-only jobs rather than one.)
If hive.auto.convert.join is set to true, the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible.
Optimize Auto Join Conversion
When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000;
The default for hive.auto.convert.join.noconditionaltask is true, which means auto conversion is enabled. The size configuration enables the user to control what size of table can fit in memory. This value represents the sum of the sizes of the tables that can be converted to hashmaps that fit in memory. Currently, n-1 tables of the join have to fit in memory for the map-join optimization to take effect; the largest table stays on disk and is streamed. There is no check to see whether the table is compressed or what the potential in-memory size of the table will be. (Translator's note: presumably the raw file size on HDFS is used.) The effect of this assumption on the results is discussed in the next section.
For example, the previous query simply becomes:
select count(*)
from store_sales
     join time_dim on (ss_sold_time_sk = t_time_sk)
     join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002
If time_dim and date_dim fit in the size configuration provided, the respective joins are converted to map-joins. (Translator's note: I believe the size here refers to hive.smalltable.filesize, which defaults to 25 MB; oddly, that value is larger than noconditionaltask.size.) If the sum of the sizes of the tables fits in the configured size, then the two map-joins are combined, resulting in a single map-join. This reduces the number of MR jobs required and significantly boosts the speed of execution of this query. This example can be easily extended for multi-way joins as well and will work as expected.
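The size checks described above can be sketched as follows. This is a hypothetical helper for illustration only (the function name is made up, and Hive's actual checks are more involved): the n-1 smallest tables must together fit under noconditionaltask.size for the joins to collapse into a single map-join.

```python
# Hypothetical sketch of the size check described above (not Hive source code).
def can_convert_to_single_mapjoin(table_sizes_bytes, noconditionaltask_size):
    sizes = sorted(table_sizes_bytes)
    # All tables except the largest one (the streamed fact table) must
    # together fit within the configured size to be turned into hashmaps.
    return sum(sizes[:-1]) <= noconditionaltask_size

# store_sales ~ 1 GB, time_dim = 4000 bytes, date_dim = 3000 bytes
ok = can_convert_to_single_mapjoin([10**9, 4000, 3000], 10000)
# ok is True: 4000 + 3000 = 7000 <= 10000, so both dims can be hashed
# and the two map-joins merge into one.
```

Note that, as the text says, the sizes used are file sizes, so compression or wide in-memory representations can make the real hashtable much larger than this check assumes.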
Outer joins offer more challenges. Since a map-join operator can only stream one table, the streamed table needs to be the one from which all of the rows are required. For the left outer join, this is the table on the left side of the join; for the right outer join, the table on the right side; and so on. This means that even though an inner join can be converted to a map-join, an outer join cannot be converted in the same way. An outer join can only be converted if the table(s) apart from the one that needs to be streamed can fit in the size configuration. A full outer join cannot be converted to a map-join at all, since both tables need to be streamed.
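The streaming constraint can be made concrete with a small sketch. The following Python illustration (not Hive code) shows why the streamed side of a left outer join must be the left table: every streamed row is preserved, and unmatched rows are NULL-extended, which is only possible for the side being streamed.

```python
# Sketch: a left outer join as a map join. The LEFT table must be the streamed
# one, because every one of its rows must appear in the output (illustration only).
def left_outer_map_join(stream_rows, hashed_rows, stream_key, hash_key):
    hashtable = {}
    for row in hashed_rows:
        hashtable.setdefault(row[hash_key], []).append(row)
    for row in stream_rows:            # every streamed row is preserved
        matches = hashtable.get(row[stream_key])
        if matches:
            for m in matches:
                yield {**row, **m}
        else:
            yield {**row, hash_key: None}  # unmatched: NULL-extend the right side

left = [{"k": 1}, {"k": 2}]
right = [{"k2": 1, "v": "x"}]
rows = list(left_outer_map_join(left, right, "k", "k2"))
# k=1 matches; k=2 has no match but still appears, with k2 set to None.
```

For a full outer join the same guarantee is needed on both sides at once, which is why neither side can be reduced to a hash table.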
Auto join conversion also affects sort-merge-bucket joins.
Current Optimization
Group as many MJ operators as possible into one MJ.
Auto Conversion to SMB Map Join
Sort-Merge-Bucket (SMB) joins can be converted to SMB map joins as well. SMB joins are used wherever the tables are sorted and bucketed. The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join. However, if the tables are partitioned, there could be a slowdown, as each mapper would need to fetch a very small chunk of a partition which has a single key.
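The "merging already sorted tables" idea can be sketched as a classic merge join. This Python illustration (not Hive code, and simplified to assume each key appears at most once per side) walks two sorted bucket files in a single forward pass with no hash table at all:

```python
# Sketch of a sort-merge join: two inputs already sorted on the join key are
# merged in one forward pass (illustration only; assumes unique keys per side).
def sort_merge_join(left, right, key):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk == rk:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif lk < rk:
            i += 1   # left key is behind: advance the left cursor
        else:
            j += 1   # right key is behind: advance the right cursor
    return out

a = [{"k": 1, "a": "x"}, {"k": 3, "a": "y"}]
b = [{"k": 1, "b": "p"}, {"k": 2, "b": "q"}, {"k": 3, "b": "r"}]
rows = sort_merge_join(a, b, "k")
# keys 1 and 3 match; key 2 exists only on the right and is skipped.
```

Because no hash table is built, memory use stays flat regardless of table size, which is what makes this faster than an ordinary map-join when the sort and bucketing preconditions hold.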
The following configuration settings enable the conversion of an SMB join to a map-join SMB join:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
There is an option to set the big table selection policy using the following configuration:
set hive.auto.convert.sortmerge.join.bigtable.selection.policy = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
By default, the selection policy is based on average partition size. The big table selection policy helps determine which table to choose for streaming only, as opposed to hashing and streaming.
The available selection policies are:
org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)
org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ
org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ
The names describe their uses. This is especially useful for the fact-fact join case.
Generate Hash Tables on the Task Side
Future work will make it possible to generate in-memory hashtables completely on the task side. (Translator's note: currently they are generated on the client.)
Pros and Cons of Client-Side Hash Tables
Generating the hashtable (or multiple hashtables for multi-table joins) on the client machine has drawbacks. (The client machine is the host that is used to run the Hive client and submit jobs.)
- Data locality: The client machine typically is not a data node. All the data accessed is remote and has to be read via the network.
- Specs: For the same reason, it is not clear what the specifications of the machine running this processing will be. It might have limitations in memory, hard drive, or CPU that the task nodes do not have.
- HDFS upload: The data has to be brought back to the cluster and replicated via the distributed cache to be used by the task nodes.
Pre-processing the hashtables on the client machine also has some benefits:
1. What is stored in the distributed cache is likely to be smaller than the original table (because of filtering and projection).
2. In contrast, loading hashtables directly on the task nodes using the distributed cache means larger objects in the cache, potentially reducing opportunities for using MAPJOIN.
Task-Side Generation of Hash Tables
When the hashtables are generated completely on the task side, all task nodes have to access the original data source to generate their hashtables. Since in the normal case this happens in parallel, it will not affect latency; but Hive has a concept of storage handlers, and having many tasks access the same external data source (HBase, a database, etc.) might overwhelm or slow down that source.
Further Options for Optimization
1. Increase the replication factor on dimension tables.
2. Use the distributed cache to hold dimension tables.