Analysis and Resolution of OOM Errors When Running MapReduce Jobs in Hive

Tags: hive, mapreduce | Published: 2014-11-22 04:02 | Author: u014388509
Source: http://blog.csdn.net

I. Introduction

Today, while running a long, very complex SQL query that joins N tables spanning more than 10 years of data, an OOM exception occurred.


Because each map task typically processes only a 64 MB or 128 MB input split, an OOM in the map phase is rare; it usually happens in the reduce phase.

But after looking closely at today's exception, it turned out to be neither the map phase nor the reduce phase — not the execution at all, but the driver OOM'ing while submitting the job.

In Hive, XMLEncoder serializing the MapredWork triggers an OutOfMemoryError:

XMLEncoder causes java.lang.OutOfMemoryError: GC overhead limit exceeded


II. Overview

To summarize first, OOM exceptions in Hive fall roughly into the following categories:

1. OOM in the map phase.
2. OOM in the reduce phase.
3. OOM while the driver submits the job.


Map-phase OOM:

1. The chance of an OOM here is very small, unless the program logic is broken, or the code is written inefficiently and produces too much garbage.
 

Reduce-phase OOM:

1. Data skew

Data skew is one trigger: an uneven key distribution leaves a single reducer with far more data than expected, driving the JVM into frequent GC.

2. Too many or oversized value objects

A single reducer accumulates too many value objects, again causing frequent GC.


Remedies:

1. Increase the number of reducers, e.g. set mapred.reduce.tasks=300;

2. In hive-site.xml, or in the hive shell, set mapred.child.java.opts=-Xmx512m;

   or raise only the reducers' max heap to 2 GB and switch to the concurrent mark-sweep (CMS) collector, which noticeably shortens GC pauses at a small CPU cost:

   set mapred.reduce.child.java.opts=-Xmx2g -XX:+UseConcMarkSweepGC;

3. Use a map join instead of a common join: set hive.auto.convert.join=true;

4. Set hive.optimize.skewjoin=true to mitigate data skew.
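Put together, a session-level tuning pass for a reduce-side OOM might look like the following minimal sketch; the values are illustrative assumptions, not recommendations, and should be sized to your own cluster and data:

-- spread the keys over more reducers
set mapred.reduce.tasks=300;
-- give each reduce task a larger heap and use the CMS collector
set mapred.reduce.child.java.opts=-Xmx2g -XX:+UseConcMarkSweepGC;
-- let Hive convert small-table joins into map joins automatically
set hive.auto.convert.join=true;
-- handle heavily skewed join keys in a follow-up job
set hive.optimize.skewjoin=true;
-- rows per key above which a key is treated as skewed (this is the usual default)
set hive.skewjoin.key=100000;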


Driver job-submission OOM:

The execution plan generated for the job contains too many entries — for example when the query scans a very large number of partitions, on the order of 4k-6k, and several of the joined tables each have that many partitions.

The root cause is that during serialization each partition, i.e. each HDFS file path, gets wrapped in a Path object. With that many objects, and with too small a heap configured when the driver starts, serializing the MapredWork inside the driver creates so many objects that the JVM GCs constantly and finally throws:

java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.nio.cs.UTF_8.newEncoder(UTF_8.java:53)
at java.beans.XMLEncoder.createString(XMLEncoder.java:572)


III. Diagnosing the Problem

How the problem was pinned down:

Searching the web for the exception, I found an issue in Hive's JIRA that matched my situation.

Problem description:

Use of XMLEncoder to serialize MapredWork causes OOM in hive cli

When running queries on tables with 6000 partitions, hive cli if configured with 128M runs into OOM. Heapdump showed 37MB occupied by one XMLEncoder object while the MapredWork was 500K which is highly inefficient. We should switch to using something more efficient like XStream.
A closely related explanation:
I ran with 128M to investigate the OOM. We have resorted to running with 1G as XmX because we keep hitting OOM with bigger tables in hive. There were other things that contributed to the memory usage - mostly Path objects because of the higher number of partitions. But they are absolutely needed. XMLEncoder is something that created too much garbage in a very short span and caused GC. That would be something easy to change/fix without having to touch the core logic.
We should be looking at fixing the root cause of the problem instead of keeping on increasing the memory requirements. Ours is a highly multi-tenant system and there are lot of other programs(pig,etc) running too in the gateway. So running with a lower memory(256-512MB) will help.
Found two other reports of this issue:
http://mail-archives.apache.org/mod_mbox/hive-user/201106.mbox/%[email protected]%3E
https://issues.apache.org/jira/browse/HIVE-1316
This fix increased the max heap size of CLI client and disabled GC overhead limit.

So I set out to confirm for myself where exactly the problem was:

Because this complex SQL produces 22 jobs and the OOM always appeared at job 20, I deliberately emptied the data and ran it in the shell to debug.
Job 20 turned out to be exactly the one scanning the registration table with 14 years of history, so I inferred that the excessive number of partitions creates too many objects at serialization time and causes the OOM.
(Debugging screenshots from the original post are omitted here.)
Printing out the query plan shows that it is huge, mainly because of one registration table that scans nearly 14 years of history: 14 × 365 = 5110 partitions, roughly 3138 once the incomplete early years are excluded.

Each partition generates a Path object, so this single table already eats most of the memory when its query plan is serialized into objects; joining it against several more years of data from other tables costs still more memory, the JVM GCs constantly without reclaiming enough, and eventually an OOM is thrown.

Even a single query printed a 25 MB plan, which shows just how much data is being scanned.

  
explain extended  a_complex_hql_query > /app/hadoop/shengli/large_plan
du -sh large_plan 
25M     large_plan
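The partition count itself can also be confirmed straight from the shell. A quick check along the following lines (the table name is only a placeholder for the actual registration table) prints how many partitions the plan has to enumerate:

# hypothetical table name; counts the partitions the query plan must enumerate
hive -S -e "show partitions my_registration_table" | wc -l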

The detailed log follows:

2014-11-21 13:13:28,334 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Hadoop job information for Stage-23: number of mappers: 12; number of reducers: 1
2014-11-21 13:13:28,354 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:13:28,347 Stage-23 map = 50%,  reduce = 0%, Cumulative CPU 42.04 sec
2014-11-21 13:13:48,370 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:13:48,363 Stage-23 map = 100%,  reduce = 33%, Cumulative CPU 262.39 sec
2014-11-21 13:14:29,228 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:14:29,222 Stage-23 map = 100%,  reduce = 69%, Cumulative CPU 262.39 sec
2014-11-21 13:14:49,248 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:14:49,239 Stage-23 map = 100%,  reduce = 92%, Cumulative CPU 324.73 sec
2014-11-21 13:15:11,952 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:15:11,944 Stage-23 map = 100%,  reduce = 100%, Cumulative CPU 360.1 sec
2014-11-21 13:15:36,740 INFO [Thread-2] AbstractBaseAction$StreamDrainer - 2014-11-21 13:15:36,734 Stage-23 map = 100%,  reduce = 100%, Cumulative CPU 360.1 sec
2014-11-21 13:15:36,741 INFO [Thread-2] AbstractBaseAction$StreamDrainer - MapReduce Total cumulative CPU time: 6 minutes 0 seconds 100 msec
2014-11-21 13:15:36,749 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Ended Job = job_201411141019_68657
2014-11-21 13:15:40,306 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Launching Job 20 out of 21
2014-11-21 13:15:42,162 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Number of reduce tasks not specified. Estimated from input data size: 252
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer - In order to change the average load for a reducer (in bytes):
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer -   set hive.exec.reducers.bytes.per.reducer=<number>
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer - In order to limit the maximum number of reducers:
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer -   set hive.exec.reducers.max=<number>
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer - In order to set a constant number of reducers:
2014-11-21 13:15:42,163 INFO [Thread-2] AbstractBaseAction$StreamDrainer -   set mapred.reduce.tasks=<number>
2014-11-21 13:16:40,377 INFO [Thread-2] AbstractBaseAction$StreamDrainer - java.lang.OutOfMemoryError: GC overhead limit exceeded
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at sun.nio.cs.UTF_8.newEncoder(UTF_8.java:53)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.createString(XMLEncoder.java:572)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:543)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:682)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:687)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:552)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:682)
2014-11-21 13:16:40,378 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:687)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:552)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:682)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:687)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:552)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:682)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:687)
2014-11-21 13:16:40,379 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:552)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:682)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputStatement(XMLEncoder.java:687)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.outputValue(XMLEncoder.java:552)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.flush(XMLEncoder.java:398)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at java.beans.XMLEncoder.close(XMLEncoder.java:429)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.Utilities.serializeMapRedWork(Utilities.java:532)
2014-11-21 13:16:40,380 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.Utilities.setMapRedWork(Utilities.java:376)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:418)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:144)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1355)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1139)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer -      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
2014-11-21 13:16:40,381 INFO [Thread-2] AbstractBaseAction$StreamDrainer - FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - MapReduce Jobs Launched: 
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 0: Map: 33  Reduce: 5   Cumulative CPU: 903.59 sec   HDFS Read: 4225680985 HDFS Write: 123129169 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 1: Map: 192  Reduce: 4   Cumulative CPU: 3215.94 sec   HDFS Read: 3036813878 HDFS Write: 322647287 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 2: Map: 52  Reduce: 4   Cumulative CPU: 1314.04 sec   HDFS Read: 3278902794 HDFS Write: 435711878 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 3: Map: 959  Reduce: 70   Cumulative CPU: 55831.24 sec   HDFS Read: 69728403753 HDFS Write: 4573778468 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 4: Map: 154  Reduce: 1   Cumulative CPU: 1809.45 sec   HDFS Read: 483233734 HDFS Write: 25999761 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 5: Map: 84  Reduce: 6   Cumulative CPU: 2466.01 sec   HDFS Read: 5618486947 HDFS Write: 1649704865 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 6: Map: 450  Reduce: 55   Cumulative CPU: 22239.14 sec   HDFS Read: 54635746333 HDFS Write: 7282315124 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 7: Map: 359  Reduce: 27   Cumulative CPU: 14699.06 sec   HDFS Read: 26702083597 HDFS Write: 2364033090 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 8: Map: 83  Reduce: 6   Cumulative CPU: 2410.03 sec   HDFS Read: 5514015601 HDFS Write: 628742985 SUCCESS
2014-11-21 13:16:40,382 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 9: Map: 84  Reduce: 6   Cumulative CPU: 2200.0 sec   HDFS Read: 5618486947 HDFS Write: 853325331 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 10: Map: 365  Reduce: 27   Cumulative CPU: 13274.58 sec   HDFS Read: 27143622450 HDFS Write: 2991235104 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 11: Map: 959  Reduce: 70   Cumulative CPU: 55145.65 sec   HDFS Read: 69728403753 HDFS Write: 4335851625 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 12: Map: 172  Reduce: 1   Cumulative CPU: 1561.64 sec   HDFS Read: 945050606 HDFS Write: 40445679 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 13: Map: 260  Reduce: 9   Cumulative CPU: 5965.67 sec   HDFS Read: 8639664681 HDFS Write: 382770949 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 14: Map: 212  Reduce: 15   Cumulative CPU: 7573.68 sec   HDFS Read: 14849298755 HDFS Write: 1095184574 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 15: Map: 16  Reduce: 2   Cumulative CPU: 981.37 sec   HDFS Read: 1660614485 HDFS Write: 795259370 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 16: Map: 27  Reduce: 3   Cumulative CPU: 1357.34 sec   HDFS Read: 2991246795 HDFS Write: 2388600301 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 17: Map: 70  Reduce: 5   Cumulative CPU: 2138.48 sec   HDFS Read: 4573808778 HDFS Write: 2909571621 SUCCESS
2014-11-21 13:16:40,383 INFO [Thread-2] AbstractBaseAction$StreamDrainer - Job 18: Map: 12  Reduce: 1   Cumulative CPU: 360.1 sec   HDFS Read: 853346317 HDFS Write: 705875733 SUCCESS

IV. Solution for Driver-Stage OOM:

The cause has been found: the scanned tables have too many partitions, on the order of 3,000 to 6,000, so when the plan is serialized the Path objects alone consume most of the driver's memory; if the driver's heap is set too small, it simply OOMs.

Approach:

1. Reduce the number of partitions: consolidate the historical data into a single merged table plus an incremental table, leaving only a handful of partitions (see the sketch after this list).

2. Increase the heap size of the Hive CLI driver. The default is 256 MB; raise it to 512 MB or more.

   Concretely, the JVM options used to launch the CLI can be found in bin/hive and bin/hive-config.

   Here we set:

export HADOOP_HEAPSIZE=512
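For point 1, a rough sketch of the consolidation is below. The table name, partition column, and cutoff are hypothetical — the real layout depends on your schema — but the idea is that everything older than the current year collapses into one unpartitioned table, so the plan only has to enumerate the recent partitions:

-- hypothetical names: 'registration' is the 3000+ partition source table, 'pt' its daily partition column
CREATE TABLE registration_merged AS
SELECT *
FROM registration
WHERE pt < '20140101';

-- later queries then read the merged table plus only the recent, still-partitioned slice, e.g.
-- SELECT ... FROM registration_merged
-- UNION ALL
-- SELECT ... FROM registration WHERE pt >= '20140101'

With thousands of partitions no longer enumerated as Path objects, the serialized MapredWork shrinks accordingly.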


My solution was to do both:

build the consolidated table, which is also more convenient to use, and increase the Hive CLI driver's heap size, keeping the production runs stable.


V. Summary:

To wrap up — the whole day went into wrestling with this problem. When you run into this kind of issue:
First, in how the HiveQL is written, scan the same table as few times as possible and scan as few partitions as possible. Scanning too much means more jobs, which is slow, and more network traffic, which is also slow.
Second, tune both the Hive parameters and the JVM parameters: for each stage, pick an appropriate JVM max heap size to head off OOM.

——EOF——
Original article; when reposting, please credit the source: http://blog.csdn.net/oopsoom/article/details/41356251
