Loading Data into Hive - Pentaho Big Data - Pentaho Wiki

Source: http://wiki.pentaho.com

How to use a PDI job to load a data file into a Hive table.

Note

For those of you familiar with Hive, you will note that a Hive table could be defined with "external" data. Using the external option, you could define a Hive table that simply uses the HDFS directory that contains the parsed file. For this how-to, we chose not to use the external option so that you can see the ease with which files can be added to non-external Hive tables.

Prerequisites

In order to follow along with this how-to guide, you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Hive

Sample Files

The sample data file needed for this guide is:

File Name              Content
weblogs_parse.txt.zip  Parsed, tab-delimited weblog data

NOTE: If you have previously completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper directory.

This file should be placed in the /user/pdi/weblogs/parse directory of HDFS using the following commands.

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000
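Note that the archive needs to be extracted before the put command above will work, and you can confirm the copy afterwards. A minimal sketch, assuming a standard unzip utility is available on the client machine:

# extract weblogs_parse.txt from the downloaded archive
unzip weblogs_parse.txt.zip

# verify the file is now in HDFS as part-00000
hadoop fs -ls /user/pdi/weblogs/parse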

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Start Hive Server if it is not already running.
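One quick way to confirm that the Hadoop daemons are up, assuming a JDK is on your path, is to list the running Java processes; you should see entries such as NameNode and DataNode:

# lists running JVM processes, including the Hadoop daemons
jps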

Create a Hive Table

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
  2. Create the Table in Hive: You need a Hive table into which to load the data, so enter the following in the Hive shell (an optional check that the table was created correctly follows this list).
    create table weblogs (
    client_ip    string,
    full_request_date    string,
    day    string,
    month    string,
    month_num    int,
    year    string,
    hour    string,
    minute    string,
    second    string,
    timezone    string,
    http_verb    string,
    uri    string,
    http_status_code    string,
    bytes_returned    string,
    referrer    string,
    user_agent    string)
    row format delimited
    fields terminated by '\t';
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
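To confirm that the table was created with the expected columns, you can run a quick check from the command line without reopening the shell. A minimal sketch, assuming the 'hive' CLI is on your path:

hive -e 'describe weblogs;'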

Create a Job to Load Hive

In this task you will be creating a job to load parsed and delimited weblog data into a Hive table. Once the data is loaded into the table, you will be able to run HiveQL statements to query this data.

Speed Tip

You can download the Kettle job load_hive.kjb already completed.

  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Job' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Job' option.
  2. Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas. Your canvas should look like:



  3. Add a Copy Files Job Entry: You will need to copy the parsed file into the Hive table, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop Copy Files' job entry onto the job canvas. Your canvas should look like:



  4. Connect the Start and Copy Files job entries: Hover the mouse over the 'Start' job entry and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Hadoop Copy Files' node. Your canvas should look like:



  5. Edit the Copy Files Job Entry: Double-click on the 'Hadoop Copy Files' job entry to edit its properties. Enter this information:
    1. File/Folder source: hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/parse

    2. File/Folder destination: hdfs://<NAMENODE>:<PORT>/user/hive/warehouse/weblogs
    3. Wildcard (RegExp): Enter 'part-.*'
    4. Click the 'Add' button to add the files to the list of files to copy.

      When you are done, your window should look like this (your folder path may be different):



      Click 'OK' to close the window.

      Notice that you could also load a local file into Hive using this step; the file does not already have to be in Hadoop. (A roughly equivalent HiveQL approach is shown after these steps.)
  6. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'load_hive.kjb' into a folder of your choice.
  7. Run the Job: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the job toolbar. An 'Execute a job' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the job as it runs. After a few seconds the job should finish successfully:

If any errors occurred, the job entry that failed will be highlighted in red, and you can use the 'Logging' tab to view error messages.
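For reference, copying files into the table's directory under /user/hive/warehouse is what makes the data visible to the non-external weblogs table. A roughly equivalent load can be done directly from the Hive shell; a minimal sketch, assuming the parsed file is still at the HDFS path used above:

-- moves (rather than copies) the file into the weblogs table's warehouse directory
load data inpath '/user/pdi/weblogs/parse/part-00000' into table weblogs;

Note that, unlike the 'Hadoop Copy Files' job entry, LOAD DATA INPATH moves the source file out of its original location.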



Check Hive

  1. Open the Hive Shell: Open the Hive shell so you can query the newly loaded table by entering 'hive' at the command line.
  2. Query Hive for Data: Verify that the data has been loaded into Hive by querying the weblogs table (a row-count check is shown after this list).
    select * from weblogs limit 10;
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
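If you want more than a visual spot check, a simple row count confirms how many records were loaded; it runs as a MapReduce job, so it will take longer than the limit query above. For example:

select count(*) from weblogs;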

Summary

During this guide you learned how to load data into a Hive table using a PDI job. PDI jobs can be used to put files into Hive from many different sources.

Other guides in this series cover how to transform data in Hive, get data out of Hive, and report on data in Hive.
