Loading Data into Hive - Pentaho Big Data - Pentaho Wiki

Source: http://wiki.pentaho.com

How to use a PDI job to load a data file into a Hive table.

Note

For those of you familiar with Hive, you will note that a Hive table can be defined over "external" data. Using the external option, you could define a Hive table that simply points at the HDFS directory containing the parsed files. For this how-to, we chose not to use the external option so that you can see how easily files can be added to non-external Hive tables.
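
For comparison only (this is a sketch, not a step in the guide), an external table over the same parsed files might look roughly like the following. The table name weblogs_ext is made up for illustration, the location is the HDFS directory used later in this how-to, and the columns mirror the weblogs table created below.

    -- illustrative only: an external table that reads the files in place
    create external table weblogs_ext (
    client_ip    string,
    full_request_date    string,
    day    string,
    month    string,
    month_num    int,
    year    string,
    hour    string,
    minute    string,
    second    string,
    timezone    string,
    http_verb    string,
    uri    string,
    http_status_code    string,
    bytes_returned    string,
    referrer    string,
    user_agent    string)
    row format delimited
    fields terminated by '\t'
    location '/user/pdi/weblogs/parse';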

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Hive

Sample Files

The sample data file needed for this guide is:

File Name              Content
weblogs_parse.txt.zip  Parsed, tab-delimited weblog data

NOTE: If you have previously completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, the necessary files will already be in the proper directory.

Unzip the archive and place the extracted weblogs_parse.txt in the /user/pdi/weblogs/parse directory of HDFS using the following commands.

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000
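
As a quick sanity check (optional, not part of the original guide), a listing should now show the file under the parse directory; on Hadoop versions that support it, 'hadoop fs -mkdir -p' can create both directories in a single call.

# optional check: the parsed file should now appear as part-00000
hadoop fs -ls /user/pdi/weblogs/parse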

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Start Hive Server if it is not already running.

Create a Hive Table

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
  2. Create the Table in Hive: You need a Hive table to load the data into, so enter the following in the Hive shell (a quick way to verify the result is sketched just after this list).
    create table weblogs (
    client_ip    string,
    full_request_date    string,
    day    string,
    month    string,
    month_num    int,
    year    string,
    hour    string,
    minute    string,
    second    string,
    timezone    string,
    http_verb    string,
    uri    string,
    http_status_code    string,
    bytes_returned    string,
    referrer    string,
    user_agent    string)
    row format delimited
    fields terminated by '\t';
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
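
As a quick check (not part of the original guide), you can confirm the table definition from the Hive shell before closing it, or by reopening it later. Both statements below are standard HiveQL; the count should return 0 at this point since nothing has been loaded yet.

    -- show the column definitions
    describe weblogs;
    -- should return 0 before the PDI job runs
    select count(*) from weblogs;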

Create a Job to Load Hive

In this task you will be creating a job to load parsed and delimited weblog data into a Hive table. Once the data is loaded into the table, you will be able to run HiveQL statements to query this data.

Speed Tip

You can download the already completed Kettle job, load_hive.kjb.
  1. Start PDI on your desktop. Once it is running, choose 'File' -> 'New' -> 'Job' from the menu system, or click the 'New file' icon on the toolbar and choose the 'Job' option.
  2. Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas. Your canvas should look like:



  3. Add a Copy Files Job Entry: You will need to copy the parsed file into the Hive table, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop Copy Files' job entry onto the job canvas. Your canvas should look like:



  4. Connect the Start and Copy Files job entries: Hover the mouse over the 'Start' job entry and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Hadoop Copy Files' node. Your canvas should look like:



  5. Edit the Copy Files Job Entry: Double-click on the 'Hadoop Copy Files' job entry to edit its properties. Enter this information:
    1. File/Folder source: hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/parse

    2. File/Folder destination: hdfs://<NAMENODE>:<PORT>/user/hive/warehouse/weblogs
    3. Wildcard (RegExp): Enter 'part-.*'
    4. Click the 'Add' button to add the files to the list of files to copy.

      When you are done your window should look like (your folder path may be different):



      Click 'OK' to close the window.

      Notice that you could also load a local file into Hive using this step; the file does not have to already be in HDFS.
  6. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'load_hive.kjb' into a folder of your choice.
  7. Run the Job: Choose 'Action' -> 'Run' from the menu system or click the green run button on the job toolbar. An 'Execute a job' window will open. Click the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show the progress of the job as it runs. After a few seconds the job should finish successfully. (A command-line alternative is sketched below.)

If any errors occurred, the failed job entry will be highlighted in red and you can use the 'Logging' tab to view error messages.
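
As an alternative to launching the job from the PDI GUI, PDI also ships with the Kitchen command-line job runner. A minimal sketch, assuming the job was saved as load_hive.kjb in the current directory and that you run the command from the PDI installation directory (use kitchen.bat on Windows):

# run the job and log at the Basic level
./kitchen.sh -file=load_hive.kjb -level=Basic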



Check Hive

  1. Open the Hive Shell: Open the Hive shell so you can query the table you created by entering 'hive' at the command line.
  2. Query Hive for Data: Verify the data has been loaded into Hive by querying the weblogs table (a further example query is sketched just after this list).
    select * from weblogs limit 10;
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
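
Beyond the simple limit query above, the parsed columns support ordinary HiveQL aggregation. For example (an illustration, not part of the original guide), the following counts requests per HTTP status code:

    -- count of requests grouped by HTTP status code
    select http_status_code, count(*) as hits
    from weblogs
    group by http_status_code;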

Summary

During this guide you learned how to load data into a Hive table using a PDI job. PDI jobs can be used to put files into Hive from many different sources.

Other guides in this series cover how to transform data in Hive, get data out of Hive, and report on data within Hive.
