Exporting data from HBase (via Hive) to MySQL

Tags: hbase hive data | Published: 2013-04-25 16:04 | Author: zreodown
Source: http://blog.csdn.net

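The previous article, 《用sqoop进行mysql和hdfs系统间的数据互导》 (Using Sqoop to move data between MySQL and HDFS), showed that Sqoop can transfer data in both directions between an RDBMS and HDFS, and that it can also import from MySQL into HBase. Exporting from HBase to MySQL, however, is supported only indirectly: first export the HBase data either to flat files on HDFS or into Hive, then export from there to MySQL. This article covers the export from Hive to MySQL.
Exporting data from Hive to MySQL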

1. Create the MySQL table

mysql> create table award (rowkey varchar(255), productid int, matchid varchar(255), rank varchar(255), tourneyid varchar(255), userid bigint, gameid int, gold int, loginid varchar(255), nick varchar(255), plat varchar(255));
Query OK, 0 rows affected (0.01 sec)

2. Try mapping HBase as a Hive external table and exporting that to MySQL

hive> CREATE EXTERNAL TABLE hive_award(key string, productid int,matchid string, rank string, tourneyid string, userid bigint,gameid int,gold int,loginid string,nick string,plat string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:MPID,info:MatchID,info:Rank,info:TourneyID,info:UserId,info:gameID,info:gold,info:loginId,info:nickName,info:platform") TBLPROPERTIES("hbase.table.name" = "award");
hive> desc hive_award;
key string from deserializer
productid int from deserializer
matchid string from deserializer
rank string from deserializer
tourneyid string from deserializer
userid bigint from deserializer
gameid int from deserializer
gold int from deserializer
loginid string from deserializer
nick string from deserializer
plat string from deserializer
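
The hbase.columns.mapping string pairs Hive columns with HBase columns by position, which is easy to misread in a one-line statement. As a minimal sketch, the Python below zips the two lists copied from the CREATE EXTERNAL TABLE statement above; the helper name pair_columns is hypothetical and used only for illustration.

```python
# Column lists copied from the CREATE EXTERNAL TABLE statement above.
hive_columns = ["key", "productid", "matchid", "rank", "tourneyid", "userid",
                "gameid", "gold", "loginid", "nick", "plat"]
hbase_mapping = (":key,info:MPID,info:MatchID,info:Rank,info:TourneyID,"
                 "info:UserId,info:gameID,info:gold,info:loginId,"
                 "info:nickName,info:platform")


def pair_columns(hive_cols, mapping):
    """Pair each Hive column with its HBase column, strictly by position.

    Hypothetical helper for illustration; not part of Hive or HBase.
    """
    hbase_cols = mapping.split(",")
    if len(hbase_cols) != len(hive_cols):
        raise ValueError("mapping must have one entry per Hive column")
    return dict(zip(hive_cols, hbase_cols))


pairs = pair_columns(hive_columns, hbase_mapping)
for hive_col, hbase_col in pairs.items():
    print(f"{hive_col:10} <- {hbase_col}")
```

Printing the pairs makes it easy to confirm, for example, that productid reads from info:MPID and nick from info:nickName.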
[zhouhh@Hadoop46 ~]$ hadoop fs -ls /user/hive/warehouse/
Found 3 items
drwxr-xr-x - zhouhh supergroup 0 2012-07-16 14:08 /user/hive/warehouse/hive_award
drwxr-xr-x - zhouhh supergroup 0 2012-07-16 14:30 /user/hive/warehouse/nnnon
drwxr-xr-x - zhouhh supergroup 0 2012-07-16 13:53 /user/hive/warehouse/test222
[zhouhh@Hadoop46 ~]$ sqoop export --connect jdbc:mysql://Hadoop48/toplists -m 1 --table award --export-dir /user/hive/warehouse/hive_award --input-fields-terminated-by '\0001'
12/07/19 16:13:06 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/07/19 16:13:06 INFO tool.CodeGenTool: Beginning code generation
12/07/19 16:13:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `award` AS t LIMIT 1
12/07/19 16:13:06 INFO orm.CompilationManager: HADOOP_HOME is /home/zhouhh/hadoop-1.0.0/libexec/..
Note: /tmp/sqoop-zhouhh/compile/4366149f0b6dd311c5b622594744fbb0/award.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
12/07/19 16:13:08 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-zhouhh/compile/4366149f0b6dd311c5b622594744fbb0/award.jar
12/07/19 16:13:08 INFO mapreduce.ExportJobBase: Beginning export of award
12/07/19 16:13:09 WARN mapreduce.ExportJobBase: Input path hdfs://Hadoop46:9200/user/hive/warehouse/hive_award contains no files
12/07/19 16:13:11 INFO input.FileInputFormat: Total input paths to process : 0
12/07/19 16:13:11 INFO input.FileInputFormat: Total input paths to process : 0
12/07/19 16:13:13 INFO mapred.JobClient: Running job: job_201207191159_0059
12/07/19 16:13:14 INFO mapred.JobClient: map 0% reduce 0%
12/07/19 16:13:26 INFO mapred.JobClient: Job complete: job_201207191159_0059
12/07/19 16:13:26 INFO mapred.JobClient: Counters: 4
12/07/19 16:13:26 INFO mapred.JobClient: Job Counters
12/07/19 16:13:26 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7993
12/07/19 16:13:26 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/19 16:13:26 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/07/19 16:13:26 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/07/19 16:13:26 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 16.9678 seconds (0 bytes/sec)
12/07/19 16:13:26 INFO mapreduce.ExportJobBase: Exported 0 records.
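Exporting the external table directly does not work: Input path hdfs://Hadoop46:9200/user/hive/warehouse/hive_award contains no files. The table's data lives in HBase, so Hive keeps no data files under the warehouse directory for Sqoop to read.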
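
Note that the job above still ends with "Job complete" even though nothing was exported, so a wrapper script cannot rely on the job status alone. The following is a rough sketch of checking the captured log text instead. The function names are hypothetical, and the patterns are taken from the log lines shown above, so they are an assumption that may not hold for other Sqoop versions.

```python
import re


def exported_record_count(log_text):
    """Return N from a line like 'Exported N records.', or None if absent."""
    match = re.search(r"Exported (\d+) records\.", log_text)
    return int(match.group(1)) if match else None


def input_was_empty(log_text):
    """True if the log carries the 'contains no files' warning."""
    return "contains no files" in log_text


# Two lines copied from the log output above.
sample_log = (
    "12/07/19 16:13:09 WARN mapreduce.ExportJobBase: Input path "
    "hdfs://Hadoop46:9200/user/hive/warehouse/hive_award contains no files\n"
    "12/07/19 16:13:26 INFO mapreduce.ExportJobBase: Exported 0 records.\n"
)

if input_was_empty(sample_log) or exported_record_count(sample_log) == 0:
    print("export produced no rows; check the --export-dir path")
```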

3. Create a Hive table linked to HBase; inserts made through Hive change the data in HBase:

CREATE TABLE hive_award_data(key string,productid int,matchid string,rank string,
tourneyid string,userid bigint,gameid int,
gold int,loginid string,nick string,plat string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:MPID,info:MatchID,info:Rank,info:TourneyID,info:UserId,info:gameID,info:gold,info:loginId,info:nickName,info:platform") TBLPROPERTIES("hbase.table.name" = "award_test");
hive> insert overwrite table hive_award_data select * from hive_award limit 2;
hbase(main):014:0> scan 'award_test'
ROW COLUMN+CELL
 2012-04-27 06:55:00:402713629 column=info:MPID, timestamp=1342754799918, value=5947
 2012-04-27 06:55:00:402713629 column=info:MatchID, timestamp=1342754799918, value=433203828
 2012-04-27 06:55:00:402713629 column=info:Rank, timestamp=1342754799918, value=2
 2012-04-27 06:55:00:402713629 column=info:TourneyID, timestamp=1342754799918, value=4027102
 2012-04-27 06:55:00:402713629 column=info:UserId, timestamp=1342754799918, value=402713629
 2012-04-27 06:55:00:402713629 column=info:gameID, timestamp=1342754799918, value=1001
 2012-04-27 06:55:00:402713629 column=info:loginId, timestamp=1342754799918, value=715878221
 2012-04-27 06:55:00:402713629 column=info:nickName, timestamp=1342754799918, value=xxx
 2012-04-27 06:55:00:402713629 column=info:platform, timestamp=1342754799918, value=ios
 2012-04-27 06:55:00:402713629 column=info:userid, timestamp=1342754445451, value=402713629
 2012-04-27 06:55:00:406788559 column=info:MPID, timestamp=1342754799918, value=778
 2012-04-27 06:55:00:406788559 column=info:MatchID, timestamp=1342754799918, value=433203930
 2012-04-27 06:55:00:406788559 column=info:Rank, timestamp=1342754799918, value=19
 2012-04-27 06:55:00:406788559 column=info:TourneyID, timestamp=1342754799918, value=4017780
 2012-04-27 06:55:00:406788559 column=info:UserId, timestamp=1342754799918, value=406788559
 2012-04-27 06:55:00:406788559 column=info:gameID, timestamp=1342754799918, value=1001
 2012-04-27 06:55:00:406788559 column=info:gold, timestamp=1342754799918, value=1
 2012-04-27 06:55:00:406788559 column=info:loginId, timestamp=1342754799918, value=13835155880
 2012-04-27 06:55:00:406788559 column=info:nickName, timestamp=1342754799918, value=xxx
 2012-04-27 06:55:00:406788559 column=info:platform, timestamp=1342754799918, value=android
2 row(s) in 0.0280 seconds
[zhouhh@Hadoop46 ~]$ sqoop export --connect jdbc:mysql://Hadoop48/toplists -m 1 --table award --export-dir /user/hive/warehouse/hive_award_data --input-fields-terminated-by '\0001'
12/07/20 11:32:01 WARN mapreduce.ExportJobBase: Input path hdfs://Hadoop46:9200/user/hive/warehouse/hive_award_data contains no files

A Hive table linked to HBase still cannot be exported.

4. Create a native Hive table and load the data from the HBase external table into it

hive> CREATE TABLE hive_myaward(key string,productid int,matchid string,rank string,tourneyid string,userid bigint,gameid int,gold int,loginid string,nick string,plat string);
hive> insert overwrite table hive_myaward select * from hive_award limit 2;
hive> select * from hive_myaward;
OK
2012-04-27 06:55:00:402713629 5947 433203828 2 4027102 402713629 1001 NULL 715878221 杀破天A ios
2012-04-27 06:55:00:406788559 778 433203930 19 4017780 406788559 1001 1 13835155880 亲牛牛旦旦 android
Time taken: 2.257 seconds
[zhouhh@Hadoop46 ~]$ sqoop export --connect jdbc:mysql://Hadoop48/toplists -m 1 --table award --export-dir /user/hive/warehouse/hive_myaward --input-fields-terminated-by '\0001'
java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Access denied for user ''@'Hadoop48' to database 'toplists'

This is a permissions problem, so grant the privileges:

mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'Hadoop48';
Query OK, 0 rows affected (0.03 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)

5. Handle the NULL values in the Hive data:

[zhouhh@Hadoop46 ~]$ sqoop export --connect jdbc:mysql://Hadoop48/toplists -m 1 --table award --export-dir /user/hive/warehouse/hive_myaward --input-fields-terminated-by '\0001'
...
12/07/20 11:49:25 INFO mapred.JobClient: map 0% reduce 0%
12/07/20 11:49:37 INFO mapred.JobClient: Task Id : attempt_201207191159_0227_m_000000_0, Status : FAILED
java.lang.NumberFormatException: For input string: "\N"
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

What is \N?

[zhouhh@Hadoop46 ~]$ hadoop fs -cat /user/hive/warehouse/hive_myaward/000000_0 
2012-04-27 06:55:00:4027136295947433203828240271024027136291001\N715878221杀破天Aios
2012-04-27 06:55:00:4067885597784332039301940177804067885591001113835155880亲牛牛旦旦android
hive> select * from hive_myaward;
OK
2012-04-27 06:55:00:402713629 5947 433203828 2 4027102 402713629 1001 NULL 715878221 杀破天A ios
2012-04-27 06:55:00:406788559 778 433203930 19 4017780 406788559 1001 1 13835155880 亲牛牛旦旦 android
Time taken: 2.257 seconds

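Hive writes NULL as \N, separates fields with \01 (Ctrl-A, which is invisible in the cat output above), and ends each line with \n, so Sqoop has to be told about all three. Note how the backslash must be escaped:
See: https://issues.cloudera.org/browse/SQOOP-188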
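
To make the file layout concrete, here is a minimal Python sketch that parses one line in this format. The sample line is rebuilt from the first row shown above. FIELD_SEP, NULL_TOKEN and parse_hive_line are illustrative names, not part of Hive or Sqoop.

```python
FIELD_SEP = "\x01"   # Ctrl-A, the field delimiter written as \01 above
NULL_TOKEN = "\\N"   # a backslash followed by N, Hive's marker for NULL


def parse_hive_line(line):
    """Split one line of the Hive text file; map the NULL marker to None."""
    fields = line.rstrip("\n").split(FIELD_SEP)
    return [None if field == NULL_TOKEN else field for field in fields]


# First row of hive_myaward, rebuilt with explicit delimiters.
raw = FIELD_SEP.join([
    "2012-04-27 06:55:00:402713629", "5947", "433203828", "2", "4027102",
    "402713629", "1001", NULL_TOKEN, "715878221", "杀破天A", "ios",
]) + "\n"

row = parse_hive_line(raw)
print(row)
```

The eighth field (gold) comes back as None. Without that mapping, a reader that expects an integer there receives the two characters \N, which is what the NumberFormatException above reports.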

[zhouhh@Hadoop46 ~]$ sqoop export --connect jdbc:mysql://Hadoop48/toplists -m 1 --table award --export-dir /user/hive/warehouse/hive_myaward/000000_0 --input-null-string "\\\\N" --input-null-non-string "\\\\N" --input-fields-terminated-by "\\01" --input-lines-terminated-by "\\n"
12/07/20 12:53:56 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/07/20 12:53:56 INFO tool.CodeGenTool: Beginning code generation
12/07/20 12:53:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `award` AS t LIMIT 1
12/07/20 12:53:56 INFO orm.CompilationManager: HADOOP_HOME is /home/zhouhh/hadoop-1.0.0/libexec/..
Note: /tmp/sqoop-zhouhh/compile/4427d3db678bb145c995073e0924dc0b/award.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
12/07/20 12:53:57 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-zhouhh/compile/4427d3db678bb145c995073e0924dc0b/award.java to /home/zhouhh/./award.java
12/07/20 12:53:57 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-zhouhh/compile/4427d3db678bb145c995073e0924dc0b/award.jar
12/07/20 12:53:57 INFO mapreduce.ExportJobBase: Beginning export of award
12/07/20 12:53:58 INFO input.FileInputFormat: Total input paths to process : 1
12/07/20 12:53:58 INFO input.FileInputFormat: Total input paths to process : 1
12/07/20 12:53:58 INFO mapred.JobClient: Running job: job_201207191159_0232
12/07/20 12:53:59 INFO mapred.JobClient: map 0% reduce 0%
12/07/20 12:54:12 INFO mapred.JobClient: map 100% reduce 0%
12/07/20 12:54:17 INFO mapred.JobClient: Job complete: job_201207191159_0232
12/07/20 12:54:17 INFO mapred.JobClient: Counters: 18
12/07/20 12:54:17 INFO mapred.JobClient: Job Counters
12/07/20 12:54:17 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12114
12/07/20 12:54:17 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/20 12:54:17 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/07/20 12:54:17 INFO mapred.JobClient: Rack-local map tasks=1
12/07/20 12:54:17 INFO mapred.JobClient: Launched map tasks=1
12/07/20 12:54:17 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/07/20 12:54:17 INFO mapred.JobClient: File Output Format Counters
12/07/20 12:54:17 INFO mapred.JobClient: Bytes Written=0
12/07/20 12:54:17 INFO mapred.JobClient: FileSystemCounters
12/07/20 12:54:17 INFO mapred.JobClient: HDFS_BYTES_READ=335
12/07/20 12:54:17 INFO mapred.JobClient: FILE_BYTES_WRITTEN=30172
12/07/20 12:54:17 INFO mapred.JobClient: File Input Format Counters
12/07/20 12:54:17 INFO mapred.JobClient: Bytes Read=0
12/07/20 12:54:17 INFO mapred.JobClient: Map-Reduce Framework
12/07/20 12:54:17 INFO mapred.JobClient: Map input records=2
12/07/20 12:54:17 INFO mapred.JobClient: Physical memory (bytes) snapshot=78696448
12/07/20 12:54:17 INFO mapred.JobClient: Spilled Records=0
12/07/20 12:54:17 INFO mapred.JobClient: CPU time spent (ms)=390
12/07/20 12:54:17 INFO mapred.JobClient: Total committed heap usage (bytes)=56623104
12/07/20 12:54:17 INFO mapred.JobClient: Virtual memory (bytes) snapshot=891781120
12/07/20 12:54:17 INFO mapred.JobClient: Map output records=2
12/07/20 12:54:17 INFO mapred.JobClient: SPLIT_RAW_BYTES=123
12/07/20 12:54:17 INFO mapreduce.ExportJobBase: Transferred 335 bytes in 19.6631 seconds (17.037 bytes/sec)
12/07/20 12:54:17 INFO mapreduce.ExportJobBase: Exported 2 records.
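
The doubled backslashes in the command above exist only because bash consumes one level of escaping inside double quotes before Sqoop sees the arguments. One way to sidestep that layer, sketched here under the assumption that the export is launched from Python, is to pass the arguments as a list. The function name build_sqoop_export_args is hypothetical. The flags and values are the ones used in the command above, written as raw strings so that they reach Sqoop exactly as bash would have delivered them.

```python
import shlex


def build_sqoop_export_args(connect, table, export_dir, mappers=1):
    """Build the argument list for the sqoop export shown above.

    Raw strings hold what Sqoop itself receives: \\N (two backslashes
    and N), \01 and \n, each spelled with literal backslashes.
    """
    return [
        "sqoop", "export",
        "--connect", connect,
        "-m", str(mappers),
        "--table", table,
        "--export-dir", export_dir,
        "--input-null-string", r"\\N",
        "--input-null-non-string", r"\\N",
        "--input-fields-terminated-by", r"\01",
        "--input-lines-terminated-by", r"\n",
    ]


args = build_sqoop_export_args(
    "jdbc:mysql://Hadoop48/toplists",
    "award",
    "/user/hive/warehouse/hive_myaward/000000_0",
)
# Show an equivalent, safely quoted shell command (Python 3.8+).
print(shlex.join(args))
```

On a machine with Sqoop installed, the list could be handed to subprocess.run(args, check=True). Since no shell is involved, no further quoting is needed.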

The export to MySQL succeeded:

mysql> use toplists;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from award;
+-------------------------------+-----------+-----------+------+-----------+-----------+--------+------+-------------+-------+---------+
| rowkey | productid | matchid | rank | tourneyid | userid | gameid | gold | loginid | nick | plat |
+-------------------------------+-----------+-----------+------+-----------+-----------+--------+------+-------------+-------+---------+
| 2012-04-27 06:55:00:402713629 | 5947 | 433203828 | 2 | 4027102 | 402713629 | 1001 | NULL | 715878221 | ???A | ios |
| 2012-04-27 06:55:00:406788559 | 778 | 433203930 | 19 | 4017780 | 406788559 | 1001 | 1 | 13835155880 | ????? | android |
+-------------------------------+-----------+-----------+------+-----------+-----------+--------+------+-------------+-------+---------+
2 rows in set (0.00 sec)

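The data is now in MySQL, but the Chinese nicknames arrived garbled (shown as ??? above).
That problem is picked up in the follow-up article 《Hive导出到Mysql中中文乱码的问题》 (Garbled Chinese text when exporting from Hive to MySQL).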

