Open-source Java ETL (Extract, Transform, Load) tools
ETL is short for Extract-Transform-Load and describes the process of extracting data from a source, transforming it, and loading it into a target. The term is most commonly used in the context of data warehousing, but its targets are not limited to data warehouses.
In practice the process is implemented as either ETL or ELT (Extract-Load-Transform), and the two are often mixed. In general, the larger the data volume, the more complex the transformation logic, and the more computing power the target database has, the more attractive ELT becomes, because it can exploit the target database's parallel processing capability.
An ETL (or ELT) flow can be developed in any programming language, but because ETL is a very complex process and hand-written programs are hard to manage, more and more enterprises adopt tools to assist ETL development and use the tools' built-in metadata facilities to store source-to-target mappings and transformation rules.
These tools also provide strong connectivity to sources and targets, so developers can build jobs without having to master each of the different platforms and data structures involved.
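To see why hand-written ETL quickly becomes hard to manage, here is a minimal hand-rolled ETL step in plain Java over JDBC. This is only a sketch: the connection URLs, credentials, table names (src_orders, dst_orders) and the amount-to-cents transformation are hypothetical.

import java.sql.*;

public class HandWrittenEtl {
    public static void main(String[] args) throws SQLException {
        // Extract from the source, transform each row, load into the target.
        // URLs, credentials and table names below are hypothetical.
        try (Connection src = DriverManager.getConnection("jdbc:mysql://source-host/db", "user", "pw");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://target-host/db", "user", "pw");
             Statement extract = src.createStatement();
             ResultSet rs = extract.executeQuery("SELECT id, amount FROM src_orders");
             PreparedStatement load = dst.prepareStatement(
                     "INSERT INTO dst_orders (id, amount_cents) VALUES (?, ?)")) {
            while (rs.next()) {
                load.setLong(1, rs.getLong("id"));
                // Transform: convert a decimal amount into integer cents.
                load.setLong(2, Math.round(rs.getDouble("amount") * 100));
                load.addBatch();
            }
            load.executeBatch(); // Load the rows in one batch.
        }
        // Every mapping and rule lives in code like this; ETL tools move it into metadata.
    }
}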
Below are some open-source Java ETL tools:
- Octopus
- Octopus is a simple Java-based Extract, Transform, and Load (ETL) tool. It can connect to any JDBC data source and perform transformations defined in an XML file. A loadjob generator is provided to generate Octopus loadjob skeletons from an existing database. Many different types of databases can be mixed (MSSQL, Oracle, DB2, QED, JDBC-ODBC with Excel and Access, MySQL, CSV files, XML files, ...). Three special JDBC drivers come with Octopus to support JDBC access to CSV files (CSV-JDBC), MS-SQL (FreeTDS) and XML. Octopus supports Ant and JUnit to create a database/tables and extract/load data during a build or test process.
- Xineo
- Xineo XIL (XML Import Language) defines an XML language for transforming various record-based data sources into XML documents, and provides a fully functional XIL processing implementation. The implementation has built-in support for relational (via JDBC) and structured text (such as CSV) sources, and is extensible through its public API, allowing dynamic integration of new data source implementations. It also abstracts over the output format: the Xineo implementation can generate output documents as streams or as DOM documents. The built-in data sources are relational data via JDBC and structured text via regular expressions.
- CloverETL
- CloverETL's features include: all characters represented internally as 16-bit Unicode; conversion from the most common character sets (ASCII, UTF-8, ISO-8859-1, ISO-8859-2, etc.); support for delimited or fixed-length data records; record fields handled internally as variable-length data structures; default values for fields; NULL-value handling; cooperation with any database that has a JDBC driver; transformation performed by independent components, each running as an independent thread (the framework implements so-called pipeline parallelism); and metadata describing the structure of data files (records), as well as transformation graphs, readable from XML.
- BabelDoc
- BabelDoc is a Java framework for processing documents in linear stages. It tracks documents and can reintroduce them back into the pipeline; it is monitorable and configurable through a number of interfaces; it can run standalone, in server processes, or in application servers; and it can be reconfigured dynamically via text files and database tables.
- Joost
- Joost is a Java implementation of the Streaming Transformations for XML (STX) language. STX is a one-pass transformation language for XML documents, intended as a high-speed, low-memory-consumption alternative to XSLT. Since it does not require the construction of an in-memory tree, it is suitable for resource-constrained scenarios.
- CB2XML
- CB2XML (CopyBook to XML) is a COBOL copybook to XML converter written in Java, based on the SableCC parser generator. The project includes utilities to convert an XML instance file into its COBOL copybook equivalent string buffer and vice versa. You can find additional information about supporting Jurassic systems here.
- mec-eagle
- Java XML XSL B2B integration software with a Swing-based GUI: an EDI-to-XML, XML-to-XML and XML-to-EDI converter with a client-server architecture. All major EDI standards are supported: EDIFACT, ANSI X.12, SAP IDOC, XCBL, RosettaNet, Biztalk. Included communication protocols: SMTP, FTP, HTTP(S), PGP/MIME.
- Transmorpher
- Transmorpher is an environment for processing generic transformations on XML documents. It aims at complementing XSLT in order to:
- easily describe simple transformations (removing elements, replacing tag and attribute names, concatenating documents, ...);
- allow regular-expression transformations on content;
- compose transformations by linking their (possibly multiple) outputs to inputs;
- iterate transformations, sometimes until saturation (a closure operation);
- integrate external transformations.
- XPipe
- XPipe is an approach to manageable, scalable, robust XML processing based on the assembly-line principle common in many areas of manufacturing. XPipe is an attempt to take what was great about the original Unix pipe idea and apply it to structured information streams based on XML.
- DataSift
- DataSift is a powerful Java data validation and transformation framework aimed at enterprise software development. It provides developers with an extensible architecture they can fully adapt: almost every feature can be configured and extended in some way.
- Xephyrus Flume
- Flume is a component pipeline engine. It allows you to chain together multiple workers into a pipeline mechanism. The intention of Flume is that each of the workers would provide access to a different type of technology. For example, a pipeline could consist of a Jython script worker followed by a BeanShell script worker followed by an XSLT worker.
- Smallx
- Smallx supports streaming of XML infosets to allow processing of very large documents (500MB-1GB). Processing is specified in an XML syntax that describes an XML pipeline--which is a sequence of components that consume and produce infosets. This allows chaining of XML component standards like XSLT. Also, there is a full component API that allows developers to easily write their own components.
- Nux
- Nux is a toolkit making efficient and powerful XML processing easy. It is geared towards embedded use in high-throughput XML messaging middleware such as large-scale Peer-to-Peer infrastructures, message queues, publish-subscribe and matchmaking systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Nux reliably processes whatever data fits into main memory (even, say, 250 MB messages), but it is not an XML database system, and does not attempt to be one. Nux integrates best-of-breed components, containing extensions of the XOM, Saxon and Lucene open-source libraries.
- KETL
- KETL is an extract, transform, and load (ETL) tool designed by Kinetic Networks. KETL includes job scheduling and alerting capabilities. The KETL Server is a Java-based data integration platform consisting of a multi-threaded server that manages various job executors. Jobs are defined using an XML definition language.
- Kettle
- K.E.T.T.L.E (Kettle ETTL Environment) is a metadata-driven ETTL tool (ETTL: Extraction, Transformation, Transportation & Loading). No code has to be written to perform complex data transformations. "Environment" means that it is possible to create plugins to do custom transformations or access proprietary data sources. Kettle supports most databases on the market and has native support for slowly changing dimensions on most platforms. The complete Kettle source code is over 160,000 lines of Java code.
- Netflux
- A metadata-based tool that allows for easier data manipulation. Spring-based configuration, BSF-based scripting support, and pluggable JDBC-based data sources and sinks. A server and a GUI are planned.
- OpenDigger
- OpenDigger is a Java-based compiler for the xETL language, a language designed specifically to read, manipulate and write data in any format and database. With OpenDigger/xETL you can build Extract-Transform-Load (ETL) programs from and to virtually any database platform.
- ServingXML
- ServingXML is a markup language for expressing XML pipelines, and an extensible Java framework for defining the elements of the language. It defines a vocabulary for expressing flat-XML, XML-flat, flat-flat, and XML-XML transformations in pipelines. ServingXML supports reading content as XML files, flat files, SQL queries or dynamically generated SAX events, transforming it with XSLT stylesheets and custom SAX filters, and writing it as XML, HTML, PDF or mail attachments. ServingXML is suited for converting flat file or database records to XML, with its support for namespaces, variant record types, multi-valued fields, segments and repeating groups, hierarchical grouping of records, and record-by-record validation with XML Schema.
- Talend
- Talend Open Studio is a full-featured open-source data integration (ETL) solution. Its graphical user interface, based on the Eclipse Rich Client Platform (RCP), includes numerous components for business process modelling as well as technical implementations of extraction, transformation and mapping of data flows. Data-related scripts and the underlying programs are generated as Perl or Java code.
- Scriptella
- Scriptella is an ETL and script execution tool. Its primary focus is simplicity. It doesn't require the user to learn another complex XML-based language to use it, but allows the use of SQL or another scripting language suitable for the data source to perform required transformations.
- ETL Integrator
- ETL Integrator (a highly unimaginative name) consists of three components: an ETL service engine, a JBI-compliant service engine implementation that can be deployed in a JBI container; an ETL Editor, a design-time NetBeans module that lets users design ETL processes graphically; and an ETL Project, a design-time NetBeans module that lets users package ETL-related artifacts in a jar file that can be deployed onto the ETL service engine.
- Jitterbit
- Jitterbit can act as a powerful ETL tool. Operations are defined, configured, and monitored with a GUI. The GUI can create document definitions, from simple flat-file structures to complex hierarchical file structures. Jitterbit includes a drag-and-drop mapping tool to transform data between your various system interfaces. Furthermore, one can set schedules, create success and failure events, and track the results of integration operations. Jitterbit supports web services, XML files, HTTP/S, FTP, ODBC, flat and hierarchical file structures, and file shares.
- Apatar
- Apatar integrates databases, files and applications. Apatar includes a visual job designer for defining mapping, joins, filtering, data validation and schedules. Connectors include MySQL, PostgreSQL, Oracle, MS SQL, Sybase, FTP, HTTP, SalesForce.com, SugarCRM, Compiere ERP, Goldmine CRM, XML, flat files, Webdav, Buzzsaw, LDAP, Amazon and Flickr. No coding is required to accomplish even a complex integration. All metadata is stored in XML.
- Spring Batch
- Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications. Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that enable extremely high-volume, high-performance batch jobs through optimization and partitioning techniques. (A minimal job-configuration sketch in Java appears after this list.)
- JasperETL
- JasperETL was developed through a technology partnership with Talend. It includes Eclipse-based user interfaces for process design, transformation mapping, debugging, and process viewing. The project includes over 30 connectors, covering flat files, XML, databases, email, FTP and more, and provides wizards to help configure the processing of complex file formats, including positional, delimited, CSV, RegExp, XML, and LDIF formatted data.
- Pentaho Data Integration
- Pentaho Data Integration provides a declarative approach to ETL, where you specify what to do rather than how to do it. It includes a transformation library with over 70 mapping objects and data warehousing capability for slowly changing and junk dimensions. It supports multiple data sources, including over 25 open-source and proprietary database platforms, flat files, Excel documents, and more. The architecture is extensible through a plug-in mechanism.
- Mural
- Mural is an open source community with the purpose of developing an ecosystem of products that solve problems in Master Data Management (MDM). Projects include: Master Index Studio, which supports the creation of a master index through matching, de-duplication, merging, and cleansing; Data Integrator, which provides extract, transform, load capability across a wide variety of data formats; Data Quality, which features matching, standardization, profiling, and cleansing capabilities; Data Mashup, which provides data mashup capability; and Data Migrator, which supports the migration of database objects across database instances.
- Smooks
- Smooks provides a wide range of Data Transforms. Supports many different Source and Result types - XML/CSV/EDI/Java/JSON to XML/CSV/EDI/Java/JSON. It supports binding of Java Object Models from any data source. It is designed to process huge messages in the GByte range.
- Data Pipeline
- Data Pipeline provides data conversion, data processing, and data transformation. The toolkit has readers and writers for common file formats (CSV, Excel, Fixed-width, JDBC) along with decorators that can be chained together to process and transform data (filter, remove duplicates, lookups, validation).
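To give a concrete feel for one of the frameworks above, here is the minimal Spring Batch job definition promised in the Spring Batch entry. It is a sketch only, written against the Java-configuration API of Spring Batch 3/4 (JobBuilderFactory/StepBuilderFactory); the job name, step name, and item values are hypothetical.

import java.util.Arrays;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class UppercaseJobConfig {

    // A chunk-oriented step: read strings, upper-case them, write them out.
    @Bean
    public Step uppercaseStep(StepBuilderFactory steps) {
        return steps.get("uppercaseStep")
                .<String, String>chunk(10) // commit interval of 10 items
                .reader(new ListItemReader<>(Arrays.asList("extract", "transform", "load")))
                .processor(item -> item.toUpperCase())
                .writer(items -> items.forEach(System.out::println))
                .build();
    }

    // The job is restartable by default and records execution statistics
    // in the job repository that @EnableBatchProcessing provides.
    @Bean
    public Job uppercaseJob(JobBuilderFactory jobs, Step uppercaseStep) {
        return jobs.get("uppercaseJob").start(uppercaseStep).build();
    }
}

Reader, processor, and writer are independent, reusable pieces; the framework supplies the restart, skip, and statistics plumbing that the hand-written JDBC sketch earlier in this article lacks.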
Ubuntu 9.04 Installation and Configuration
Edit the interfaces file:
$ sudo vi /etc/network/interfaces
Configure eth0 as follows:
auto eth0
iface eth0 inet static
address 192.168.1.123
netmask 255.255.255.0
gateway 192.168.1.1
After saving and exiting, restart networking to apply the new configuration:
$ sudo /etc/init.d/networking restart
You can also restart just this interface with the following commands; the advantage is that other network interfaces are unaffected:
$ sudo ifdown eth0
$ sudo ifup eth0
To change the IP address only temporarily, there is no need to edit interfaces; ifconfig alone is enough. After a reboot, the system reverts to the configuration in interfaces:
$ sudo ifconfig eth0 192.168.1.111 netmask 255.255.255.0
Set the DNS servers:
$ sudo vi /etc/resolv.conf
nameserver 61.235.70.252
nameserver 211.98.4.1
$ sudo apt-get update
Chinese environment and input method: once the system is connected to the Internet, it automatically prompts to install the Chinese language environment and Chinese input method.
Installing a telnet server:
1. sudo apt-get install xinetd telnetd
2. After a successful installation the system shows a corresponding prompt (apparently only on 7.10; not seen on 6.10).
sudo vi /etc/inetd.conf and add the following line:
telnet stream tcp nowait telnetd /usr/sbin/tcpd /usr/sbin/in.telnetd
3. sudo vi /etc/xinetd.conf and add the following content:
# Simple configuration file for xinetd
#
# Some defaults, and include /etc/xinetd.d/
defaults
{
# Please note that you need a log_type line to be able to use log_on_success
# and log_on_failure. The default is the following :
# log_type = SYSLOG daemon info
instances = 60
log_type = SYSLOG authpriv
log_on_success = HOST PID
log_on_failure = HOST
cps = 25 30
}
includedir /etc/xinetd.d
4. sudo vi /etc/xinetd.d/telnet and add the following content:
# default: on
# description: The telnet server serves telnet sessions; it uses
# unencrypted username/password pairs for authentication.
service telnet
{
disable = no
flags = REUSE
socket_type = stream
wait = no
user = root
server = /usr/sbin/in.telnetd
log_on_failure += USERID
}
5. Reboot the machine, or restart the service: sudo /etc/init.d/xinetd restart
The /etc/environment file looks like this:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games"
LANG="zh_CN.GBK"
LANGUAGE="zh_CN:zh:en_US:en"
LC_CTYPE=zh_CN.GBK
LC_ALL=zh_CN.GBK
GST_ID3_TAG_ENCODING=GBK
ID3_TAG_ENCODING=GBK
LANG="zh_CN.GBK"
LANGUAGE="zh_CN:zh:en_US:en"
LC_CTYPE=zh_CN.GBK
LC_ALL=zh_CN.GBK
GST_ID3_TAG_ENCODING=GBK
ID3_TAG_ENCODING=GBK
Edit /etc/profile (sudo vi /etc/profile) and append at the end:
export LANG=en_US
Enter:
sudo apt-get install vsftpd
If you have not switched to an online mirror, you may be prompted to use the installation CD; insert it.
sudo cp /etc/vsftpd.conf /etc/vsftpd.conf.old
Then edit the configuration:
sudo vi /etc/vsftpd.conf
# disable anonymous users
#anonymous_enable=YES
# allow local users to log in
local_enable=YES
# allow write operations
write_enable=YES
# no need to show per-directory messages
#dirmessage_enable=YES
# add a banner greeting
ftpd_banner=Hello~~
# maximum number of concurrent clients
max_clients=100
Start the FTP service:
$ sudo service vsftpd start
Install zip:
$ sudo apt-get install zip unzip
There are three ways to set JAVA_HOME:
1. Temporary setting
export JAVA_HOME=/home/liupinghua/jdk1.5.0_18
2. Per-user global setting
Open ~/.bashrc and add the line:
export JAVA_HOME=/home/liupinghua/jdk1.5.0_18
Log out and back in.
The variable then takes effect every time this user logs in to Ubuntu.
3. Global setting for all users
$ sudo vi /etc/profile
Add to it:
export JAVA_HOME=/home/liupinghua/jdk1.5.0_18
Log out and back in.
The variable then takes effect no matter which user logs in.
Start Tomcat:
$ sudo ./startup.sh
Setting up Tomcat to start automatically at boot
1) Use the jsvc tool shipped with Tomcat to generate a script that starts Tomcat automatically:
cd tomcat/bin
tar -zxvf jsvc.tar.gz
cd jsvc-src
chmod +x configure
./configure --with-java=$JAVA_HOME
make
cd native
gedit Tomcat5.sh
Modify the parameters in the following file as needed:
# Adapt the following lines to your configuration
JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
CATALINA_HOME=/home/user/tomcat/
DAEMON_HOME=/home/user/tomcat/
TOMCAT_USER=user
# Run Tomcat as a dedicated non-root user.
# For security: if Tomcat runs as root, JSPs execute with far too many
# privileges, which opens the door to injection attacks.
# for multi instances adapt those lines.
TMP_DIR=/var/tmp
PID_FILE=/var/run/jsvc.pid
CATALINA_BASE=/home/user/tomcat/
#CATALINA_OPTS="-Djava.library.path=/home/jfclere/jakarta-tomcat-connectors/jni/native/.libs"
CLASSPATH=\
$JAVA_HOME/lib/tools.jar:\
$CATALINA_HOME/bin/commons-daemon.jar:\
$CATALINA_HOME/bin/bootstrap.jar

case "$1" in
start)
    #
    # Start Tomcat
    #
    $DAEMON_HOME/bin/jsvc-src/jsvc \
        -user $TOMCAT_USER \
        -home $JAVA_HOME \
        -Dcatalina.home=$CATALINA_HOME \
        -Dcatalina.base=$CATALINA_BASE \
        -Djava.io.tmpdir=$TMP_DIR \
        -wait 10 \
        -outfile $CATALINA_HOME/logs/catalina.out \
        -errfile '&1' \
        $CATALINA_OPTS \
        -cp $CLASSPATH \
        org.apache.catalina.startup.Bootstrap
    #
    # To get a verbose JVM
    #-verbose
    # To get a debug of jsvc.
    #-debug
    exit $?
    ;;
stop)
    #
    # Stop Tomcat
    #
    $DAEMON_HOME/bin/jsvc-src/jsvc \
        -stop \
        org.apache.catalina.startup.Bootstrap
    exit $?
    ;;
*)
    echo "Usage tomcat.sh start/stop"
    exit 1
    ;;
esac
Copy the modified file into /etc/init.d/:
cp Tomcat5.sh /etc/init.d/tomcat.sh
Make it executable:
sudo chmod +x tomcat.sh
Tomcat will then start automatically with the system once the script is registered in the boot sequence (on Ubuntu: sudo update-rc.d tomcat.sh defaults).
Test:
sudo /etc/init.d/tomcat.sh start
sudo /etc/init.d/tomcat.sh stop
Partitioning a disk
Under Slackware there are two partitioning tools, fdisk and cfdisk.
For example, suppose we already have one disk and now add a second disk to the system.
By the device naming convention, the new disk will be hdb. Partition it with:
fdisk /dev/hdb
You can also partition with cfdisk:
cfdisk /dev/hdb
Formatting the disk
Format as ext3:
mkfs.ext3 /dev/hdb1
Format as ReiserFS:
mkfs.reiserfs /dev/hdb1
Mounting the disk automatically at boot
For example, to mount the /dev/hdb1 partition at /mnt/hd (create the mount point first if it does not exist),
edit /etc/fstab with vi and add the following line:
/dev/hdb1 /mnt/hd reiserfs defaults 1 1
The global LANG variable can be set in the /etc/environment file:
$ cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11: /usr/games" LANG="zh_CN.UTF-8" LANGUAGE="zh_CN:zh:en_US:en" |
But when we switch to the root account with sudo -i, LANG becomes C again:
# locale
LANG=C
LANGUAGE=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
This is caused by the settings in root's ~/.profile:
~# cat .profile
# Installed by Debian Installer:
# no localization for root because zh_CN.UTF-8
# cannot be properly displayed at the Linux console
LANG=C
LANGUAGE=C
Because zh_CN.UTF-8 cannot be displayed properly in some situations (such as the Linux console), LANG is forced to C for root.