Machine Learning on Massive Data at Twitter
Key techniques: Hadoop, Pig, stochastic gradient descent, online learning, ensembles, logistic regression
1 TWITTER’S ANALYTICS STACK
Twitter's analytics stack is built on a Hadoop cluster; data reaches HDFS through both real-time and batch ingestion.
Protocol Buffers: http://code.google.com/p/protobuf/
Apache Thrift: http://thrift.apache.org/
Elephant Bird: http://github.com/kevinweil/elephant-bird
Besides MapReduce jobs written directly in Java, most Twitter analytics jobs are expressed in Pig.
Production jobs are scheduled by the Oink workflow manager, which also handles dependency management between jobs.
2 PIG EXTENSIONS
Pig extensions built on top of the Twitter analytics stack give the platform machine learning capabilities. Machine learning at Twitter must satisfy two requirements:
(1) Scalability to massive datasets
(2) Seamless integration of the learning tools into production workflows
The solution:
Feature extraction is implemented as UDFs
The inner loop of a single learner is encapsulated in a Pig storage function
Prediction applies a trained model, again through UDFs
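The notes do not include the UDF code itself. As a rough illustration (class and method names here are hypothetical, not Twitter's actual code), the kind of feature extraction one would wrap in a Pig EvalFunc might look like:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: bag-of-words feature extraction of the kind that
// would be wrapped in a Pig EvalFunc so Pig scripts can call it on tweets.
public class FeatureExtractor {
    // Map raw tweet text to a sparse feature vector (term -> count),
    // matching the (label: int, features: map[]) schema used later.
    public static Map<String, Double> extract(String text) {
        Map<String, Double> features = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            features.merge(token, 1.0, Double::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Double> f = extract("good game good fun");
        System.out.println(f.get("good")); // 2.0
        System.out.println(f.size());      // 3
    }
}
```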
Two widely used machine learning packages were considered. Vowpal Wabbit is a fast online learner implemented in C++, which makes it hard to integrate into Twitter's JVM-based analytics platform. Mahout is a machine learning package implemented in Java, so some of its components can be integrated into the Pig extensions. Twitter's open-source Elephant Bird package provides Pig bindings for Mahout's data representations.
Core Library
The machine learning library has two parts: a core Java library and a lightweight Pig wrapper layer.
Its functionality overlaps with common machine learning packages such as Weka, Mallet, and Mahout. The library distinguishes two kinds of learners: batch trainers, exposed through a train interface, and online learners, exposed through an update interface. The core Java library includes common classifiers such as logistic regression and decision trees.
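A minimal sketch of how the two learner kinds might be expressed (the signatures are assumptions for illustration; the source does not show the library's real API):

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the split between batch trainers (train sees the
// whole dataset at once) and online learners (update sees one instance at
// a time). The exact signatures are assumptions, not the real library API.
public class LearnerInterfaces {
    static final class Instance {
        final int label;                    // +1 or -1
        final Map<String, Double> features; // sparse feature vector
        Instance(int label, Map<String, Double> features) {
            this.label = label;
            this.features = features;
        }
    }
    interface Classifier {
        double classify(Map<String, Double> features); // sign gives the class
    }
    interface BatchTrainer {
        Classifier train(List<Instance> data); // needs all instances in memory
    }
    interface OnlineLearner {
        void update(Instance instance);        // instance is discarded afterwards
        Classifier getClassifier();
    }

    // Toy online learner for illustration: predicts the majority label seen.
    static final class MajorityLearner implements OnlineLearner {
        private int sum = 0;
        public void update(Instance inst) { sum += inst.label; }
        public Classifier getClassifier() {
            final int s = sum;
            return features -> s >= 0 ? 1.0 : -1.0;
        }
    }

    public static void main(String[] args) {
        OnlineLearner learner = new MajorityLearner();
        learner.update(new Instance(-1, Map.of()));
        learner.update(new Instance(-1, Map.of()));
        learner.update(new Instance(1, Map.of()));
        System.out.println(learner.getClassifier().classify(Map.of())); // -1.0
    }
}
```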
Training Models
A Java feature vector can be represented as a Pig map:
(label: int, features: map[])
Alternatively, training instances can be represented in SVMLight format.
Our solution is to embed the learner inside a Pig storage function. Ordinarily, a storage function receives output records, serializes them, and writes the resulting representation to disk; here it feeds each record to the learner instead. Because one storage function instance runs per reducer, the number of reducers controls the number of models trained.
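A standalone sketch of this pattern (hypothetical names; the real version extends Pig's StoreFunc and depends on the Pig runtime): the record-consuming method performs one online-learning step per record, and closing the store writes out the trained model rather than the data. A simple perceptron stands in for the learner here for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a "learner inside a storage function". putNext
// receives each output record and, instead of serializing it, performs one
// online-learning step; when the store is closed, the model -- not the
// data -- is written out.
public class LearnerStoreFunc {
    private final Map<String, Double> weights = new HashMap<>();

    // Analogue of StoreFunc.putNext(Tuple): consume one (label, features) record.
    public void putNext(int label, Map<String, Double> features) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : features.entrySet())
            score += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        if (label * score <= 0)          // perceptron: update on mistakes only
            for (Map.Entry<String, Double> e : features.entrySet())
                weights.merge(e.getKey(), label * e.getValue(), Double::sum);
    }

    // Analogue of closing the store: serialize the model instead of records.
    public String modelAsText() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : new TreeMap<>(weights).entrySet())
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        LearnerStoreFunc f = new LearnerStoreFunc();
        f.putNext(1, Map.of("good", 1.0));
        f.putNext(-1, Map.of("bad", 1.0));
        System.out.print(f.modelAsText()); // bad -1.0, good 1.0
    }
}
```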
Training a classifier in Pig:
training = load 'training.txt' using SVMLightStorage() as (target: int, features: map[]);
store training into 'model/' using FeaturesLRClassifierBuilder();
The Pig storage function FeaturesLRClassifierBuilder encapsulates the logistic regression trainer.
Twitter's machine learning algorithms fall into two classes: batch and online.
A batch learner needs all of its data in memory, so the Pig storage function must buffer the training instances before training the model. This is a bottleneck, since a Hadoop reduce task is allocated only limited memory. Online learners have no such restriction: training instances are streamed into the learner and then discarded. Parameters for the model must fit in memory, but in practice this is rarely a concern.
Common data manipulations are handled directly in Pig. For example, because an online learner's model depends on the order in which it sees training instances, the data can be shuffled:
training = foreach training generate label, features, RANDOM() as random;
training = order training by random parallel 1;
Splitting the dataset into training and test sets:
data = foreach data generate target, features, RANDOM() as random;
split data into training if random <= 0.9, test if random > 0.9;
Applying Models
Models are deployed through Pig UDFs:
define Classify ClassifyWithLR('model/');
data = load 'test.txt' using SVMLightStorage() as (target: double, features: map[]);
data = foreach data generate target, Classify(features) as prediction;
Ensemble models combine multiple models:
define Classify ClassifyWithEnsemble('model/', 'classifier.LR', 'vote');
3 SCALABLE MACHINE LEARNING
Stochastic Gradient Descent
SGD is the primary way online learning is implemented here; the running example is stochastic gradient descent (SGD) for logistic regression. Plain gradient descent, by contrast, is mainly used in batch learning.
The formulation proceeds in three steps:
(1) Define a linear discriminant function
(2) Model class probabilities with logistic regression
(3) Minimize the regularized logistic regression objective
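In standard notation (labels $y \in \{-1,+1\}$, weight vector $\mathbf{w}$, learning rate $\eta$, regularization parameter $\lambda$), the equations behind these three steps, together with the resulting per-instance SGD update, are:

```latex
% (1) Linear discriminant function
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x}

% (2) Logistic regression: class probability via the sigmoid
P(y = +1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^{\top}\mathbf{x}}}

% (3) Regularized logistic regression objective over n training instances
\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\,\|\mathbf{w}\|^{2}
  + \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,\mathbf{w}^{\top}\mathbf{x}_i}\right)

% Per-instance SGD step, where \sigma(t) = 1 / (1 + e^{-t})
\mathbf{w} \leftarrow \mathbf{w}
  + \eta\left( y_i\,\mathbf{x}_i \bigl(1 - \sigma(y_i\,\mathbf{w}^{\top}\mathbf{x}_i)\bigr)
  - \lambda\,\mathbf{w} \right)
```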
Ensemble Methods
Independent models are trained by using Pig's parallel keyword to control the number of reducers.
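Putting the pieces together, a toy end-to-end sketch (all names and data here are hypothetical): each "reducer" trains one logistic regression model with online SGD on its own random shuffle of the training data, and the resulting models predict by majority vote, as with the 'vote' option above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: k independently trained SGD logistic regression models combined
// by majority vote. Each model differs only in the order it saw the data,
// mirroring the per-reducer shuffling done in the Pig scripts.
public class SgdEnsemble {
    // One SGD pass over the data; labels are +1/-1, eta is the learning rate.
    static double[] trainSgd(List<double[]> xs, List<Integer> ys, double eta, long seed) {
        double[] w = new double[xs.get(0).length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) order.add(i);
        Collections.shuffle(order, new Random(seed)); // per-model data shuffle
        for (int i : order) {
            double[] x = xs.get(i);
            int y = ys.get(i);
            double score = 0.0;
            for (int j = 0; j < w.length; j++) score += w[j] * x[j];
            // Gradient step on the logistic loss log(1 + exp(-y * w.x)).
            double g = eta * y * (1.0 - 1.0 / (1.0 + Math.exp(-y * score)));
            for (int j = 0; j < w.length; j++) w[j] += g * x[j];
        }
        return w;
    }

    // Majority vote over the per-model predictions.
    static int vote(List<double[]> models, double[] x) {
        int sum = 0;
        for (double[] w : models) {
            double score = 0.0;
            for (int j = 0; j < w.length; j++) score += w[j] * x[j];
            sum += score > 0 ? 1 : -1;
        }
        return sum >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Toy linearly separable data: the class is the sign of the first feature.
        List<double[]> xs = new ArrayList<>();
        List<Integer> ys = new ArrayList<>();
        Random rnd = new Random(0);
        for (int i = 0; i < 200; i++) {
            double a = rnd.nextDouble() * 2 - 1;
            xs.add(new double[]{a, rnd.nextDouble()});
            ys.add(a > 0 ? 1 : -1);
        }
        List<double[]> models = new ArrayList<>();
        for (long seed = 1; seed <= 3; seed++)   // "three reducers" -> three models
            models.add(trainSgd(xs, ys, 0.5, seed));
        System.out.println(vote(models, new double[]{1.0, 0.5}));
        System.out.println(vote(models, new double[]{-1.0, 0.5}));
    }
}
```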
4 SENTIMENT ANALYSIS APPLICATION
Method: a logistic regression classifier learned with online stochastic gradient descent.
Pig script for training binary sentiment polarity classifiers:
status = load '/tables/statuses/$DATE' using TweetLoader() as (id: long, uid: long, text: chararray);
status = foreach status generate text, RANDOM() as random;
status = filter status by IdentifyLanguage(text) == 'en';
-- Positive examples
positive = filter status by ContainsPositiveEmoticon(text) and not ContainsNegativeEmoticon(text) and length(text) > 20;
positive = foreach positive generate 1 as label, RemovePositiveEmoticons(text) as text, random;
positive = order positive by random; -- Randomize ordering of tweets.
positive = limit positive $N; -- Take N positive examples.
-- Negative examples
negative = filter status by ContainsNegativeEmoticon(text) and not ContainsPositiveEmoticon(text) and length(text) > 20;
negative = foreach negative generate -1 as label, RemoveNegativeEmoticons(text) as text, random;
negative = order negative by random; -- Randomize ordering of tweets
negative = limit negative $N; -- Take N negative examples
training = union positive, negative;
-- Randomize order of positive and negative examples
training = foreach training generate $0 as label, $1 as text, RANDOM() as random;
training = order training by random parallel $PARTITIONS;
training = foreach training generate label, text;
store training into '$OUTPUT' using TextLRClassifierBuilder();
The TextLRClassifierBuilder storage function encapsulates the SGD implementation of the logistic regression model.
Evaluating the Model
define TextScoreLR TextScoreLR('hdfs://path/model');
data = load `testdata' using PigStorage() as (label: int, tweet: chararray);
data = foreach data generate label, (TextScoreLR(tweet) > 0 ? 1 : -1) as prediction;
results = foreach data generate (label == prediction ? 1 : 0) as matching;
cnt = group results by matching;
cnt = foreach cnt generate group, COUNT(results);
-- Outputs number of incorrect and correct classification decisions
dump cnt;