Implementing word co-occurrence in Hadoop (Word co-occurrence)

Tags: hadoop, word co-occurrence | Published: 2013-07-24 23:12 | Author: doc_sgl
Source: http://blog.csdn.net

Co-occurring words (word co-occurrence) are two words that appear next to each other in a sentence; each pair of adjacent words forms one co-occurrence pair. In the code below each pair is normalized so that its two words are in alphabetical order, which is why the sample output lists keys such as love:u rather than u:love.
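The pair-extraction step can be sketched in plain Java with no Hadoop dependencies (the names `PairSketch` and `extractPairs` are hypothetical; the cleanup regex and the alphabetical-ordering rule mirror the mapper and `TextPair.set` shown later):

```java
import java.util.ArrayList;
import java.util.List;

public class PairSketch {
    // Mirrors the mapper: strip punctuation, lowercase, split on whitespace,
    // then emit each adjacent pair with its two words in lexicographic order.
    static List<String> extractPairs(String line) {
        String cleaned = line.replaceAll("[^a-zA-Z0-9-']", " ").toLowerCase().trim();
        String[] words = cleaned.split(" +");
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < words.length - 1; i++) {
            String a = words[i], b = words[i + 1];
            // Order the pair so "U love" and "love U" map to the same key.
            pairs.add(a.compareTo(b) <= 0 ? a + ":" + b : b + ":" + a);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extractPairs("I Love U.")); // [i:love, love:u]
    }
}
```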

Sample Input:

a b cc, c d d c
I Love U.
dd ee f g s sa dew ad da
So shaken as we are, so wan with care.
Find we a time for frighted peace to pant.
And breathe short-winded accents of new broil.
To be commenced in strands afar remote.
I Love U U love i.
i i i i

Sample Output:

a:b 1
a:time 1
a:we 1
accents:of 1
accents:short-winded 1
ad:da 1
ad:dew 1
afar:remote 1
afar:strands 1
and:breathe 1
are:so 1
are:we 1
as:shaken 1
as:we 1
b:cc 1
be:commenced 1
be:to 1
breathe:short-winded 1
broil:new 1
c:cc 1
c:d 2
care:with 1
commenced:in 1
d:d 1
dd:ee 1
dew:sa 1
ee:f 1
f:g 1
find:we 1
for:frighted 1
for:time 1
frighted:peace 1
g:s 1
i:i 3
i:love 3
in:strands 1
love:u 3
new:of 1
pant:to 1
peace:to 1
s:sa 1
shaken:so 1
so:wan 1
u:u 1
wan:with 1

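To see where counts such as love:u 3 come from, the whole map-then-sum pipeline can be checked in memory (a sketch; `CountSketch` and `countPairs` are hypothetical names, and a `TreeMap` stands in for the sorted shuffle that MapReduce performs):

```java
import java.util.Map;
import java.util.TreeMap;

public class CountSketch {
    // In-memory equivalent of map + reduce: extract adjacent pairs from each
    // line and sum the occurrences of every normalized pair key.
    static Map<String, Integer> countPairs(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String cleaned = line.replaceAll("[^a-zA-Z0-9-']", " ").toLowerCase().trim();
            String[] words = cleaned.split(" +");
            for (int i = 0; i < words.length - 1; i++) {
                String a = words[i], b = words[i + 1];
                String key = a.compareTo(b) <= 0 ? a + ":" + b : b + ":" + a;
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = countPairs(
            new String[] {"I Love U.", "I Love U U love i.", "i i i i"});
        System.out.println(c.get("love:u")); // 3, matching the sample output
        System.out.println(c.get("i:i"));    // 3
    }
}
```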
Code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;

public class CoOccurrence {


  public static class TextPair implements WritableComparable<TextPair> {
    private Text first;
    private Text second;
    
    public TextPair(){
    	set(new Text(), new Text());
    }
    public TextPair(String left, String right) {
        set(new Text(left), new Text(right));
    }
    public TextPair(Text left, Text right) {
    	set(left, right);
    }
    
    public void set(Text left, Text right){
    	String l = left.toString();
    	String r = right.toString();
    	int cmp = l.compareTo(r);    	
    	if(cmp <= 0){
    		this.first = left;
    		this.second = right;
    	}else{
    		this.first = right;
    		this.second = left;
    	}
    }
    
    public Text getFirst() {
      return first;
    }
    public Text getSecond() {
      return second;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      first.readFields(in);
      second.readFields(in);
    }
    @Override
    public void write(DataOutput out) throws IOException {
    	first.write(out);
    	second.write(out);
    }
    @Override
    public int hashCode() {
      return first.hashCode() * 163 + second.hashCode(); // combine the two hashes with an arbitrary prime multiplier
    }
    @Override
    public boolean equals(Object o) {
      if (o instanceof TextPair) {
        TextPair tp = (TextPair) o;
        return first.equals(tp.first) && second.equals(tp.second);
      }
      return false;
    }
    @Override
    public String toString(){
    	return first + ":" + second;
    }
    @Override
    public int compareTo(TextPair tp) {
    	int cmp = first.compareTo(tp.first);
    	if(cmp != 0)
    		return cmp;
    	return second.compareTo(tp.second);
    }

    // A raw comparator that compares serialized TextPair records without deserializing them.
    public static class Comparator extends WritableComparator {
    	private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    	public Comparator() {
    		super(TextPair.class);
    	}
    	@Override
    	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2){
    		try {
    			int firstl1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
    			int firstl2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
    			int cmp = TEXT_COMPARATOR.compare(b1, s1, firstl1, b2, s2, firstl2);
    			if(cmp != 0)
    				return cmp;
    			return TEXT_COMPARATOR.compare(b1, s1 + firstl1, l1 - firstl1,
    										   b2, s2 + firstl2, l2 - firstl2);
    		}catch (IOException e) {
    			throw new IllegalArgumentException(e);
    		}
    	}
    }//End of Comparator
    static { // register this comparator
      WritableComparator.define(TextPair.class, new Comparator());
    }

    // Compare only the first part of the pair, so that reduce is called once for each value of the first part.
    public static class FirstComparator extends WritableComparator {
    	private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    	public FirstComparator() {
    		super(TextPair.class);
    	}  	
    	@Override
    	public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2){
    		try {
    			int firstl1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
    			int firstl2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
    			return TEXT_COMPARATOR.compare(b1, s1, firstl1, b2, s2, firstl2);
    		}catch (IOException e) {
    			throw new IllegalArgumentException(e);
    		}
    	}
    	/*
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
      	if (a instanceof TextPair && b instanceof TextPair)
      		return ((TextPair) a).first.compareTo(((TextPair) b).first);
      	return super.compare(a, b);
      }*/
    }//End of FirstComparator    
  }//End of TextPair
  
  //Partition based on the first part of the pair.
  public static class FirstPartitioner extends Partitioner<TextPair,IntWritable>{
    @Override
    public int getPartition(TextPair key, IntWritable value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions; // hash on the first word only; the mask keeps the index non-negative
    }
  }//End of FirstPartitioner

  public static class MyMapper extends Mapper<LongWritable, Text, TextPair, IntWritable> {    
    private final static IntWritable one = new IntWritable(1);
    private String pattern = "[^a-zA-Z0-9-']";

    @Override
    public void map(LongWritable inKey, Text inValue, Context context) throws IOException, InterruptedException {
    	// Strip punctuation, lowercase, and trim so split() yields no empty tokens.
    	String line = inValue.toString().replaceAll(pattern, " ").toLowerCase().trim();
    	String[] str = line.split(" +");
    	for (int i = 0; i < str.length - 1; i++) {
    		// Build a fresh TextPair per emission; reusing shared Text instances
    		// would let later iterations mutate a previously constructed key.
    		context.write(new TextPair(str[i], str[i + 1]), one);
    	}
    }
  }//End of MyMapper
  public static class MyReducer extends Reducer<TextPair, IntWritable, TextPair, IntWritable> {
	    private IntWritable result = new IntWritable();
	    
	    @Override
	    public void reduce(TextPair inKey, Iterable<IntWritable> inValues, Context context) throws IOException, InterruptedException {
	    	int sum = 0;
		      for (IntWritable val : inValues) {
		        sum += val.get();
		      }
		      result.set(sum);
		      context.write(inKey, result);
	    }
  }//End of MyReducer
  
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    //conf.set("Hadoop.job.ugi", "sunguoli,cs402");
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: CoOccurrence <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "Co-Occurrence");
    job.setJarByClass(CoOccurrence.class);
    
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(TextPair.class);
    job.setMapOutputValueClass(IntWritable.class);
    
    job.setCombinerClass(MyReducer.class);

    // Optionally partition and group by the first word of the pair, so that
    // reduce sees all pairs sharing a first word together:
    //job.setPartitionerClass(FirstPartitioner.class);
    //job.setGroupingComparatorClass(TextPair.FirstComparator.class);

    // the reduce output is Text, IntWritable
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(TextPair.class);
    job.setOutputValueClass(IntWritable.class);
    
    //FileInputFormat.addInputPath(job, new Path("../shakespeareinput"));
    //FileOutputFormat.setOutputPath(job, new Path("output"));
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }//End of main
}//End of CoOccurrence
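The commented-out FirstPartitioner route can also be sketched without Hadoop. The idea is to hash only the first word of the pair so that every pair sharing a first word reaches the same reducer (`PartitionSketch` and `partitionForFirstWord` are hypothetical names; masking with `Integer.MAX_VALUE` avoids the negative-index problem that `Math.abs`-style hashing hits at `Integer.MIN_VALUE`):

```java
public class PartitionSketch {
    // Hash-partition on the first word only: pairs like love:u and love:i
    // always land in the same partition. The bitmask keeps the index
    // non-negative even when hashCode() returns a negative value.
    static int partitionForFirstWord(String firstWord, int numPartitions) {
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p = partitionForFirstWord("love", 4);
        System.out.println(p >= 0 && p < 4); // true: always a valid partition index
    }
}
```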

       一,简单模板导出(不含图片, 不含表格循环).          1, 新建一个word文档, 输入如下类容:.          2, 将该word文件另存为xml格式(注意是另存为,不是直接改扩展名).          3, 将xml文件的扩展名直接改为ftl.          4, 用java代码完成导出(需要导入freemarker.jar).