Secondary Sort, Grouping Comparators, and Partitioners in MapReduce - FacingTheSunCN's column - Blog Channel - CSDN.NET

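In MapReduce programs we often need to sort the values that belong to the same key. This is called "secondary sort": the key and the value are combined into a new composite key, and the framework's sort phase then orders records by that composite key. In Hadoop 1.0.4 the ordering is configured with setSortComparatorClass(), but you have to implement the sort comparator yourself. Below is an example of a custom comparator; it expects composite keys made of three ':'-separated fields.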

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public static class SortComparator extends WritableComparator {

    protected SortComparator() {
        // 'true' asks WritableComparator to create Text instances, so that
        // compare() below receives deserialized keys.
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] strs_a = ((Text) a).toString().split(":");
        String[] strs_b = ((Text) b).toString().split(":");

        if ((strs_a.length != 3) || (strs_b.length != 3)) {
            // Fail the task with an exception instead of calling System.exit()
            // from inside the framework.
            throw new IllegalArgumentException(
                    "Dimension error in SortComparator: " + a + " / " + b);
        }

        // Primary order: field 0, compared as an integer.
        int byFirst = Integer.compare(Integer.parseInt(strs_a[0]),
                Integer.parseInt(strs_b[0]));
        if (byFirst != 0) {
            return byFirst;
        }
        // Secondary order: field 1, compared as a double. Unlike the earlier
        // version, equal values now return 0, as a comparator must.
        return Double.compare(Double.parseDouble(strs_a[1]),
                Double.parseDouble(strs_b[1]));
    }
}

Then register it on the job:

job.setSortComparatorClass(SortComparator.class);
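The ordering that SortComparator is meant to produce can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name CompositeKeySortSketch, the method sortKeys, and the sample keys are all made up for illustration. It only assumes the three-field "a:b:c" key layout used in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are involved.
class CompositeKeySortSketch {

    // Orders "field0:field1:field2" strings by field 0 (as int), then by
    // field 1 (as double), the same comparison order as SortComparator.
    static final Comparator<String> BY_FIRST_THEN_SECOND = (a, b) -> {
        String[] fa = a.split(":");
        String[] fb = b.split(":");
        int byFirst = Integer.compare(Integer.parseInt(fa[0]), Integer.parseInt(fb[0]));
        if (byFirst != 0) {
            return byFirst;
        }
        return Double.compare(Double.parseDouble(fa[1]), Double.parseDouble(fb[1]));
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(BY_FIRST_THEN_SECOND);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("2:0.5:x", "1:3.0:x", "1:0.25:x", "10:1.0:x");
        System.out.println(sortKeys(keys));
    }
}
```

Because field 0 is parsed as an integer, "10" sorts after "2" here, which a plain string comparison of the keys would not give.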

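Because we are using secondary sort, the key is now the composite key (as described above, the original key and the value merged into a new key). We therefore need to define a grouping comparator. Its job is to make sure that, on the reduce side, records that share the key we actually care about (the key before merging) are delivered to the same reduce() call. The official documentation describes it as: "Define the comparator that controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)". Below is an example of a grouping comparator.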

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public static class GroupComparator extends WritableComparator {

    protected GroupComparator() {
        // 'true' asks WritableComparator to create Text instances, so that
        // compare() below receives deserialized keys.
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] strs_a = ((Text) a).toString().split(":");
        String[] strs_b = ((Text) b).toString().split(":");

        if ((strs_a.length != 3) || (strs_b.length != 3)) {
            // Fail the task with an exception instead of calling System.exit()
            // from inside the framework.
            throw new IllegalArgumentException(
                    "Dimension error in GroupComparator: " + a + " / " + b);
        }

        // Group on fields 0 and 2. The fields are compared one at a time
        // rather than concatenated, so that e.g. ("1", "23") and ("12", "3")
        // are not mistaken for the same group.
        int byFirst = strs_a[0].compareTo(strs_b[0]);
        if (byFirst != 0) {
            return byFirst;
        }
        return strs_a[2].compareTo(strs_b[2]);
    }
}

Then register it on the job:

job.setGroupingComparatorClass(GroupComparator.class);
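To make the effect of a grouping comparator more concrete, here is a small plain-Java sketch of how grouping can be thought of: walk the keys in sorted order and start a new group whenever the comparator says the current key differs from the previous one. This is an approximation for illustration, not the Hadoop implementation; the class GroupingSketch, the method group, and the sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are involved.
class GroupingSketch {

    // Two keys belong to the same group when fields 0 and 2 both match.
    static final Comparator<String> SAME_GROUP = (a, b) -> {
        String[] fa = a.split(":");
        String[] fb = b.split(":");
        int byFirst = fa[0].compareTo(fb[0]);
        return byFirst != 0 ? byFirst : fa[2].compareTo(fb[2]);
    };

    // Walks keys that are assumed to be sorted already and opens a new
    // group each time the comparator reports a change from the previous key.
    static List<List<String>> group(List<String> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (previous == null || SAME_GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("1:0.25:x", "1:3.0:x", "2:0.5:x")));
    }
}
```

In this sketch only neighbouring keys are merged, so keys of one group that are separated by a key of another group end up in separate groups. It is worth checking that the sort order keeps each group's keys next to each other.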

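In addition, because the key we actually emit differs from the key we care about, we need to define our own partitioner to "trick" the framework into sending records with the same logical key to the same reducer. Below is an example of a partitioner.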

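import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public static class Patitioner extends
        HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String[] new_key = key.toString().split(":");
        if (new_key.length != 3) {
            // Fail the task with an exception instead of calling System.exit()
            // from inside the framework.
            throw new IllegalArgumentException(
                    "Dimension error in partitioner: " + key);
        }
        // Hash only the first field, so that every composite key sharing it
        // goes to the same reducer.
        return super.getPartition(new Text(new_key[0]), value,
                numReduceTasks);
    }
}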

Then register it on the job:

job.setPartitionerClass(Patitioner.class);
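The arithmetic behind this partitioner is simple enough to show on its own. The sketch below applies the usual HashPartitioner formula, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to the first field of a composite key. It is a plain-Java illustration: the class PartitionSketch and the method partitionFor are hypothetical, and String.hashCode() stands in for Text.hashCode(), so the partition numbers will generally not match what a real job computes.

```java
// Plain-Java illustration only; no Hadoop classes are involved.
class PartitionSketch {

    // Picks a partition from field 0 of a "field0:field1:field2" key.
    // Assumption: String.hashCode() is used in place of Text.hashCode(),
    // which is a different hash function.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        String firstField = compositeKey.split(":")[0];
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is never negative.
        return (firstField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("7:0.1:x", 4));
        System.out.println(partitionFor("7:9.9:y", 4));
    }
}
```

Both calls in main return the same partition, because only the first field takes part in the hash.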

The Partitioner and the GroupingComparator can be a little confusing, because their roles seem to overlap.

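The Partitioner sends records with the same key (the logical key the user has in mind) to the same reducer. Once the records are there, the reducer only sees the actual key emitted by the map; deciding which part of that actual key counts as "the key" for a single reduce() call is the job of the GroupingComparator.
The GroupingComparator makes the reducer handle all records with the same key in a single reduce() call.
The key used by the Partitioner and the key used by the GroupingComparator do not have to be the same (in my example, the Partitioner uses only the first field, while the GroupingComparator uses the first and third fields).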
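The division of labour between the two can be sketched by listing, for each partition, which grouping keys land in it. The code below is a rough plain-Java illustration under the same assumptions as this post's example (three ':'-separated fields, partition on field 0, group on fields 0 and 2). The class PartitionVersusGroupSketch and the method groupsPerPartition are hypothetical names, String.hashCode() again stands in for Text.hashCode(), and sorting is left out entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration only; no Hadoop classes are involved.
class PartitionVersusGroupSketch {

    // Maps each partition index to the distinct grouping keys
    // ("field0:field2") that are routed to it.
    static Map<Integer, Set<String>> groupsPerPartition(List<String> keys, int numReduceTasks) {
        Map<Integer, Set<String>> plan = new TreeMap<>();
        for (String key : keys) {
            String[] f = key.split(":");
            // Partitioner view: only field 0 decides the reducer.
            int partition = (f[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            // Grouping view: fields 0 and 2 together identify a group.
            plan.computeIfAbsent(partition, p -> new TreeSet<>()).add(f[0] + ":" + f[2]);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("7:0.1:x", "7:0.2:y", "7:0.3:x", "8:0.5:x");
        System.out.println(groupsPerPartition(keys, 4));
    }
}
```

With these sample keys, the two groups that share the first field "7" fall into one partition, which matches the point above: the Partitioner decides which reducer receives a record, and the GroupingComparator decides how that reducer's records are split into groups.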


Tags: database, hadoop, tech


Posted by IT瘾 on June 21, 2013 at 02:39:00 PM
