How to Use Hadoop's Partitioner - 三劫散仙 - ITeye Tech Site
What the Partitioner does:
It hashes the keys emitted on the map side so that the data is spread evenly across the reducers for subsequent processing, avoiding hot spots.
Hadoop's default partitioner is HashPartitioner, whose source looks like this:
- /**
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
- package org.apache.hadoop.mapreduce.lib.partition;
- import org.apache.hadoop.mapreduce.Partitioner;
- /** Partition keys by their {@link Object#hashCode()}. */
- public class HashPartitioner<K, V> extends Partitioner<K, V> {
- /** Use {@link Object#hashCode()} to partition. */
- public int getPartition(K key, V value,
- int numReduceTasks) {
- //AND the key's hash code with Integer.MAX_VALUE to mask the sign bit, so a negative hashCode cannot produce a negative partition number
- return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
- }
- }
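To see what the default formula does, the small standalone snippet below (a minimal sketch, not part of the original post; the class name and sample keys are made up for illustration) evaluates (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks for a few strings. Masking with Integer.MAX_VALUE clears the sign bit, so even a negative hashCode maps to a valid partition index:
- public class HashPartitionDemo {
-     public static void main(String[] args) {
-         int numReduceTasks = 3;
-         //sample keys, for illustration only
-         String[] keys = {"中国", "河南省", "大"};
-         for (String key : keys) {
-             //same formula as HashPartitioner.getPartition
-             int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
-             System.out.println(key + " -> partition " + partition);
-         }
-     }
- }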
In most cases the default partitioner is all we need, but occasionally a special requirement calls for a custom Partitioner. Consider the following case:
Partition the data below by string length: keys of length 1 go to one partition, keys of length 2 to another, and keys of length 3 to a third.
- 河南省;1
- 河南;2
- 中国;3
- 中国人;4
- 大;1
- 小;3
- 中;11
The default partitioner cannot do this, so we have to write our own. Two things matter: since three partitions are needed, the number of reduce tasks must be set to 3; and inside getPartition the partition must be chosen by the key's length rather than by its hash code. The core code is as follows:
- /**
- * Partitioner
- *
- *
- * */
- public static class PPartition extends Partitioner<Text, Text>{
- @Override
- public int getPartition(Text arg0, Text arg1, int arg2) {
- /**
- * Custom partitioning: send keys of different lengths to different reducers.
- *
- * The data contains only three distinct key lengths, so the number of
- * reduce tasks is set to 3: as many reducers as there are partitions.
- * */
- String key=arg0.toString();
- if(key.length()==1){
- return 1%arg2;
- }else if(key.length()==2){
- return 2%arg2;
- }else if(key.length()==3){
- return 3%arg2;
- }
- return 0;
- }
- }
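With numReduceTasks set to 3, this logic sends length-1 keys to partition 1, length-2 keys to partition 2, and length-3 keys to partition 0 (since 3 % 3 == 0). The plain-Java check below (a hypothetical helper class that simply mirrors the getPartition logic, not part of the original job) shows where each sample key would land:
- public class PPartitionCheck {
-     public static void main(String[] args) {
-         int numReduceTasks = 3;
-         String[] keys = {"河南省", "河南", "中国", "中国人", "大", "小", "中"};
-         for (String key : keys) {
-             int partition;
-             if (key.length() == 1) {
-                 partition = 1 % numReduceTasks;   //length 1 -> partition 1
-             } else if (key.length() == 2) {
-                 partition = 2 % numReduceTasks;   //length 2 -> partition 2
-             } else {
-                 partition = 3 % numReduceTasks;   //length 3 -> partition 0
-             }
-             System.out.println(key + " (length " + key.length() + ") -> partition " + partition);
-         }
-     }
- }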
The complete code is shown below:
- package com.partition.test;
- import java.io.IOException;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapred.JobConf;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Partitioner;
- import org.apache.hadoop.mapreduce.Reducer;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
- import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
- /**
- * @author qindongliang
- *
- * Big data discussion group (QQ): 376932160
- *
- *
- * **/
- public class MyTestPartition {
- /**
- * Map task
- *
- * */
- public static class PMapper extends Mapper<LongWritable, Text, Text, Text>{
- @Override
- protected void map(LongWritable key, Text value,Context context)
- throws IOException, InterruptedException {
- // System.out.println("进map了");
- //mos.write(namedOutput, key, value);
- //each input line looks like "河南省;1": key and value separated by a semicolon
- String[] ss=value.toString().split(";");
- context.write(new Text(ss[0]), new Text(ss[1]));
- }
- }
- /**
- * Partitioner
- *
- *
- * */
- public static class PPartition extends Partitioner<Text, Text>{
- @Override
- public int getPartition(Text arg0, Text arg1, int arg2) {
- /**
- * Custom partitioning: send keys of different lengths to different reducers.
- *
- * The data contains only three distinct key lengths, so the number of
- * reduce tasks is set to 3: as many reducers as there are partitions.
- * */
- String key=arg0.toString();
- if(key.length()==1){
- return 1%arg2;
- }else if(key.length()==2){
- return 2%arg2;
- }else if(key.length()==3){
- return 3%arg2;
- }
- return 0;
- }
- }
- /***
- * Reduce task
- *
- * **/
- public static class PReduce extends Reducer<Text, Text, Text, Text>{
- @Override
- protected void reduce(Text arg0, Iterable<Text> arg1, Context arg2)
- throws IOException, InterruptedException {
- String key=arg0.toString();
- System.out.println("key==> "+key);
- for(Text t:arg1){
- //System.out.println("Reduce: "+arg0.toString()+" "+t.toString());
- arg2.write(arg0, t);
- }
- }
- }
- public static void main(String[] args) throws Exception{
- JobConf conf=new JobConf(MyTestPartition.class);
- //Configuration conf=new Configuration();
- conf.set("mapred.job.tracker","192.168.75.130:9001");
- //submit the job with a pre-built jar so the cluster can load our classes
- conf.setJar("tt.jar");
- //note: the jar must be set before the Job is created, otherwise an error is reported
- /** Job setup **/
- Job job=new Job(conf, "testpartition");
- job.setJarByClass(MyTestPartition.class);
- System.out.println("Mode: "+conf.get("mapred.job.tracker"));
- // job.setCombinerClass(PCombine.class);
- job.setPartitionerClass(PPartition.class);
- job.setNumReduceTasks(3);//one reduce task per partition
- job.setMapperClass(PMapper.class);
- // MultipleOutputs.addNamedOutput(job, "hebei", TextOutputFormat.class, Text.class, Text.class);
- // MultipleOutputs.addNamedOutput(job, "henan", TextOutputFormat.class, Text.class, Text.class);
- job.setReducerClass(PReduce.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(Text.class);
- String path="hdfs://192.168.75.130:9000/root/outputdb";
- FileSystem fs=FileSystem.get(conf);
- Path p=new Path(path);
- if(fs.exists(p)){
- fs.delete(p, true);
- System.out.println("输出路径存在,已删除!");
- }
- FileInputFormat.setInputPaths(job, "hdfs://192.168.75.130:9000/root/input");
- FileOutputFormat.setOutputPath(job, p);
- System.exit(job.waitForCompletion(true) ? 0 : 1);
- }
- }
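Assuming the job runs against the sample input above with three reducers, the output directory should contain one file per partition, with the keys grouped by length roughly as follows (a hedged illustration only; the exact per-file ordering depends on how the Text keys sort):
- part-r-00000   length-3 keys:  中国人 4   河南省 1
- part-r-00001   length-1 keys:  中 11   大 1   小 3
- part-r-00002   length-2 keys:  中国 3   河南 2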