Hadoop Inverted Index

Tags: hadoop, inverted index | Posted: 2013-05-13 23:50 | Author: limiteeWALTWO
Source: http://blog.csdn.net

I have seen plenty of Hadoop inverted-index examples, but I wanted to write one of my own that reflects my understanding of MapReduce in Hadoop.

Suppose we have the following three articles:

accident.txt

CHENGDU - Death toll from a colliery blast on Saturday in southwest China's Sichuan Province rose to 27, local authorities said.
As of 11:13 pm, 81 miners were rescued. Sixteen of them were injured and are treated in local hospitals, sources said.
The accident occurred at around 2 pm in Taozigou coal mine, Luxian County in the city of Luzhou, according to an official statement.
An investigation into the accident is underway.
It is the second coal mine accident in 24 hours in the country.
On Friday evening, 12 miners were killed and two others injured in a colliery gas explosion in southwest China's Guizhou Province, local authorities said on Saturday.
Taozigou coal mine [Photo/Xinhua]
Taozigou coal mine [Photo/Xinhua]

million.txt

NANCHANG - Rainstorms have battered southern and eastern China over the past five days, killing six people in Hunan Province, local authorities said Saturday.
Continuous strong rain started to hit the central China province on Monday killing six people, the Hunan provincial flood prevention and drought control headquarters said.
As of Saturday, rainstorms have affected about 850,000 people, toppled more than 2,200 homes and forced 14,000 citizens to relocate in Hunan.
Heavy rainfall has also led to the flooding of major reservoirs and rivers.
Rainstorms have affected about 196,800 people in east China's Jiangxi Province, local authorities said Saturday.
As of 11 p.m. Friday, the heavy rain, which started from Tuesday, has battered 26 counties in Jiangxi, the provincial flood prevention and drought control headquarters said.
Local governments have relocated 6,019 residents to avoid potential risks.
The downpours have also damaged or destroyed 202 houses and ruined 16,710 hectares of crops, as well as causing high water levels of rivers and lakes, and several landslides.
Flood prevention authorities in Jiangxi warned of floods on Thursday due to the rising rivers and lakes.
The headquarters also ordered several reservoirs in the province to release water as levels had gone over or were approaching alarm lines because of the heavy rain.
No casualties have been reported as a result of the rainfall in Jiangxi.

Philippines.txt

TAIPEI - Taipei mayor Hau Lung-bin announced on Saturday the suspension of inter-city exchanges with the Philippines after a Taiwanese fisherman was shot dead by Philippine coast guards at sea.
The Philippines will also not be allowed to take part in Dragon Boat Festival races in Taipei on June 12, Hau said.
Hau condemned the Philippines over the shooting, and called it a violent act to fire upon an unarmed fisherman. He urged the Philippine government to apologize, release investigation reports and hold those responsible to account.
He also advised the Taiwanese authorities to take a hard stance on the Philippines by halting Philippine-bound tourism, suspending labor imports from the country and increasing fishing protection patrols.
The shooting happened on Thursday morning 164 nautical miles southeast of the southernmost tip of Taiwan, according to the island's coast guard authority.
The victim was identified as Hung Shih-Cheng, 65, one of four crew members of the Taiwanese fishing vessel Guang Ta Hsin 28. Hung's body was taken back to Taiwan early Saturday morning.

The result I expect looks like this:

word	total-filename:count-filename:count-filename:count

For example: Saturday	7-million.txt:3-Philippines.txt:2-accident.txt:2
If a word never appears in a given article, that article simply does not show up on the word's line.

First, we need a utility class that breaks each line's String into individual words:

import java.util.ArrayList;
import java.util.List;

/**
 * @author hadoop
 *
 */
public class StringUtil {
	
	/**
	 * Extract the words from a string. Only ASCII letters count as word
	 * characters; everything else is treated as a delimiter.
	 * @param value the line of text to split
	 * @return the words found, in order
	 */
	public static List<String> getWords(String value)
	{
		List<String> wordList = new ArrayList<String>();
		StringBuilder word = new StringBuilder();
		for(char c : value.toCharArray())
		{
			if((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
			{
				word.append(c);
			}
			else if(word.length() > 0)
			{
				wordList.add(word.toString());
				word.setLength(0);
			}
		}
		// Don't drop a word that runs all the way to the end of the line.
		if(word.length() > 0)
		{
			wordList.add(word.toString());
		}
		return wordList;
	}
}
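
As a quick sanity check, the snippet below shows how it behaves (a hypothetical demo class, not from the original post):

import java.util.List;

public class StringUtilDemo {
	public static void main(String[] args) {
		// Every non-letter character (spaces, hyphens, digits, punctuation)
		// acts as a delimiter, so "27" is dropped entirely.
		List<String> words = StringUtil.getWords("CHENGDU - Death toll rose to 27.");
		System.out.println(words); // [CHENGDU, Death, toll, rose, to]
	}
}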
Then comes our MapReduce job itself:

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.test.util.HdfsFileUtil;
import com.test.util.Prop;
import com.test.util.StringUtil;

/**
 * @author hadoop
 * 
 */
public class Search {

	private static Log log = LogFactory.getLog(Search.class);

	public static class Map extends Mapper<Object, Text, Text, MapWritable> {

		private static IntWritable data = new IntWritable(1);

		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			// The name of the file this split came from serves as the
			// document id of the index.
			FileSplit split = (FileSplit) context.getInputSplit();
			MapWritable map = new MapWritable();
			map.put(new Text(split.getPath().getName()), data);
			// Emit one <word, {fileName: 1}> record per occurrence; all of
			// the counting is left to the reducer. context.write serializes
			// the value immediately, so the same MapWritable can be reused.
			for(String word : StringUtil.getWords(value.toString()))
			{
				context.write(new Text(word), map);
			}
		}
	}

	public static class Reduce extends Reducer<Text, MapWritable, Text, Text> {

		public void reduce(Text key, Iterable<MapWritable> values,
				Context context) throws IOException, InterruptedException {
			// Total occurrences of the word across all files, plus a
			// per-file tally.
			int count = 0;
			java.util.Map<String, Integer> countMap = new HashMap<String, Integer>();
			for(MapWritable curMap : values)
			{
				// Each value holds exactly one entry: {fileName: 1}.
				String fileName = curMap.keySet().iterator().next().toString();
				if(countMap.containsKey(fileName))
				{
					countMap.put(fileName, countMap.get(fileName) + 1);
				}
				else
				{
					countMap.put(fileName, 1);
				}
				count++;
			}
			// Format the line as: total-file:count-file:count-...
			StringBuilder value = new StringBuilder(String.valueOf(count));
			for(String fileName : countMap.keySet())
			{
				value.append("-").append(fileName).append(":").append(countMap.get(fileName));
			}
			context.write(key, new Text(value.toString()));
		}
	}

	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception {
		String inputPath = "search_in";
		String outPath = "search_out";
		Configuration conf = new Configuration();
		conf.set("mapred.job.tracker", Prop.HADOOP_MAPRED_JOB_TRACKER);
		conf.set("fs.default.name", Prop.HDFS_HOST);
		// Clear out any stale data, then push the input files up to HDFS.
		HdfsFileUtil.checkAndDelete(conf, "/" + Prop.HDFS_DIRECTORY + "/" + inputPath);
		HdfsFileUtil.checkAndDelete(conf, "/" + Prop.HDFS_DIRECTORY + "/" + outPath);
		HdfsFileUtil.upload(conf, Prop.LOCAL_HDFS_DIRECTORY + "/" + inputPath, Prop.HDFS_DIRECTORY + "/" + inputPath);

		Job job = new Job(conf, "search");
		job.setJarByClass(Search.class);
		// Set the Mapper and Reducer classes
		job.setMapperClass(Map.class);
		job.setReducerClass(Reduce.class);
		// Set the output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(MapWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		// Set the input and output directories
		FileInputFormat.addInputPath(job, new Path(inputPath));
		FileOutputFormat.setOutputPath(job, new Path(outPath));

		boolean exitStatus = job.waitForCompletion(true);
		HdfsFileUtil.download(conf, "/" + Prop.HDFS_DIRECTORY + "/" + outPath, Prop.LOCAL_HDFS_DIRECTORY + "/" + outPath);
		System.exit(exitStatus ? 0 : 1);
	}
}
You can ignore the HdfsFileUtil class; it just wraps a few HDFS file-system operations to make debugging more convenient.
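
That said, for readers who want the example to compile, here is a minimal sketch of what HdfsFileUtil might look like, assuming it simply wraps the standard org.apache.hadoop.fs.FileSystem API. Only the method names checkAndDelete, upload, and download come from main() above; treat the bodies as my guess. (Prop is presumably just a holder for site-specific constants such as HDFS_HOST.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileUtil {

	// Delete the given HDFS path (recursively) if it exists.
	public static void checkAndDelete(Configuration conf, String path) throws IOException {
		FileSystem fs = FileSystem.get(conf);
		Path p = new Path(path);
		if(fs.exists(p))
		{
			fs.delete(p, true);
		}
	}

	// Copy a local file or directory up to HDFS.
	public static void upload(Configuration conf, String localPath, String hdfsPath) throws IOException {
		FileSystem fs = FileSystem.get(conf);
		fs.copyFromLocalFile(new Path(localPath), new Path(hdfsPath));
	}

	// Copy an HDFS file or directory down to the local file system.
	public static void download(Configuration conf, String hdfsPath, String localPath) throws IOException {
		FileSystem fs = FileSystem.get(conf);
		fs.copyToLocalFile(new Path(hdfsPath), new Path(localPath));
	}
}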

The key part is the principle:

The map phase emits records of the form:

<word, <filename, count>>
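
For example, the line "Taozigou coal mine [Photo/Xinhua]" from accident.txt produces five records:

<Taozigou, {accident.txt: 1}>
<coal, {accident.txt: 1}>
<mine, {accident.txt: 1}>
<Photo, {accident.txt: 1}>
<Xinhua, {accident.txt: 1}>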


The reduce phase is then easy to understand: it only has to total up each word's occurrences, tally the occurrences per file, and format the line. Write it out and we're done.
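
To make the shuffle concrete, here is a small stand-alone trace (a hypothetical demo class, not part of the job) that re-runs the reducer's tallying logic on the seven <Saturday, {file: 1}> records produced from the sample articles:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceTrace {
	public static void main(String[] args) {
		// File names carried by the seven <Saturday, {file: 1}> records
		// that the shuffle groups under the key "Saturday".
		List<String> values = Arrays.asList(
				"million.txt", "million.txt", "million.txt",
				"Philippines.txt", "Philippines.txt",
				"accident.txt", "accident.txt");

		int count = 0;
		// LinkedHashMap keeps insertion order so the trace is deterministic;
		// the real reducer uses a HashMap, so its file order is arbitrary.
		Map<String, Integer> countMap = new LinkedHashMap<String, Integer>();
		for(String fileName : values)
		{
			if(countMap.containsKey(fileName))
			{
				countMap.put(fileName, countMap.get(fileName) + 1);
			}
			else
			{
				countMap.put(fileName, 1);
			}
			count++;
		}

		StringBuilder line = new StringBuilder(String.valueOf(count));
		for(String fileName : countMap.keySet())
		{
			line.append("-").append(fileName).append(":").append(countMap.get(fileName));
		}
		// Prints: Saturday	7-million.txt:3-Philippines.txt:2-accident.txt:2
		System.out.println("Saturday\t" + line);
	}
}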

The resulting output file is as follows:

An	1-accident.txt:1
As	3-million.txt:2-accident.txt:1
Boat	1-Philippines.txt:1
CHENGDU	1-accident.txt:1
Cheng	1-Philippines.txt:1
China	5-million.txt:3-accident.txt:2
Continuous	1-million.txt:1
County	1-accident.txt:1
Death	1-accident.txt:1
Dragon	1-Philippines.txt:1
Festival	1-Philippines.txt:1
Flood	1-million.txt:1
Friday	2-million.txt:1-accident.txt:1
Guang	1-Philippines.txt:1
Guizhou	1-accident.txt:1
Hau	3-Philippines.txt:3
He	2-Philippines.txt:2
Heavy	1-million.txt:1
Hsin	1-Philippines.txt:1
Hunan	3-million.txt:3
Hung	2-Philippines.txt:2
It	1-accident.txt:1
Jiangxi	4-million.txt:4
June	1-Philippines.txt:1
Local	1-million.txt:1
Lung	1-Philippines.txt:1
Luxian	1-accident.txt:1
Luzhou	1-accident.txt:1
Monday	1-million.txt:1
NANCHANG	1-million.txt:1
No	1-million.txt:1
On	1-accident.txt:1
Philippine	3-Philippines.txt:3
Philippines	4-Philippines.txt:4
Photo	2-accident.txt:2
Province	4-million.txt:2-accident.txt:2
Rainstorms	2-million.txt:2
Saturday	7-million.txt:3-Philippines.txt:2-accident.txt:2
Shih	1-Philippines.txt:1
Sichuan	1-accident.txt:1
Sixteen	1-accident.txt:1
TAIPEI	1-Philippines.txt:1
Ta	1-Philippines.txt:1
Taipei	2-Philippines.txt:2
Taiwan	2-Philippines.txt:2
Taiwanese	3-Philippines.txt:3
Taozigou	3-accident.txt:3
The	6-million.txt:2-Philippines.txt:3-accident.txt:1
Thursday	2-million.txt:1-Philippines.txt:1
Tuesday	1-million.txt:1
Xinhua	2-accident.txt:2
a	6-million.txt:1-Philippines.txt:3-accident.txt:2
about	2-million.txt:2
accident	3-accident.txt:3
according	2-Philippines.txt:1-accident.txt:1
account	1-Philippines.txt:1
act	1-Philippines.txt:1
advised	1-Philippines.txt:1
affected	2-million.txt:2
after	1-Philippines.txt:1
alarm	1-million.txt:1
allowed	1-Philippines.txt:1
also	5-million.txt:3-Philippines.txt:2
an	2-Philippines.txt:1-accident.txt:1
and	14-million.txt:9-Philippines.txt:3-accident.txt:2
announced	1-Philippines.txt:1
apologize	1-Philippines.txt:1
approaching	1-million.txt:1
are	1-accident.txt:1
around	1-accident.txt:1
as	5-million.txt:4-Philippines.txt:1
at	2-Philippines.txt:1-accident.txt:1
authorities	6-million.txt:3-Philippines.txt:1-accident.txt:2
authority	1-Philippines.txt:1
avoid	1-million.txt:1
back	1-Philippines.txt:1
battered	2-million.txt:2
be	1-Philippines.txt:1
because	1-million.txt:1
been	1-million.txt:1
bin	1-Philippines.txt:1
blast	1-accident.txt:1
body	1-Philippines.txt:1
bound	1-Philippines.txt:1
by	2-Philippines.txt:2
called	1-Philippines.txt:1
casualties	1-million.txt:1
causing	1-million.txt:1
central	1-million.txt:1
citizens	1-million.txt:1
city	2-Philippines.txt:1-accident.txt:1
coal	4-accident.txt:4
coast	2-Philippines.txt:2
colliery	2-accident.txt:2
condemned	1-Philippines.txt:1
control	2-million.txt:2
counties	1-million.txt:1
country	2-Philippines.txt:1-accident.txt:1
crew	1-Philippines.txt:1
crops	1-million.txt:1
damaged	1-million.txt:1
days	1-million.txt:1
dead	1-Philippines.txt:1
destroyed	1-million.txt:1
downpours	1-million.txt:1
drought	2-million.txt:2
due	1-million.txt:1
early	1-Philippines.txt:1
east	1-million.txt:1
eastern	1-million.txt:1
evening	1-accident.txt:1
exchanges	1-Philippines.txt:1
explosion	1-accident.txt:1
fire	1-Philippines.txt:1
fisherman	2-Philippines.txt:2
fishing	2-Philippines.txt:2
five	1-million.txt:1
flood	2-million.txt:2
flooding	1-million.txt:1
floods	1-million.txt:1
forced	1-million.txt:1
four	1-Philippines.txt:1
from	3-million.txt:1-Philippines.txt:1-accident.txt:1
gas	1-accident.txt:1
gone	1-million.txt:1
government	1-Philippines.txt:1
governments	1-million.txt:1
guard	1-Philippines.txt:1
guards	1-Philippines.txt:1
had	1-million.txt:1
halting	1-Philippines.txt:1
happened	1-Philippines.txt:1
hard	1-Philippines.txt:1
has	2-million.txt:2
have	6-million.txt:6
headquarters	3-million.txt:3
heavy	2-million.txt:2
hectares	1-million.txt:1
high	1-million.txt:1
hit	1-million.txt:1
hold	1-Philippines.txt:1
homes	1-million.txt:1
hospitals	1-accident.txt:1
hours	1-accident.txt:1
houses	1-million.txt:1
identified	1-Philippines.txt:1
imports	1-Philippines.txt:1
in	17-million.txt:7-Philippines.txt:2-accident.txt:8
increasing	1-Philippines.txt:1
injured	2-accident.txt:2
inter	1-Philippines.txt:1
into	1-accident.txt:1
investigation	2-Philippines.txt:1-accident.txt:1
is	2-accident.txt:2
island	1-Philippines.txt:1
it	1-Philippines.txt:1
killed	1-accident.txt:1
killing	2-million.txt:2
labor	1-Philippines.txt:1
lakes	2-million.txt:2
landslides	1-million.txt:1
led	1-million.txt:1
levels	2-million.txt:2
lines	1-million.txt:1
local	5-million.txt:2-accident.txt:3
m	1-million.txt:1
major	1-million.txt:1
mayor	1-Philippines.txt:1
members	1-Philippines.txt:1
miles	1-Philippines.txt:1
mine	4-accident.txt:4
miners	2-accident.txt:2
more	1-million.txt:1
morning	2-Philippines.txt:2
nautical	1-Philippines.txt:1
not	1-Philippines.txt:1
occurred	1-accident.txt:1
of	16-million.txt:8-Philippines.txt:5-accident.txt:3
official	1-accident.txt:1
on	8-million.txt:2-Philippines.txt:4-accident.txt:2
one	1-Philippines.txt:1
or	2-million.txt:2
ordered	1-million.txt:1
others	1-accident.txt:1
over	3-million.txt:2-Philippines.txt:1
p	1-million.txt:1
part	1-Philippines.txt:1
past	1-million.txt:1
patrols	1-Philippines.txt:1
people	4-million.txt:4
pm	2-accident.txt:2
potential	1-million.txt:1
prevention	3-million.txt:3
protection	1-Philippines.txt:1
province	2-million.txt:2
provincial	2-million.txt:2
races	1-Philippines.txt:1
rain	3-million.txt:3
rainfall	2-million.txt:2
rainstorms	1-million.txt:1
release	2-million.txt:1-Philippines.txt:1
relocate	1-million.txt:1
relocated	1-million.txt:1
reported	1-million.txt:1
reports	1-Philippines.txt:1
rescued	1-accident.txt:1
reservoirs	2-million.txt:2
residents	1-million.txt:1
responsible	1-Philippines.txt:1
result	1-million.txt:1
rising	1-million.txt:1
risks	1-million.txt:1
rivers	3-million.txt:3
rose	1-accident.txt:1
ruined	1-million.txt:1
s	5-million.txt:1-Philippines.txt:2-accident.txt:2
said	8-million.txt:4-Philippines.txt:1-accident.txt:3
sea	1-Philippines.txt:1
second	1-accident.txt:1
several	2-million.txt:2
shooting	2-Philippines.txt:2
shot	1-Philippines.txt:1
six	2-million.txt:2
sources	1-accident.txt:1
southeast	1-Philippines.txt:1
southern	1-million.txt:1
southernmost	1-Philippines.txt:1
southwest	2-accident.txt:2
stance	1-Philippines.txt:1
started	2-million.txt:2
statement	1-accident.txt:1
strong	1-million.txt:1
suspending	1-Philippines.txt:1
suspension	1-Philippines.txt:1
take	2-Philippines.txt:2
taken	1-Philippines.txt:1
than	1-million.txt:1
the	25-million.txt:10-Philippines.txt:11-accident.txt:4
them	1-accident.txt:1
those	1-Philippines.txt:1
tip	1-Philippines.txt:1
to	15-million.txt:6-Philippines.txt:7-accident.txt:2
toll	1-accident.txt:1
toppled	1-million.txt:1
tourism	1-Philippines.txt:1
treated	1-accident.txt:1
two	1-accident.txt:1
unarmed	1-Philippines.txt:1
underway	1-accident.txt:1
upon	1-Philippines.txt:1
urged	1-Philippines.txt:1
vessel	1-Philippines.txt:1
victim	1-Philippines.txt:1
violent	1-Philippines.txt:1
warned	1-million.txt:1
was	3-Philippines.txt:3
water	2-million.txt:2
well	1-million.txt:1
were	4-million.txt:1-accident.txt:3
which	1-million.txt:1
will	1-Philippines.txt:1
with	1-Philippines.txt:1
This matches the original design. Very good.


