Chapter 3 MapReduce Basics in Practice (File Merging)

Task description

Task for this level: use Map/Reduce programming to merge files and remove duplicate records.

Background

In the previous section we got a rough idea of how MapReduce is used. In this level we look at the Mapper class, the Reducer class and the Job class.

The Mapper class

First, let's look at the Mapper class.

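When writing a MapReduce program, you write a class that extends Mapper. Mapper is a generic type with four type parameters, which specify the types of the map() function's input key, input value, output key and output value. In the example from level 1, the input key is a long integer, the input value is a line of text, the output key is a word, and the output value is the number of times the word occurs.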

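Rather than using Java's built-in types directly, Hadoop provides its own set of basic types that are optimized for network serialization. These types live in the org.apache.hadoop.io package. Here we use LongWritable (the counterpart of Java's Long), Text (the counterpart of String) and IntWritable (the counterpart of Integer).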

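The input to map() is one key and one value. We usually first convert the Text value, which holds one line of input, to a Java String, and then process it with string-handling classes or other methods.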
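The map step described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Hadoop code: `long`, `String` and `Integer` stand in for LongWritable, Text and IntWritable so that it runs with the JDK alone, and the names `WordCountMapSketch` and `map` are hypothetical. It takes one line of input as a String, splits it on whitespace, and emits a (word, 1) pair for each token.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class WordCountMapSketch {
    // Mimics map(): the key is the position of the line (unused here),
    // the value is the text of the line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1)); // emit (word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "hello hadoop hello"));
    }
}
```

In a real job the pairs would be written with context.write() instead of being collected in a list.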

The Reducer class

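Likewise, Reducer has four type parameters that specify its input and output types. The input types of reduce() must match the output types of map(), which here are Text and IntWritable. In this example the output types of reduce() are also Text and IntWritable, for the word and its count respectively.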
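To make the type matching concrete, here is a minimal plain-Java sketch of the reduce step for word count. As an assumption of the sketch, `String` and `Integer` stand in for Text and IntWritable, and the names `WordCountReduceSketch` and `reduce` are hypothetical. The method receives one key together with all the values that the map step emitted for that key, and returns their sum.

```java
import java.util.Arrays;
import java.util.List;

class WordCountReduceSketch {
    // Mimics reduce(): the input types (String, Integer) match
    // the output types of the map step.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println("hello " + reduce("hello", ones));
    }
}
```

In Hadoop, the framework groups the map output by key before calling reduce(), so each call sees every value for exactly one key.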

The Job class


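We normally use a Job object to run a MapReduce job. The Job object specifies how the job is to be executed, and we can use it to control the whole run. To run the job on a Hadoop cluster, the code must be packaged into a JAR file, which Hadoop distributes across the cluster. You do not have to name the JAR file explicitly: pass a class to the Job object's setJarByClass() method, and Hadoop uses that class to locate the JAR file that contains it.

FileInputFormat.addInputPath() and FileOutputFormat.setOutputPath() specify the job's input path and output path. Note that the output path must not exist before the program runs; otherwise Hadoop refuses to run your job.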

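Finally, we call waitForCompletion() to submit the job and wait for it to finish. Its only parameter is a boolean: when it is true, the job prints its progress to the console. The method also returns a boolean that indicates whether the job succeeded.

Programming requirements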

Next, let's consolidate what we have learned about MapReduce with an exercise.

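Given two input files, file1 and file2, write a MapReduce program that merges them and removes duplicate content, producing a new output file file3.
To complete the merge and deduplication task, your program must combine different files that contain duplicate content into a single consolidated file without duplicates. The rules are as follows: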

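1. The first column is sorted by student ID;
2. Records with the same student ID are sorted in the order x, y, z;
3. The input path is /user/tmp/input/;
4. The output path is /user/tmp/output/.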

Note: the input files have already been created for you in the background; you do not need to create them again.

Test description

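The platform tests the code you write:
The test text data has already been specified as input; your program must output the merged and deduplicated result.
A sample of the input files and the output file is given below for reference.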

A sample of input file file1:
20150101 x
20150102 y
20150103 x
20150104 y
20150105 z
20150106 x

A sample of input file file2:
20150101 y
20150102 y
20150103 x
20150104 z
20150105 y

A sample of the output file file3 obtained by merging file1 and file2:

20150101 x
20150101 y
20150102 y
20150103 x
20150104 y
20150104 z
20150105 y
20150105 z
20150106 x
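The sample above can be reproduced without a cluster. The following plain-Java sketch is only a local simulation under stated assumptions, not the MapReduce job itself: the class name `MergeSketch` and the method `merge` are hypothetical, a TreeMap plays the role of grouping and sorting by key, and a TreeSet per student ID removes duplicate labels and keeps them sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class MergeSketch {
    // Groups labels by student ID; TreeMap and TreeSet keep
    // both the IDs and the labels sorted and unique.
    static List<String> merge(List<String> lines) {
        Map<String, TreeSet<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            grouped.computeIfAbsent(parts[0], k -> new TreeSet<>()).add(parts[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, TreeSet<String>> e : grouped.entrySet()) {
            for (String label : e.getValue()) {
                out.add(e.getKey() + " " + label);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines of file1 followed by lines of file2 from the sample.
        List<String> input = new ArrayList<>(Arrays.asList(
                "20150101 x", "20150102 y", "20150103 x",
                "20150104 y", "20150105 z", "20150106 x"));
        input.addAll(Arrays.asList(
                "20150101 y", "20150102 y", "20150103 x",
                "20150104 z", "20150105 y"));
        merge(input).forEach(System.out::println);
    }
}
```

Running main on the two sample inputs prints the nine lines of the file3 sample, in the same order.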

Implementation


import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merges the two input files A and B, removes duplicate content,
 * and produces a new output file C.
 */
public class Merge {

    // Override map() here: split each line into a student ID (output key)
    // and a label (output value). map() must declare
    // throws IOException, InterruptedException.
    /********** Begin **********/
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] data = value.toString().trim().split("\\s+");
            if (data.length < 2) {
                return; // skip blank or malformed lines
            }
            context.write(new Text(data[0]), new Text(data[1]));
        }
    }
    /********** End **********/

    // Override reduce() here: for each student ID, emit every distinct label
    // once, in sorted order. reduce() must declare
    // throws IOException, InterruptedException.
    /********** Begin **********/
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // A TreeSet drops duplicates and keeps the labels sorted.
            Set<String> labels = new TreeSet<>();
            for (Text text : values) {
                labels.add(text.toString());
            }
            for (String label : labels) {
                context.write(key, new Text(label));
            }
        }
    }
    /********** End **********/

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge and deduplicate");
        job.setJarByClass(Merge.class);
        job.setMapperClass(Map.class);
        // Deduplication is idempotent, so the reducer also works as the combiner.
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        String inputPath = "/user/tmp/input/";   // set the input path here
        String outputPath = "/user/tmp/output/"; // set the output path here
        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}