## WordCount.java
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into whitespace-separated tokens
    // and emits a (word, 1) pair for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts received for one word and emits (word, total).
    // The same class is also used as the combiner in main().
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: args[0] is the HDFS input path, args[1] is the HDFS output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
[WordCount.java](https://prod-files-secure.s3.us-west-2.amazonaws.com/bd495344-932e-402f-8a70-bc70825042ed/2d12b7c2-ecaf-46dd-ba56-aebc4ff78fce/WordCount.java)
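
To make the data flow of the job easier to follow, here is a minimal plain-Java sketch of the same three phases (map, shuffle, reduce) run in memory. It is an illustration only and not part of the tutorial's code: the class name `LocalWordCount` and its methods are hypothetical, no Hadoop classes are used, and the "shuffle" is approximated with a sorted `TreeMap`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the word-count data flow.
public class LocalWordCount {

    // "Map" phase: emit a (word, 1) pair for every whitespace-separated token,
    // mirroring what TokenizerMapper does with StringTokenizer.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" phase (approximation): group the emitted values by key.
    // A TreeMap keeps the keys sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values for one key, as IntSumReducer does.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three phases over a list of input lines.
    public static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("the quick fox", "the lazy dog, the end")));
    }
}
```

In this sample, `the` is counted three times, while `dog,` keeps its trailing comma: `StringTokenizer` with its default delimiters splits only on whitespace, so punctuation and letter case are preserved in the keys. Because the reduce step is a plain sum, applying it to partial results first and then to those partial sums gives the same total, which is why the job can reuse `IntSumReducer` as its combiner.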
# **Steps:**
1. Make sure Hadoop and the Java compiler are installed (the Hadoop daemons must also be running for the HDFS steps):
```
hadoop version
javac -version
```
1. Create a folder named **WordCountTutorial** and, inside it, a WordCount.java file containing the code above.
2. Create a new folder, input_data, and add your own text file (input.txt) to it.
3. Add as many words as you can to input.txt.
4. Create a new folder (tutorial_classes) to hold the Java class files.
5. Set the HADOOP_CLASSPATH environment variable:
```
export HADOOP_CLASSPATH=$(hadoop classpath)
```
1. Make sure it has been set correctly:
```
echo $HADOOP_CLASSPATH
```
1. Create a directory on HDFS:
```
# General form: hadoop fs -mkdir <DIRECTORY_NAME>
hadoop fs -mkdir /WordCountTutorial
```
1. Create a directory inside it for the input:
```
# General form: hadoop fs -mkdir <HDFS_INPUT_DIRECTORY>
hadoop fs -mkdir /WordCountTutorial/Input
```
1. Upload the input file to that directory:
```
# General form: hadoop fs -put <INPUT_FILE> <HDFS_INPUT_DIRECTORY>
hadoop fs -put /path/to/input_data/input.txt /WordCountTutorial/Input
```
1. Change the current directory to the tutorial directory:
```
# General form: cd <TUTORIAL_DIRECTORY>
cd /home/sheela/Desktop/WordCountTutorial
```
1. Compile the Java code:
```
# General form: javac -classpath ${HADOOP_CLASSPATH} -d <CLASS_FOLDER> <TUTORIAL_JAVA_FILE>
javac -classpath ${HADOOP_CLASSPATH} -d tutorial_classes WordCount.java
```
1. Three .class files are created in the tutorial_classes folder: WordCount.class, WordCount$TokenizerMapper.class, and WordCount$IntSumReducer.class.
2. Package the class files into one jar file:
```
# General form: jar -cvf <JAR_FILE_NAME> -C <CLASSES_FOLDER> .
jar -cvf firstTutorial.jar -C tutorial_classes/ .
```
1. Run the jar file on Hadoop (the output directory must not already exist, or the job will fail):
```
# General form: hadoop jar <JAR_FILE> <CLASS_NAME> <HDFS_INPUT_DIRECTORY> <HDFS_OUTPUT_DIRECTORY>
hadoop jar firstTutorial.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output
```
1. View the output:
```
# General form: hadoop fs -cat <HDFS_OUTPUT_DIRECTORY>/*
hadoop fs -cat /WordCountTutorial/Output/*
```
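
With the default text output format, each line printed by the command above is a word and its count separated by a tab, sorted by word. As a rough sketch of how that output could be post-processed, the following plain-Java helper picks the most frequent words from such lines. The class name `TopWords` and its methods are hypothetical and not part of the tutorial; the sketch assumes the lines have already been read into a list (for example, from a local copy of the output file).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting word-count output lines of the form "word<TAB>count".
public class TopWords {

    // Parse one output line into a (word, count) pair.
    public static Map.Entry<String, Integer> parseLine(String line) {
        int tab = line.lastIndexOf('\t');
        String word = line.substring(0, tab);
        int count = Integer.parseInt(line.substring(tab + 1).trim());
        return Map.entry(word, count);
    }

    // Return the n most frequent words, highest count first;
    // ties are broken alphabetically by word.
    public static List<Map.Entry<String, Integer>> top(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Sample lines in the same shape as the job's output.
        List<String> sample = List.of("dog\t1", "fox\t2", "the\t3");
        System.out.println(top(sample, 2));
    }
}
```

Sorting on the client side like this is only practical for small outputs; for large results, a second MapReduce job that sorts by count would be the more scalable choice.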