Hadoop MapReduce FileCounting

This file-counting program prints each word (case-insensitive, with special characters and numbers stripped) together with the name of each file it appears in, the number of times it occurs in that file, and the total number of files that contain the word.
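To make the word normalization concrete, here is a minimal sketch in plain Java with no Hadoop dependencies. The class name `WordNormalizer`, the method `tokenize`, and the rule of dropping every character that is not a letter are assumptions made for illustration; the actual FileCounting.java may treat punctuation and digits differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not taken from FileCounting.java.
public class WordNormalizer {

    // Lowercases each whitespace-separated token, strips every character
    // that is not a-z, and returns the non-empty results in input order.
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            String cleaned = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Contents of file03 from this project.
        System.out.println(tokenize("Hello hello hello HELLO hello! 12345"));
        // prints [hello, hello, hello, hello, hello]
    }
}
```

Under this assumed rule, "hello!" counts as "hello" and "12345" is discarded entirely, which is consistent with a count of 5 for "hello" in file03.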

Local (Standalone) Mode

Setting up Hadoop

A Linux system or virtual machine.
Java must be installed on the system.
ssh, sshd and rsync must be installed.

Install Hadoop

Set the path variables with the following commands:

echo $JAVA_HOME
echo $HADOOP_CLASSPATH
# get default Java path
readlink -f /usr/bin/java | sed "s:bin/java::"
# Output for me was:
# /usr/lib/jvm/java-11-openjdk-amd64/
# Then add these lines to etc/hadoop/hadoop-env.sh or set them in the terminal
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar

Then try the following command:

$ bin/hadoop

Configuration: open the following files and edit them:

$ gedit etc/hadoop/core-site.xml
$ gedit etc/hadoop/hdfs-site.xml

In core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

In hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Check ssh to localhost

$ ssh localhost

Format the filesystem

$ bin/hdfs namenode -format

Run the daemons (This part was tricky for me)

$ sbin/start-dfs.sh
# The daemons can be stopped by typing the following command
$ sbin/stop-dfs.sh

Setting up and running a simple Hadoop job

Execute the program with the following commands:

$ bin/hadoop com.sun.tools.javac.Main FileCounting.java
$ jar cf filecounting.jar FileCounting*.class
# Copy input/file01, input/file02, and input/file03 from this project into the input folder of the Hadoop distribution folder.
$ bin/hadoop jar filecounting.jar FileCounting /user/hadoop/filecount/input /user/hadoop/filecount/output
# And finally to see the output, run the below command:
$ bin/hadoop fs -cat /user/hadoop/filecount/output/part*

This simple Hadoop job reads three text files from the "input" folder; however, you can supply more or fewer.

But first you need to create an input folder in HDFS:

bin/hdfs dfs -mkdir -p /user/hadoop/filecount/input

Then copy the files from their local directory (for example, a folder on your desktop) to the Hadoop input directory:

bin/hadoop fs -copyFromLocal ../filecounter/input/file0* /user/hadoop/filecount/input

The following commands print the contents of each input file to the terminal:

# file01
$ bin/hadoop fs -cat /user/hadoop/filecount/input/file01
Hello World BYE World

# file02
$ bin/hadoop fs -cat /user/hadoop/filecount/input/file02
Hello World Hadoop Goodbye World HADOOP

# file03
$ bin/hadoop fs -cat /user/hadoop/filecount/input/file03
Hello hello hello HELLO hello! 12345

When the job is submitted, the Java code's mapper class pairs each word read from the input with the file it appears in (the file name is obtained in the setup method, which uses FileSplit). A semi-reducer class then counts the number of times each word appears in each particular file. Finally, the Reduce step appends to each word's string a counter value showing the total # of files the word appears in. It generates the output below:

(The # in the output is literal text, not a comment, even though GitHub may render it like one.)

bye    file01: 1,  The total # of files this word appears in is: 1
goodbye    file02: 1,  The total # of files this word appears in is: 1
hadoop    file02: 2,  The total # of files this word appears in is: 1
hello    file02: 1, file03: 5, file01: 1,  The total # of files this word appears in is: 3
world    file01: 2, file02: 2,  The total # of files this word appears in is: 2
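The three stages described above can be sketched in plain Java without Hadoop. This is only an in-memory illustration under stated assumptions: the class `FileCountSketch`, its method `summarize`, the letter-only normalization, and the four-space separator are hypothetical and are not taken from FileCounting.java. Files are listed in sorted order here, whereas the order in the real job output depends on how Hadoop delivers the values to the reducer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the job's stages; not the real Hadoop code.
public class FileCountSketch {

    // files maps a file name to its text; returns one output line per word.
    public static List<String> summarize(Map<String, String> files) {
        // "Map" and "semi-reduce": word -> (file name -> occurrences in that file).
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String token : file.getValue().split("\\s+")) {
                String word = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
                if (word.isEmpty()) {
                    continue; // token held only digits or punctuation
                }
                counts.computeIfAbsent(word, w -> new TreeMap<>())
                      .merge(file.getKey(), 1, Integer::sum);
            }
        }
        // "Reduce": join the per-file counts and append the number of files.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append("    ");
            for (Map.Entry<String, Integer> perFile : e.getValue().entrySet()) {
                sb.append(perFile.getKey()).append(": ")
                  .append(perFile.getValue()).append(", ");
            }
            sb.append(" The total # of files this word appears in is: ")
              .append(e.getValue().size());
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file01", "Hello World BYE World");
        files.put("file02", "Hello World Hadoop Goodbye World HADOOP");
        files.put("file03", "Hello hello hello HELLO hello! 12345");
        summarize(files).forEach(System.out::println);
    }
}
```

Run against the three sample files, this sketch yields the same words and counts as the job output above; only the order of the files within the "hello" line differs.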

When rerunning this Hadoop code, make sure you go into the hadoop folder and delete the jar file and anything else generated for the program, such as the '.class' files.

# Remove the output before re-running the code.
$ bin/hadoop fs -rm /user/hadoop/filecount/output/*
$ bin/hadoop fs -rmdir /user/hadoop/filecount/output
