Lab – Working with MapReduce

We will write a MapReduce program from a program template that has a mapper, a reducer, and a main function. You only need to change these modules to suit your data.

In this lab we will work on the bigdatafile-dihub.txt file, which contains a large set of numbers from 0 to 99. Our objective is to count how many times each number occurs in this document.

Note: When counting occurrences we treat each number as a word, so the rest of this document refers to the problem as a word count problem.

Note: We will use the bigdatafile-dihub.txt file on HDFS as the input file.

1. Let’s check whether the file exists on HDFS by using the command hdfs dfs -ls /

2. Go to the Eclipse IDE and create a new project. Create a package named hadoop and a Java program named Wordcount.java

3. You need to add the MapReduce jar files to your Eclipse project to use the MapReduce packages. As we are working with Hadoop 2.8.4, the jar files needed for MapReduce programming are:

· hadoop-common-2.8.4.jar

· hadoop-hdfs-2.8.4.jar

· hadoop-mapreduce-client-core-2.8.4.jar

4. Right-click on the package, select Build Path, and then Configure Build Path

5. Now click on the Libraries tab and then on Add External JARs.

6. Now browse to the mentioned files in the Downloads directory, select them, and click on Finish.

7. Write the code for the Wordcount MapReduce program: import the dependencies and write the mapper function

Note: You can find the complete program code in the javaMR directory inside the Documents directory of your user.
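To illustrate what the mapper in step 7 does, here is a minimal sketch of the map logic in plain Java, with no Hadoop classes, so it can be run on its own. The class name MapStepSketch, the sample line, and the assumption that the numbers in the input file are separated by whitespace are hypothetical and not taken from the lab files. In the real program the mapper would extend Hadoop's Mapper class and write its pairs to the job context, and the code in the javaMR directory may differ from this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are used.
class MapStepSketch {

    // Splits one input line on whitespace (an assumption about the file format)
    // and emits a (word, 1) pair for every token found on the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the real input is read from HDFS.
        for (Map.Entry<String, Integer> pair : map("7 42 7 99")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Note that the mapper does no counting itself: a word that appears twice on a line is simply emitted twice with the value 1, and the totals are left to the reducer.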

8. Write the corresponding reducer function and configure the job to run the program
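The reducer side of step 8 can be sketched in the same way. The example below is plain Java without Hadoop classes, and it is an illustration under assumptions, not the lab's actual code: the class name ReduceStepSketch, the method names, and the sample keys are hypothetical. It groups the 1s by key (standing in for the grouping the framework performs between the map and reduce phases) and then sums each group, which is the work a word-count reducer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the grouping and reduce steps; no Hadoop classes are used.
class ReduceStepSketch {

    // Sums the values that were grouped under a single key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Groups a 1 under each key emitted by the mapper, then reduces each group.
    // A TreeMap keeps the keys in sorted (string) order.
    static Map<String, Integer> shuffleAndReduce(List<String> mappedKeys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : mappedKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical keys, as a mapper might emit them for the line "7 42 7 99".
        Map<String, Integer> counts = shuffleAndReduce(Arrays.asList("7", "42", "7", "99"));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

In the real job, the main function would additionally tell Hadoop which mapper and reducer classes to use and where the input and output paths are; that configuration is not shown here.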

9. Now create a jar file for this program by following these steps:

10. Right-click on the program in the Project Explorer window of Eclipse and choose Export

11. Select JAR file under the Java option, click Next, and give the path where you wish to save the jar file.

12. Click Next and then Finish to create the jar. Once the jar has been created successfully, you just need to run it in the Hadoop environment.

13. Now run this jar on Hadoop from the terminal. Jar files are run with the hadoop jar command. Give the jar file name, then the fully qualified class name (package.class), followed by the input file path and the output file path on HDFS:

hadoop jar wc.jar hadoop.Wordcount /bigdatafile-dihub.txt /wc-output

Note: Once the program execution is over, you will find your result in the output directory you specified on HDFS.

14. Accessing results from HDFS: our result is stored in the output directory on HDFS, which we named wc-output. You can check its contents in the GUI

15. To see the content of this result file, you can download it to the local machine and open it with any text editor, or you can check the content with a terminal command:

hdfs dfs -cat /wc-output/part-r-00000

As per our reducer function, the output shows each word followed by its number of occurrences in the document.

The result is shown in lexicographical order, so the keys are compared as text and not as numbers (for example, 10 is listed before 2). In the next lab exercise we will write a program to see the results in numerical order.
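The difference between the two orderings can be seen with a small plain-Java sketch. This is only an illustration of how string keys sort compared with numbers; it is not the approach the next lab will take, and the class name KeyOrderingSketch and the sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Compares the ordering of keys as text with their ordering as numbers.
class KeyOrderingSketch {

    // Sorts the keys as strings, character by character.
    static List<String> lexicographic(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Sorts the same keys by their integer value.
    static List<String> numeric(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingInt(Integer::parseInt));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical sample of keys from the 0 to 99 range.
        List<String> keys = Arrays.asList("0", "1", "2", "10", "11", "99");
        System.out.println(lexicographic(keys)); // prints [0, 1, 10, 11, 2, 99]
        System.out.println(numeric(keys));       // prints [0, 1, 2, 10, 11, 99]
    }
}
```

As text, "10" and "11" come before "2" because the comparison looks at the first character first, and the character 1 sorts before the character 2.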