Working with MapReduce

MapReduce works on key-value pairs over distributed datasets. The best part about MapReduce processing is that it can be scaled out to multiple nodes running on commodity hardware.

MapReduce execution is quite different from conventional data processing, where data is moved from its storage location to the processing unit.

In the MapReduce execution model, the processing component (the code) is instead moved towards the data. The Hadoop framework does this by itself, without any explicit action from the user.

Because the code is moved to wherever the data resides, the storage nodes and the compute nodes are the same machines in MapReduce. This gives a significant performance improvement, especially for large datasets stored across a distributed system.

For a MapReduce program, you need to code a mapper that handles all the data lines spread over the datanodes for the specified input directory or file. By default, a MapReduce program works on the lines of the input files in parallel: whatever we program in the mapper is applied to each line of the file in parallel.

You need to specify an input key-value pair type as well as an output key-value pair type for the mapper. The output of the mapper serves as intermediate key-value pairs, which are then fed into the reducer for further processing.

Once mapper processing is over, each mapper writes its result to local disk. You then need to program a reducer that takes the input from all the mappers, processes it to finalize the result, and returns it to the application.

Finally, you need to code a main (driver) module that calls all the modules with their key-value types and input/output configuration. Key-value type configuration plays a vital role in MapReduce job execution.

A mapper converts input key-value pairs into intermediate key-value pairs. A reducer takes these intermediate key-value pairs and produces the final key-value pairs as the result.
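
For example, in the word-count program developed below, the key-value pairs for a single (purely illustrative) input line flow roughly like this:

    input to mapper     : (0, "to be or not to be")        (byte offset of the line, line text)
    mapper output       : (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    grouped for reducer : (to,[1,1]) (be,[1,1]) (or,[1]) (not,[1])
    reducer output      : (to,2) (be,2) (or,1) (not,1)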

In MapReduce, all modules share and process key-value pairs. Because key-value pairs may need to travel between cluster nodes, the key and value classes must be serializable. For this purpose MapReduce provides the Writable interface; otherwise the programmer would have to rely on Java's own serialization support, which is a time- and resource-consuming effort.

So, all value types in MapReduce need to implement the Writable interface. Additionally, the key class needs to implement the WritableComparable interface to facilitate sorting on keys. All values belonging to the same key are grouped together to form a single entry in the output of a MapReduce program. Key-value pairs play a vital role in MapReduce execution.
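
To illustrate these two interfaces, here is a minimal sketch of a hypothetical custom key class; the class name and fields are invented for this example and are not part of the word-count program discussed below:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // A hypothetical composite key: it must be serializable (write/readFields)
    // and sortable (compareTo) before it can be used as a MapReduce key.
    public class YearMonthKey implements WritableComparable<YearMonthKey> {

        private int year;
        private int month;

        public YearMonthKey() { }              // no-arg constructor needed for deserialization

        public YearMonthKey(int year, int month) {
            this.year = year;
            this.month = month;
        }

        @Override
        public void write(DataOutput out) throws IOException {    // serialize
            out.writeInt(year);
            out.writeInt(month);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialize
            year = in.readInt();
            month = in.readInt();
        }

        @Override
        public int compareTo(YearMonthKey other) {                 // defines the sort order of keys
            int cmp = Integer.compare(year, other.year);
            return (cmp != 0) ? cmp : Integer.compare(month, other.month);
        }

        @Override
        public String toString() {
            return year + "-" + month;
        }
    }

A real key class would normally also override hashCode() and equals() so that the default hash partitioner sends equal keys to the same reducer. For built-in types such as Text, IntWritable, or LongWritable this work is already done, which is why they can be used as keys directly.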

The output key-value pair types of the mapper and the reducer may be totally different from each other. However, the output of the mapper is always taken, as it is, as the input to the reducer.

Commonly used MapReduce key types that implement the WritableComparable interface are:

Object (often declared as the mapper's input key type in example code, although Object itself does not implement WritableComparable)
LongWritable
IntWritable
Text
BytesWritable
FloatWritable
DoubleWritable
ShortWritable
NullWritable

To work with these MapReduce types, we need to add the MapReduce-specific jar files to our project while coding in any IDE. We shall discuss the required jar files in the lab document.

Let's take an example using Java code (which we will try hands-on during lab time): writing a mapper class named WordcountMapper that extends the Mapper class of the MapReduce framework.

Now we need to specify the input and output key-value types for the mapper. In this program we take the input key to be Object, the input value to be Text, the output key of the mapper to be Text, and the output value of the mapper to be IntWritable.

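A minimal sketch of such a mapper is shown below. It follows the standard Hadoop WordCount pattern, so the tokenizing details are indicative of what we will write in the lab rather than the exact listing:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper<input key, input value, output key, output value>
    public class WordcountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the incoming line into words and emit (word, 1) for each word
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

The map() method is called once for every line of input, and for each word in that line it emits an intermediate pair of the word and the count 1.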

In the program we also write our reducer, named WordcountReducer, which extends the Reducer class and has input and output key-value types of Text and IntWritable; these input types are the same as the output key-value types of the mapper.
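
A corresponding sketch of the reducer might look like this (again following the standard WordCount pattern):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reducer<input key, input value, output key, output value>
    public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted by the mappers for this key (word)
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

For each key, the framework hands the reducer the full set of values emitted by the mappers for that key, and the reducer sums them into the final count.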

In the main function we need to create the job configuration by specifying the mapper, the reducer, the output key-value classes, and some other internal components such as the combiner, which we will discuss in detail in the advanced MapReduce course.
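
As a sketch, the driver for the word-count job could be wired up as follows; the driver class name WordCount and the use of the reducer as the combiner are assumptions made for this example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordcountMapper.class);
            // Optional combiner: the reducer doubles as a combiner here
            job.setCombinerClass(WordcountReducer.class);
            job.setReducerClass(WordcountReducer.class);

            // Output key-value classes of the job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output paths are taken from the command line
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Reusing the reducer as the combiner is safe here because summing counts is associative and commutative; for other jobs the combiner must be chosen, or omitted, accordingly.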