MapReduce is a batch processing framework which is designed to works in parallel on huge amount of data distributed over the cluster.
It consists of two phases named “Mapper” and “Reducer” which works on the concept of key-value pair. For any data which you want to be processed in MapReduce model, needs to have well-defined key and value items.
In Hadoop 2.x we have Mapreduce 2.0 that is also known as YARN.
We need to understand MapReduce 1.0 and MapReduce 2.0 separately.
MapReduce 1.0 has two main components named jobtracker and tasktracker which are responsible for all aspects of job execution.
When client sends any request to execute a mapreduce program, a job tracker is started at Namenode which works as a coordinator for all task trackers which works on different datanodes where the actual file blocks reside.
MapReduce program works on small splits of a large data input file residing on different data nodes which are processed by mapper in completely parallel manner.
Mapper runs on datanodes in parallel and does the local processing on data residing on those nodes.
Mapper generally contains the code like filter, conversion and selection. Once the mapper’s task is complete at datanode, it writes its results to the local file system.
The tasktracker will be in constant communication with jobtracker to trace and give the progress of any task executed, delayed, or failed at the datanode.
Once the mapper is complete, reducers starts its work according to user specifications and returns the results to client in desired format.
MapReduce also handles component failure and recovery to deal with any unexpected situation. As jobtracker keeps track of all tasktracker’s status – if any one of these goes down or responds too slow – jobtracker starts a replica process on any other node having the same data block.
Now the one tasktracker which responds first to the jobtracker with results is being considered as active process. Thus, as a whole, the client program is unaware and unaffected of any failure at task level.
In Hadoop 2.x we have MapReduce 2.0 which consists of two components YARN and MapReduce.
As a major improvement over its previous version, MapReduce 2.0 focuses on keeping cluster management facilities separate from mapreduce logic. The cluster management activities are now governed by YARN while mapreduce job execution is kept as a separate component in MapReduce 2.0
If we compare mapreduce-2.0 directly with mapreduce-1.0 architecture, we find that
Jobtracker is replaced by Resource Manager and Application Master components
Nodemanager is the replacement of tasktracker.