Tuesday, 13 September 2016

Why is YARN Needed?

Hadoop 1.0 Limitations

Prerequisite:
This article will be easier to follow if you have basic knowledge of Hadoop and MapReduce.


  1. It limits scalability: The JobTracker runs on a single machine and performs several tasks, such as
    • Resource management
    • Job and task scheduling and
    • Monitoring
    Even though many machines (DataNodes) are available in the cluster, they are not used for these duties; everything goes through one JobTracker, and this limits scalability.
  2. Availability issue: In Hadoop 1.0, the JobTracker is a single point of failure (SPOF). If the JobTracker fails, all running jobs must be restarted.
  3. Problem with resource utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Utilization problems occur because the map slots might be ‘full’ while the reduce slots sit empty (and vice versa). Compute resources (DataNodes) reserved for reduce slots can sit idle even when there is an immediate need to use them as map slots (a configuration sketch follows this list).
  4. Limitation in running non-MapReduce applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce, so only applications that follow the MapReduce programming model can run on Hadoop.
    Let’s try to understand point 4 in more detail. 
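Before that, to make point 3 concrete, here is a minimal sketch of the fixed slot counts in Hadoop 1.x. The two property names are the real Hadoop 1.x settings (normally placed in mapred-site.xml on every TaskTracker); the small Java wrapper around them is only for illustration.

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfigExample {
        public static void main(String[] args) {
            // In Hadoop 1.x every TaskTracker advertises a FIXED number of slots,
            // normally set in mapred-site.xml. Shown programmatically here for illustration.
            Configuration conf = new Configuration();

            // 4 slots reserved for map tasks on this TaskTracker
            conf.set("mapred.tasktracker.map.tasks.maximum", "4");
            // 2 slots reserved for reduce tasks on this TaskTracker
            conf.set("mapred.tasktracker.reduce.tasks.maximum", "2");

            // The split is static: even if all 4 map slots are busy and the 2 reduce
            // slots are idle, the idle reduce slots can never be used to run mappers.
            System.out.println("Map slots    = " + conf.get("mapred.tasktracker.map.tasks.maximum"));
            System.out.println("Reduce slots = " + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
        }
    }

Because these numbers are fixed per TaskTracker, an idle reduce slot can never be borrowed by a waiting map task, which is exactly the utilization problem described in point 3.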

Introduction of the new YARN layer in Hadoop 2.0:

YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0.
Let’s have a look at how the Hadoop architecture has changed from Hadoop 1.0 to Hadoop 2.0.


As shown, in Hadoop 2.0 a new layer has been introduced between HDFS and MapReduce. This is the YARN framework, which is responsible for cluster resource management.

"The fundamental idea behind YARN is split the functionality of Job Tracker and Task Tracker into separate Entities."


Cluster Resource Management:
Cluster resource management means managing the resources of the Hadoop cluster, and by resources we mean memory, CPU, and so on.

YARN took over the task of cluster resource management from MapReduce, and MapReduce was streamlined to do only what it does best: data processing.
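To illustrate this separation, here is a minimal sketch that talks to YARN directly through the org.apache.hadoop.yarn.client.api.YarnClient API: it connects to the ResourceManager and asks for a new application id, which is the first step any framework (MapReduce, Spark, Tez, ...) performs. The class name and printed message are made up for illustration, and the ApplicationMaster side is omitted.

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnHelloWorld {
        public static void main(String[] args) throws Exception {
            // YARN, not MapReduce, is now the cluster resource manager.
            YarnConfiguration conf = new YarnConfiguration();

            // Client-side handle to the ResourceManager.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Ask the ResourceManager to register a brand-new application.
            // Any framework goes through this same door, not just MapReduce.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
            System.out.println("Got application id from ResourceManager: " + appId);

            yarnClient.stop();
        }
    }

Because resource negotiation now happens through this generic client/ResourceManager protocol, non-MapReduce applications can run on the same cluster, which removes limitation 4 above.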







Cluster Resource Management in Hadoop 1.0:

In Hadoop 1.0, there is tight coupling between cluster resource management and the MapReduce programming model: the JobTracker, which does resource management, is part of the MapReduce framework.





In the MapReduce framework, a MapReduce job (a MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the machines (DataNodes) of the cluster, and each machine has a limited number of predefined slots (map slots, reduce slots) for running tasks concurrently.

Here, the JobTracker is responsible for both managing the cluster's resources and driving the execution of the MapReduce job. It reserves and schedules slots for all tasks; configures, runs, and monitors each task; and, if a task fails, it allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task's slot so it is available for other jobs.
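From the developer's side the flow looks like the sketch below: a job driver describes the job and submits it, and in Hadoop 1.x everything after that (scheduling tasks into slots, monitoring them, retrying failures) is handled by the JobTracker. The driver class name and the input/output arguments are illustrative; the mapper and reducer are the library TokenCounterMapper and IntSumReducer classes shipped with Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // Describe the job; in Hadoop 1.x this description is handed to the
            // JobTracker, which schedules the map and reduce tasks into the
            // TaskTrackers' fixed slots.
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // Library mapper/reducer: tokenize lines, then sum the counts per word.
            job.setMapperClass(TokenCounterMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit and wait: scheduling, monitoring and re-running failed tasks
            // is the JobTracker's responsibility in Hadoop 1.x.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }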




Hadoop 2.0 Cluster Architecture

Hadoop 2.0 Cluster Architecture Federation


1)  In Hadoop 1.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if there was a failure at that point, the cluster would be unavailable until the NameNode was either restarted or brought up on a separate machine.



2) In Hadoop 2.0, the ResourceManager takes over the role of the JobTracker, and the NodeManager takes over the role of the TaskTracker.
3) The HDFS High Availability feature in Hadoop 2.0 addresses the above problem by allowing two redundant NameNodes to run in the same cluster. This allows a fast failover to a new NameNode in case of a failure.
4) There are two NameNodes in HA: the Active NameNode and the Standby NameNode. The DataNodes send block reports to both NameNodes. Any change is recorded in the shared edit log; only the Active NameNode writes to the edit log, while the Standby NameNode periodically reads it to stay in sync. Ensuring that only one NameNode can write at a time is the idea behind fencing. In case of a failure of the Active NameNode, the Standby NameNode takes over and becomes the Active NameNode.
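For reference, here is a sketch of the standard HDFS HA settings, shown as Java Configuration calls; in practice they live in hdfs-site.xml. The nameservice id "mycluster", the NameNode ids "nn1"/"nn2", and all host names are placeholders chosen for illustration.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsHaConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // One logical nameservice backed by two NameNodes (Active + Standby).
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

            // Shared edit log: only the Active NameNode writes, the Standby only reads.
            conf.set("dfs.namenode.shared.edits.dir",
                     "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");

            // Fencing makes sure a failed Active NameNode can no longer write.
            conf.set("dfs.ha.fencing.methods", "sshfence");

            // Clients use this proxy provider to fail over between nn1 and nn2.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            System.out.println("HA nameservice: " + conf.get("dfs.nameservices"));
        }
    }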