Hadoop 1.0 Limitations
Prerequisite:
You can understand this article better if you have basic knowledge of Hadoop and MapReduce.
Hadoop 1.0 has the following limitations:
1. Limited scalability: The JobTracker runs on a single machine and performs several tasks, such as:
- Resource management
- Job and task scheduling
- Monitoring
2. Availability issue: In Hadoop 1.0, the JobTracker is a single point of failure (SPOF). This means that if the JobTracker fails, all running jobs must be restarted.
3. Problem with resource utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Resource utilization suffers because map slots might be ‘full’ while reduce slots sit empty (and vice versa). Compute resources on a DataNode that are reserved as reduce slots can sit idle even when there is an immediate need to use them as map slots (see the configuration sketch after this list).
4. Limitation in running non-MapReduce applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce, so only applications that obey the MapReduce programming model can run on Hadoop.
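To make the fixed-slot model from point 3 concrete, here is a minimal Java sketch. It assumes the Hadoop 1.x client library on the classpath; the slot counts and class name are illustrative, and in a real cluster these properties live in mapred-site.xml on each TaskTracker node.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative sketch of Hadoop 1.x fixed slots; values are examples only.
    public class SlotSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Normally set in mapred-site.xml on every TaskTracker node.
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // fixed map slots per node
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // fixed reduce slots per node
            // A map-heavy job cannot borrow the idle reduce slots, and a reduce-heavy
            // job cannot borrow idle map slots: the split is static.
            System.out.println("Map slots:    " + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
            System.out.println("Reduce slots: " + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
        }
    }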
Let’s try to understand point 4 in more detail.
Introduction of the new YARN layer in Hadoop 2.0:
YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0. Let’s have a look at how the Hadoop architecture has changed from Hadoop 1.0 to Hadoop 2.0.
As shown, in Hadoop 2.0 a new layer has been introduced between HDFS and MapReduce. This is the YARN framework, which is responsible for cluster resource management.
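As a small illustration of MapReduce now running on top of YARN, here is a minimal Java sketch (assuming Hadoop 2.x client libraries; the class name and job name are placeholders). Setting the mapreduce.framework.name property to "yarn" makes the MapReduce client submit jobs to YARN rather than to a JobTracker.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch only: mapper, reducer and I/O paths are omitted.
    public class SubmitOnYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "yarn"); // route submission through YARN
            Job job = Job.getInstance(conf, "wordcount-on-yarn"); // illustrative job name
            // ... configure mapper, reducer, input/output paths as usual, then:
            // job.waitForCompletion(true);
        }
    }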
"The fundamental idea behind YARN is split the functionality of Job Tracker and Task Tracker into separate Entities."
Cluster Resource Management:
Cluster resource management means managing the resources of the Hadoop cluster, and by resources we mean memory, CPU, etc.
YARN took over this task of cluster resource management from MapReduce, and MapReduce is now streamlined to perform only data processing, which is what it does best.
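To give a rough sense of what these resources look like under YARN, here is a minimal Java sketch (assuming Hadoop 2.x; the class name and values are illustrative, and in a real cluster these properties belong in yarn-site.xml). Each node advertises memory and CPU to YARN instead of fixed map and reduce slots.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // Illustrative per-node resource settings managed by YARN; values are examples only.
    public class NodeResources {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();
            conf.setInt("yarn.nodemanager.resource.memory-mb", 8192); // memory available to containers
            conf.setInt("yarn.nodemanager.resource.cpu-vcores", 4);   // virtual cores available to containers
            System.out.println("Node memory (MB): " + conf.getInt("yarn.nodemanager.resource.memory-mb", 0));
            System.out.println("Node vcores:      " + conf.getInt("yarn.nodemanager.resource.cpu-vcores", 0));
        }
    }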
Cluster Resource Management in Hadoop 1.0:
In Hadoop 1.0, there is tight coupling between cluster resource management and the MapReduce programming model.
The JobTracker, which does resource management, is part of the MapReduce framework.
In the MapReduce framework, a MapReduce job (MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the machines (DataNodes) of the cluster, and each machine has a limited number of predefined slots (map slots, reduce slots) for running tasks concurrently.
Here, the JobTracker is responsible for both managing the cluster's resources and driving the execution of the MapReduce job. It reserves and schedules slots for all tasks, configures, runs and monitors each task, and if a task fails, it allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task's slot to make it available for other jobs.
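To ground this description, here is a minimal sketch of job submission with the classic Hadoop 1.x "mapred" API (the class name, job name and paths are placeholders, and the mapper/reducer classes are left out). JobClient hands the whole job to the JobTracker, which then schedules every task into the cluster's fixed slots and monitors it.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch of the old (Hadoop 1.x) API; the JobTracker drives everything after runJob().
    public class ClassicJobSubmission {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ClassicJobSubmission.class);
            conf.setJobName("classic-wordcount");   // illustrative name
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            // conf.setMapperClass(...); conf.setReducerClass(...);  // user-supplied classes
            conf.setNumReduceTasks(2);              // tasks the JobTracker must fit into reduce slots
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // The JobTracker reserves slots, launches tasks on TaskTrackers,
            // monitors them, and re-runs failed tasks in new slots.
            JobClient.runJob(conf);
        }
    }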