Tuesday, September 10, 2013

Important things about the TaskTracker and mapred-site.xml configuration

 
Tasks in TaskTracker: 
 
    For each input split, a map task is created that runs the user-supplied map function on
each record in the split. Map tasks are executed in parallel. This means each chunk of
the input dataset is being processed at the same time by various machines that make
up the cluster. It’s fine if there are more map tasks to execute than the cluster can handle.
They’re simply queued and executed in whatever order the framework deems best.
The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs.
The input format is responsible for turning each record into its key-value pair representation.
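
To make that contract concrete, here is a minimal sketch in plain Python rather than the Hadoop Java API. The names text_input_format and word_count_map are invented for this example; using the byte offset of each line as the key imitates what Hadoop's TextInputFormat does for text files. Note that the blank line produces zero intermediate pairs.

```python
def text_input_format(split):
    """Turn each line of a split into a (byte offset, line) key-value pair."""
    offset = 0
    for line in split.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def word_count_map(key, value):
    """Hypothetical user-supplied map function: emit (word, 1) per word."""
    for word in value.split():
        yield word.lower(), 1


# One small input split with three records (the second one is empty).
split = "the quick fox\n\nthe end\n"

# Run the map function once per record and collect the intermediate pairs.
intermediate = [
    pair
    for key, value in text_input_format(split)
    for pair in word_count_map(key, value)
]
print(intermediate)
```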
 
 
 
There is always a single tasktracker on each worker node.
Both tasktrackers and datanodes run on the same machines, which makes each node
both a compute node and a storage node, respectively. Each tasktracker is configured
with a specific number of map and reduce task slots that indicate how many of each
type of task it is capable of executing in parallel. A task slot is exactly what it sounds
like; it is an allocation of available resources on a worker node to which a task may be
assigned, in which case it is executed. A tasktracker executes some number of map
tasks and reduce tasks in parallel, so there is concurrency both within a worker where
many tasks run, and at the cluster level where many workers exist. Map and reduce
slots are configured separately because they consume resources differently.
It is common for tasktrackers to allow more map tasks than reduce tasks to execute in parallel.
Upon receiving a task assignment from the jobtracker, the tasktracker executes an
attempt of the task in a separate process.
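
The slot arithmetic can be sketched in a few lines of Python. The cluster size (10 workers) and the job size (100 map tasks) are made-up numbers, and the function names are hypothetical. The "waves" figure is only a lower bound that assumes every task takes about the same time; the real order and timing depend on the scheduler.

```python
import math


def cluster_map_capacity(worker_nodes, map_slots_per_tasktracker):
    """Map tasks the whole cluster can run at once (one tasktracker per node)."""
    return worker_nodes * map_slots_per_tasktracker


def minimum_waves(map_tasks, capacity):
    """Fewest rounds needed if tasks beyond the capacity wait in the queue."""
    return math.ceil(map_tasks / capacity)


# Hypothetical cluster: 10 worker nodes, 4 map slots on each tasktracker.
capacity = cluster_map_capacity(worker_nodes=10, map_slots_per_tasktracker=4)

# Hypothetical job: 100 input splits, so 100 map tasks.
waves = minimum_waves(map_tasks=100, capacity=capacity)

print(capacity, waves)
```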
 
Difference between Task and Task attempt (Task instance):
 
- A task is the logical unit of work, while a task attempt is a specific, physical instance 
of that task being executed.
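
One way to picture the distinction is a small Python model in which a task owns a list of attempts. This is only an illustration: the class names, the worker names, and the ID format are simplified inventions, not Hadoop's real identifiers. It relies on the fact that a task can be attempted more than once, for example when an earlier attempt fails.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAttempt:
    """One physical execution of a task on a particular tasktracker."""
    attempt_id: str
    tasktracker: str
    state: str = "RUNNING"  # RUNNING, SUCCEEDED, or FAILED


@dataclass
class Task:
    """The logical unit of work; it may need several attempts."""
    task_id: str
    attempts: list = field(default_factory=list)

    def launch_attempt(self, tasktracker):
        # Simplified ID scheme for this sketch only.
        attempt_id = f"{self.task_id}_attempt_{len(self.attempts)}"
        attempt = TaskAttempt(attempt_id, tasktracker)
        self.attempts.append(attempt)
        return attempt

    def succeeded(self):
        # The task is done as soon as any one attempt succeeds.
        return any(a.state == "SUCCEEDED" for a in self.attempts)


task = Task("map_000003")
first = task.launch_attempt("worker-07")
first.state = "FAILED"          # the first attempt dies
second = task.launch_attempt("worker-12")
second.state = "SUCCEEDED"      # the retry finishes the same task

print(len(task.attempts), task.succeeded())
```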
 


mapred-site.xml
 
<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>
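
As a quick sanity check, the two slot settings can be read back with Python's standard xml.etree.ElementTree module. This is a sketch, not part of Hadoop: the helper read_properties is hypothetical, and the properties are wrapped in a <configuration> root element, as they are in a complete mapred-site.xml file.

```python
import xml.etree.ElementTree as ET

# The two properties from above, wrapped in the <configuration> root element.
MAPRED_SITE = """<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the file."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = read_properties(MAPRED_SITE)
map_slots = int(props["mapred.tasktracker.map.tasks.maximum"])
reduce_slots = int(props["mapred.tasktracker.reduce.tasks.maximum"])

# Upper bound on tasks one tasktracker runs at the same time.
max_concurrent_tasks = map_slots + reduce_slots
print(map_slots, reduce_slots, max_concurrent_tasks)
```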
