Once the Apache Hadoop installation is complete and you are able to run HDFS commands, the next step is to configure YARN on the Hadoop cluster. This post explains how to set up the YARN master on the cluster and run a MapReduce example.

Before you proceed with this document, make sure you have completed the Apache Hadoop installation and that the Hadoop cluster is up and running.

YARN comes with the Hadoop distribution by default, so there is no additional installation; you only need to configure Hadoop to use YARN and set a few memory/core settings.
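
Since YARN ships with the Hadoop distribution, you can run a quick sanity check before changing any configuration (the version reported will vary with your installation):

yarn version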

1. Configure yarn-site.xml

In the yarn-site.xml file, configure the default NodeManager memory and the YARN scheduler minimum and maximum memory allocations.

<!-- Total memory (in MB) on each NodeManager that is available to containers -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<!-- Largest single container the scheduler will allocate (in MB) -->
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<!-- Smallest single container the scheduler will allocate (in MB) -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<!-- Disable the virtual memory check so containers are not killed on low-memory nodes -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
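
Note that these properties, like all Hadoop configuration, must be placed inside the <configuration> root element of yarn-site.xml (typically located under $HADOOP_HOME/etc/hadoop). A minimal skeleton looks like this:

<configuration>
    <!-- memory and scheduler properties from above go here -->
</configuration>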

2. Configure mapred-site.xml file

Add the below properties to the mapred-site.xml file. They set the memory for the MapReduce ApplicationMaster and for the map and reduce task containers, and export HADOOP_MAPRED_HOME to the ApplicationMaster and the map/reduce tasks.

<!-- Memory (in MB) for the MapReduce ApplicationMaster container -->
<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
</property>
<!-- Memory (in MB) for each map task container -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
</property>
<!-- Memory (in MB) for each reduce task container -->
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
</property>
<!-- Make HADOOP_MAPRED_HOME available to the ApplicationMaster and to map/reduce tasks -->
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
</property>
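
If your mapred-site.xml does not already contain it (it may, depending on your initial installation), you typically also need the mapreduce.framework.name property so that MapReduce jobs are submitted to YARN rather than run locally:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>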

3. Configure Data Nodes

Copy the yarn-site.xml and mapred-site.xml files to all of your data nodes (I have 3 data nodes).

Below is an example of copying the files to datanode1 using the scp command; repeat this step for all of your data nodes.
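
The sketch below assumes the default $HADOOP_HOME/etc/hadoop configuration directory, that HADOOP_HOME resolves to the same path on every node, and that passwordless SSH to datanode1 is already set up; adjust the paths and user for your cluster.

# Copy the updated YARN and MapReduce configuration to datanode1
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml datanode1:$HADOOP_HOME/etc/hadoop/
scp $HADOOP_HOME/etc/hadoop/mapred-site.xml datanode1:$HADOOP_HOME/etc/hadoop/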
