
Below are the steps for Apache Hadoop installation on a Linux Ubuntu server. If you have a Windows laptop with enough memory, you can create 4 virtual machines using Oracle VirtualBox and install Ubuntu on these VMs. This article assumes you already have Ubuntu OS running and doesn’t explain how to create VMs or install Ubuntu.

Apache Hadoop is an open-source distributed storage and processing framework used to run jobs over large data sets on commodity hardware. Hadoop natively runs on the Linux operating system. In this article, I will explain the step-by-step installation of Apache Hadoop (version 3.1.1) on a multi-node cluster on Ubuntu (one name node and 3 data nodes).

Hadoop has two main components: storage and processing. HDFS is used for distributed storage, and MapReduce/YARN is used for processing.

HDFS is the distributed file system in Hadoop, and it follows a master-slave architecture.

Name Node is the master node of HDFS and maintains the metadata of the files that are stored on the Data Nodes (slave nodes). The Name Node is a single point of failure in HDFS, meaning a failure of the Name Node brings down the entire Hadoop cluster; hence, the Name Node ideally runs on powerful commercial hardware. Note that the “namenode” Java service runs on the Name Node.

Data Nodes are where the actual file data is stored in HDFS; these are slave nodes and run on commodity hardware. Note that the “datanode” Java service runs on the Data Node servers.

Here are the 4 nodes and their IP addresses I will be referring to in this Apache Hadoop Installation on Linux Ubuntu article. My login user is ubuntu.

IP Address      Host Name
192.168.1.100   namenode
192.168.1.141   datanode1
192.168.1.113   datanode2
192.168.1.118   datanode3

1. Apache Hadoop Installation – Preparation

In this section, I will explain how to install some prerequisite software, libraries, and configurations for Hadoop on Linux Ubuntu OS. Follow and complete the steps below on all nodes of the cluster.

1.1 Update the Source List of Ubuntu

First, update the Ubuntu source list before we start installing Apache Hadoop.


sudo apt-get update

1.2 Install SSH

If you don’t have Secure Shell (SSH), install SSH on the server.


sudo apt-get install ssh
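
If you want to confirm the SSH service is up before continuing, you can check its status (an optional sanity check, not part of the original steps):


sudo systemctl status ssh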

1.3 Set Up Passwordless Login Between the Name Node and All Data Nodes

The name node uses an SSH connection with key-pair authentication to connect to the other nodes and manage the cluster. Hence, let’s generate a key pair using ssh-keygen.


ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The ssh-keygen command creates the files below.


ubuntu@namenode:~$ ls -lrt .ssh/
-rw-r--r-- 1 ubuntu ubuntu 397 Dec 9 00:17 id_rsa.pub
-rw------- 1 ubuntu ubuntu 1679 Dec 9 00:17 id_rsa

Copy id_rsa.pub to authorized_keys under the ~/.ssh folder. Using >> appends the contents of the id_rsa.pub file to authorized_keys.


cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now copy authorized_keys to all data nodes in the cluster. This enables the name node to connect to the data nodes passwordless (without prompting for a password).


scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
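
To confirm that passwordless login works, you can run a quick check from the name node; each command should print the remote hostname without asking for a password (an optional verification step, not part of the original instructions):


ssh datanode1 hostname
ssh datanode2 hostname
ssh datanode3 hostname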

1.4 Add all our nodes to /etc/hosts.


sudo vi /etc/hosts
192.168.1.100 namenode.socal.rr.com namenode
192.168.1.141 datanode1.socal.rr.com datanode1
192.168.1.113 datanode2.socal.rr.com datanode2
192.168.1.118 datanode3.socal.rr.com datanode3
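
You can verify that the hostnames resolve as expected with a quick ping from the name node (an optional check):


ping -c 2 datanode1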

1.5 Install JDK1.8 on all 4 nodes

Apache Hadoop is built on Java, hence it needs Java to run. Install OpenJDK as shown below. If you want to use another JDK, install it according to your needs.


sudo apt-get -y install openjdk-8-jdk-headless

After the JDK installation, check whether it installed successfully by running java -version.
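
The first line of the output should report an OpenJDK 1.8 version (the exact build number depends on your system):


java -version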

2. Download and Install Apache Hadoop

In this section, you will download Apache Hadoop and install it on all nodes in the cluster (1 name node and 3 data nodes).

2.1 Apache Hadoop Installation on all Nodes

Download Apache Hadoop (version 3.1.1, used in this article) with the wget command.


wget http://apache.cs.utah.edu/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

Once your download is complete, extract the archive contents using tar, a file archiving tool for Ubuntu, and rename the extracted folder to hadoop.


tar -xzf hadoop-3.1.1.tar.gz 
mv hadoop-3.1.1 hadoop
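
As an optional check, you can list the renamed directory; it should contain the bin, sbin, and etc folders used in the steps that follow:


ls ~/hadoop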

2.2 Apache Hadoop Configuration – Set Up Environment Variables

Add the Hadoop environment variables to the .bashrc file. Open .bashrc in the vi editor and add the variables below.


vi ~/.bashrc
export HADOOP_HOME=/home/ubuntu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}

Now reload the environment variables into the current session, or close and reopen the shell.


source ~/.bashrc
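
To confirm the variables are picked up, you can print HADOOP_HOME and run the hadoop command from the updated PATH (an optional verification step):


echo $HADOOP_HOME
hadoop version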

3. Configure Apache Hadoop Cluster

Once the Apache Hadoop installation completes, you need to configure it by changing some configuration files. Make the configurations below on the name node and then copy them to all 3 data nodes in the cluster (a copy sketch is shown at the end of section 4).

3.1 Update hadoop-env.sh

Edit the hadoop-env.sh file and set JAVA_HOME.

vi ~/hadoop/etc/hadoop/hadoop-env.sh


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

3.2 Update core-site.xml

Edit the core-site.xml file.

vi ~/hadoop/etc/hadoop/core-site.xml


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.100:9000</value>
    </property>
</configuration>

3.3 Update hdfs-site.xml

vi ~/hadoop/etc/hadoop/hdfs-site.xml


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

3.4 Update yarn-site.xml

vi ~/hadoop/etc/hadoop/yarn-site.xml


<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.resourcemanager.hostname</name>
       <value>192.168.1.100</value>
    </property>
</configuration>

3.5 Update mapred-site.xml 

vi ~/hadoop/etc/hadoop/mapred-site.xml

[Note: This configuration is required only on the name node; however, it does no harm to configure it on the data nodes as well.]


<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>192.168.1.100:54311</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

3.6 Create data folder

Create a data folder and change its ownership to the login user. I’ve logged in as the ubuntu user, so the commands below use ubuntu.


sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown ubuntu:ubuntu -R /usr/local/hadoop/hdfs/data
chmod 700 /usr/local/hadoop/hdfs/data
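
Optionally, verify the ownership and permissions of the folder; the output should show the ubuntu user and drwx------ permissions:


ls -ld /usr/local/hadoop/hdfs/data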

4. Create master and workers files

4.1 Create masters file

The masters file is used by the startup scripts to identify the name node. Open ~/hadoop/etc/hadoop/masters in vi and add your name node IP.

192.168.1.100

4.2 Create workers file

The workers file is used by the startup scripts to identify the data nodes. Open ~/hadoop/etc/hadoop/workers in vi and add all your data node IPs to it.

192.168.1.141
192.168.1.113
192.168.1.118
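
The configuration files edited above all live under ~/hadoop/etc/hadoop on the name node. One simple way to copy them to the data nodes is with scp; this is a sketch that assumes the same ubuntu user and the same ~/hadoop path on every node:


scp ~/hadoop/etc/hadoop/* datanode1:/home/ubuntu/hadoop/etc/hadoop/
scp ~/hadoop/etc/hadoop/* datanode2:/home/ubuntu/hadoop/etc/hadoop/
scp ~/hadoop/etc/hadoop/* datanode3:/home/ubuntu/hadoop/etc/hadoop/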

This completes Apache Hadoop installation and Apache Hadoop Cluster Configuration.

5. Format HDFS and Start Hadoop Cluster

5.1 Format HDFS

HDFS needs to be formatted like any classical file system. On the Name Node server (namenode), run the following command:


hdfs namenode -format

Your Hadoop installation is now configured and ready to run.

5.2 Start HDFS Cluster

Start HDFS by running the start-dfs.sh script from the Name Node server (namenode).


ubuntu@namenode:~$ start-dfs.sh
Starting namenodes on [namenode.socal.rr.com]
Starting datanodes
Starting secondary namenodes [namenode]
ubuntu@namenode:~$

Running the jps command on the namenode should list the following:


ubuntu@namenode:~$ jps
18978 SecondaryNameNode
19092 Jps
18686 NameNode

Running the jps command on the data nodes should list the following:


ubuntu@datanode1:~$ jps
14012 Jps
11242 DataNode

And by accessing http://192.168.1.100:9870 from a browser, you should see the NameNode web UI.


5.3 Upload File to HDFS

Writing to and reading from HDFS is done with the hdfs dfs command. First, manually create your home directory; all other commands will use paths relative to this default home directory. (Note that ubuntu is my logged-in user; if you log in with a different user, use your user ID instead of ubuntu.)


hdfs dfs -mkdir -p /user/ubuntu/

Get a book file from the Gutenberg project:


wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt

Upload the downloaded file to HDFS using the -put option. Create a books directory first, then put the local alice.txt into it:


hdfs dfs -mkdir books
hdfs dfs -put alice.txt books/

List the contents of your HDFS home directory:


hdfs dfs -ls

There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation.
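
For example, to read the uploaded file back or copy it from HDFS to the local file system, you could use -cat and -get (the local file name alice-copy.txt below is just an illustration):


hdfs dfs -cat books/alice.txt | head
hdfs dfs -get books/alice.txt alice-copy.txt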

5.4 Stopping HDFS Cluster

Run the stop-dfs.sh script to stop HDFS.


stop-dfs.sh
Stopping namenodes on [namenode.socal.rr.com]
Stopping datanodes
Stopping secondary namenodes [namenode]
