Below are the steps for installing Apache Hadoop on a Linux Ubuntu server. If you have a Windows laptop with enough memory, you can create four virtual machines using Oracle VirtualBox and install Ubuntu on them. This article assumes you already have Ubuntu running and doesn't explain how to create VMs or install Ubuntu.
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop natively runs on the Linux operating system. In this article, I explain step by step how to install Apache Hadoop (version 3.1.1) on a multi-node cluster on Ubuntu, with one name node and three data nodes.
Hadoop has two main components: storage and processing. HDFS provides the distributed storage, and MapReduce/YARN handles the processing.
HDFS is Hadoop's distributed file system and follows a master-slave architecture.
Name Node is the master node of HDFS and maintains the metadata of the files stored on the Data Nodes (slave nodes). The Name Node is a single point of failure in HDFS, meaning a Name Node failure brings down the entire Hadoop cluster; hence, the Name Node ideally runs on powerful, enterprise-grade hardware. Note that the "namenode" Java service runs on the Name Node.
Data Nodes are where the actual file data is stored in HDFS; these are slave nodes and run on commodity hardware. Note that the "datanode" Java service runs on the Data Node servers.
Here are the 4 nodes and their IP addresses I will be referring to in this Apache Hadoop installation article; my login user is ubuntu.
IP Address | Host Name |
---|---|
192.168.1.100 | namenode |
192.168.1.141 | datanode1 |
192.168.1.113 | datanode2 |
192.168.1.118 | datanode3 |
1. Apache Hadoop Installation – Preparation
In this section, I explain how to install some prerequisite software, libraries, and configurations for Hadoop on Linux Ubuntu. Follow and complete the steps below on all nodes of the cluster.
1.1 Update the Source List of Ubuntu
First, update the Ubuntu source list before we start installing Apache Hadoop.
sudo apt-get update
1.2 Install SSH
If you don't have Secure Shell (SSH), install SSH on the server.
sudo apt-get install ssh
1.3 Setup Passwordless login Between Name Node and all Data Nodes.
The name node uses an SSH connection with key-pair authentication to connect to the other nodes in the cluster and manage them. Hence, let's generate a key pair using ssh-keygen.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The ssh-keygen command creates the files below.
ubuntu@namenode:~$ ls -lrt .ssh/
-rw-r--r-- 1 ubuntu ubuntu 397 Dec 9 00:17 id_rsa.pub
-rw------- 1 ubuntu ubuntu 1679 Dec 9 00:17 id_rsa
Copy id_rsa.pub to authorized_keys under the ~/.ssh folder. Using >> appends the contents of the id_rsa.pub file to authorized_keys.
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now copy authorized_keys to all data nodes in the cluster. This enables the name node to connect to the data nodes passwordless (without prompting for a password).
scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
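To confirm the passwordless login works, you can try running a simple command on one of the data nodes over SSH (shown here with the first data node's IP from the table above); it should print the remote hostname without asking for a password.
ssh 192.168.1.141 hostname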
1.4 Add all our nodes to /etc/hosts.
sudo vi /etc/hosts
192.168.1.100 namenode.socal.rr.com namenode
192.168.1.141 datanode1.socal.rr.com datanode1
192.168.1.113 datanode2.socal.rr.com datanode2
192.168.1.118 datanode3.socal.rr.com datanode3
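A quick way to check that the new entries resolve (host names as per the table at the top of this article):
ping -c 1 datanode1
ping -c 1 datanode2
ping -c 1 datanode3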
1.5 Install JDK1.8 on all 4 nodes
Apache Hadoop is built on Java, hence it needs Java to run. Install OpenJDK as shown below. If you want to use another JDK, do so according to your needs.
sudo apt-get -y install openjdk-8-jdk-headless
After the JDK install, check that it installed successfully by running java -version, as shown below.
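For example (the exact version and build number in the output will differ on your system):
java -version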
2. Download and Install Apache Hadoop
In this section, you will download Apache Hadoop and install it on all nodes in the cluster (1 name node and 3 data nodes).
2.1 Apache Hadoop Installation on all Nodes
Download Apache Hadoop 3.1.1 using the wget command.
wget http://apache.cs.utah.edu/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
Once the download is complete, extract the archive's contents using tar, a file archiving tool, and rename the folder to hadoop.
tar -xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop
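Alternatively, if you prefer to download and extract only once, a minimal sketch for copying the extracted folder from the name node to the data nodes (assuming the same ubuntu user and the passwordless SSH from step 1.3) would be:
for host in datanode1 datanode2 datanode3; do
  scp -r ~/hadoop $host:/home/ubuntu/
done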
2.2 Apache Hadoop configuration – Setup environment variables.
Add the Hadoop environment variables to the .bashrc file. Open .bashrc in the vi editor and add the variables below.
vi ~/.bashrc
export HADOOP_HOME=/home/ubuntu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Now reload the environment variables into the current session, or close and reopen the shell.
source ~/.bashrc
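As a quick check that the variables are in effect (paths assume the layout above):
echo $HADOOP_HOME      # should print /home/ubuntu/hadoop
which hdfs             # should print /home/ubuntu/hadoop/bin/hdfs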
3. Configure Apache Hadoop Cluster
Once the Apache Hadoop installation completes, you need to configure it by editing a few configuration files. Make the configuration changes below on the name node and copy them to all 3 data nodes in the cluster (a sample command for copying them is shown at the end of section 3.5).
3.1 Update hadoop-env.sh
Edit the hadoop-env.sh file and set JAVA_HOME.
vi ~/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
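At this point, a quick sanity check is to print the Hadoop version; it should report 3.1.1 without complaining about JAVA_HOME.
hadoop version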
3.2 Update core-site.xml
Edit core-site.xml file
vi ~/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.100:9000</value>
</property>
</configuration>
3.3 Update hdfs-site.xml
vi ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
3.4 Update yarn-site.xml
vi ~/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.1.100</value>
</property>
</configuration>
3.5 Update mapred-site.xml
vi ~/hadoop/etc/hadoop/mapred-site.xml
[Note: This configuration is required only on the name node; however, it does no harm to configure it on the data nodes as well.]
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>192.168.1.100:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
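As mentioned at the start of section 3, these configuration files must also be present on the data nodes. A minimal sketch for copying them from the name node (assuming the same ubuntu user, the passwordless SSH from step 1.3, and the same ~/hadoop layout on every node):
for host in datanode1 datanode2 datanode3; do
  scp ~/hadoop/etc/hadoop/hadoop-env.sh ~/hadoop/etc/hadoop/core-site.xml ~/hadoop/etc/hadoop/hdfs-site.xml ~/hadoop/etc/hadoop/yarn-site.xml ~/hadoop/etc/hadoop/mapred-site.xml $host:/home/ubuntu/hadoop/etc/hadoop/
done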
3.6 Create data folder
Create a data folder and change its ownership to the login user. I've logged in as the ubuntu user, so you see ubuntu in the commands below.
sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown ubuntu:ubuntu -R /usr/local/hadoop/hdfs/data
chmod 700 /usr/local/hadoop/hdfs/data
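This folder backs dfs.namenode.name.dir and dfs.datanode.data.dir, so it must exist on every node. A sketch for creating it on the data nodes from the name node (assuming the ubuntu user can run sudo without a password prompt there; otherwise run the three commands above on each data node directly):
for host in datanode1 datanode2 datanode3; do
  ssh $host 'sudo mkdir -p /usr/local/hadoop/hdfs/data && sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data && chmod 700 /usr/local/hadoop/hdfs/data'
done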
4. Create master and workers files
4.1 Create master file
The masters file is used by the startup scripts to identify the name node. Open it with vi ~/hadoop/etc/hadoop/masters and add your name node IP.
192.168.1.100
4.2 Create workers file
The workers file is used by the startup scripts to identify the data nodes. Open it with vi ~/hadoop/etc/hadoop/workers and add all your data node IPs.
192.168.1.141
192.168.1.113
192.168.1.118
This completes the Apache Hadoop installation and Apache Hadoop cluster configuration.
5. Format HDFS and Start Hadoop Cluster
5.1 Format HDFS
HDFS needs to be formatted like any classical file system. On Name Node server (namenode), run the following command:
hdfs namenode -format
Your Hadoop installation is now configured and ready to run.
5.2 Start HDFS Cluster
Start HDFS by running the start-dfs.sh script from the Name Node server (namenode).
ubuntu@namenode:~$ start-dfs.sh
Starting namenodes on [namenode.socal.rr.com]
Starting datanodes
Starting secondary namenodes [namenode]
ubuntu@namenode:~$
Running the jps command on namenode should list the following:
ubuntu@namenode:~$ jps
18978 SecondaryNameNode
19092 Jps
18686 NameNode
Running the jps command on the data nodes should list the following:
ubuntu@datanode1:~$ jps
14012 Jps
11242 DataNode
And by accessing http://192.168.1.100:9870 you should see the NameNode web UI.
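You can also confirm from the command line that all three data nodes have joined the cluster; the report includes the number of live data nodes and the configured capacity.
hdfs dfsadmin -report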
5.3 Upload File to HDFS
Writing to and reading from HDFS is done with the hdfs dfs command. First, manually create your home directory; all other commands will use a path relative to this default home directory. (Note that ubuntu is my logged-in user. If you log in with a different user, use your user ID instead of ubuntu.)
hdfs dfs -mkdir -p /user/ubuntu/
Get a books file from the Gutenberg project
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
Create a books directory on HDFS and upload the downloaded file to it using the -put option.
hdfs dfs -mkdir books
hdfs dfs -put alice.txt books/alice.txt
List the uploaded file on HDFS.
hdfs dfs -ls books
There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation
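For example, you can read the uploaded file back or copy it to the local file system (paths follow the example above):
hdfs dfs -cat books/alice.txt | head     # print the first lines of the file stored in HDFS
hdfs dfs -get books/alice.txt alice-copy.txt     # copy it back to the local file system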
5.4 Stopping HDFS Cluster
Run the stop-dfs.sh script to stop HDFS.
stop-dfs.sh
Stopping namenodes on [namenode.socal.rr.com]
Stopping datanodes
Stopping secondary namenodes [namenode]