Below are the steps for installing Apache Hadoop on a Linux Ubuntu server. If you have a Windows laptop with enough memory, you can create four virtual machines using Oracle VirtualBox and install Ubuntu on them. This article assumes you already have Ubuntu running and doesn't explain how to create VMs or install Ubuntu.
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop natively runs on the Linux operating system. In this article, I explain step by step how to install Apache Hadoop (version 3.1.1) on a multi-node cluster on Ubuntu, with one name node and three data nodes.
Hadoop has two main components: storage and processing. HDFS provides the distributed storage, and MapReduce/YARN handles the processing.
HDFS is Hadoop's distributed file system and follows a master-slave architecture.
Name Node is the master node of HDFS and maintains the metadata of the files stored on the Data Nodes (slave nodes). The Name Node is a single point of failure in HDFS, meaning a Name Node failure brings down the entire Hadoop cluster; hence, the Name Node ideally runs on powerful, enterprise-grade hardware. Note that the "namenode" Java service runs on the Name Node.
Data Nodes are where the actual file data is stored in HDFS; these are slave nodes and run on commodity hardware. Note that the "datanode" Java service runs on the Data Node servers.
Here are the 4 nodes and their IP addresses I will be referring to in this Apache Hadoop installation article; my login user is ubuntu.
IP Address | Host Name |
---|---|
192.168.1.100 | namenode |
192.168.1.141 | datanode1 |
192.168.1.113 | datanode2 |
192.168.1.118 | datanode3 |
1. Apache Hadoop Installation – Preparation
In this section, I explain how to install some prerequisite software, libraries, and configurations for Hadoop on Linux Ubuntu. Follow and complete the steps below on all nodes of the cluster.
1.1 Update the Source List of Ubuntu
First, update the Ubuntu source list before we start installing Apache Hadoop.
sudo apt-get update
1.2 Install SSH
If you don't have Secure Shell (SSH), install SSH on the server.
sudo apt-get install ssh
1.3 Setup Passwordless login Between Name Node and all Data Nodes.
The name node uses an SSH connection with key-pair authentication to connect to the other nodes in the cluster and manage them. Hence, let's generate a key pair using ssh-keygen.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The ssh-keygen command creates the files below.
ubuntu@namenode:~$ ls -lrt .ssh/
-rw-r--r-- 1 ubuntu ubuntu 397 Dec 9 00:17 id_rsa.pub
-rw------- 1 ubuntu ubuntu 1679 Dec 9 00:17 id_rsa
Copy id_rsa.pub to authorized_keys under the ~/.ssh folder. Using >> appends the contents of the id_rsa.pub file to authorized_keys.
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now copy authorized_keys to all data nodes in the cluster. This enables the name node to connect to the data nodes passwordless (without prompting for a password).
scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
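To confirm the passwordless login works, you can try running a simple command on one of the data nodes over SSH (shown here with the first data node's IP from the table above); it should print the remote hostname without asking for a password.
ssh 192.168.1.141 hostname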
1.4 Add all our nodes to /etc/hosts.
sudo vi /etc/hosts
192.168.1.100 namenode.socal.rr.com namenode
192.168.1.141 datanode1.socal.rr.com datanode1
192.168.1.113 datanode2.socal.rr.com datanode2
192.168.1.118 datanode3.socal.rr.com datanode3
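A quick way to check that the new entries resolve (host names as per the table at the top of this article):
ping -c 1 datanode1
ping -c 1 datanode2
ping -c 1 datanode3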
1.5 Install JDK1.8 on all 4 nodes
Apache Hadoop is built on Java, hence it needs Java to run. Install OpenJDK as shown below. If you want to use another JDK, do so according to your needs.
sudo apt-get -y install openjdk-8-jdk-headless
After the JDK install, check that it installed successfully by running java -version, as shown below.
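For example (the exact version and build number in the output will differ on your system):
java -version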
2. Download and Install Apache Hadoop
In this section, you will download Apache Hadoop and install it on all nodes in the cluster (1 name node and 3 data nodes).
2.1 Apache Hadoop Installation on all Nodes
Download Apache Hadoop 3.1.1 using the wget command.
wget http://apache.cs.utah.edu/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
Once the download is complete, extract the archive's contents using tar, a file archiving tool, and rename the folder to hadoop.
tar -xzf hadoop-3.1.1.tar.gz
mv hadoop-3.1.1 hadoop
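Alternatively, if you prefer to download and extract only once, a minimal sketch for copying the extracted folder from the name node to the data nodes (assuming the same ubuntu user and the passwordless SSH from step 1.3) would be:
for host in datanode1 datanode2 datanode3; do
  scp -r ~/hadoop $host:/home/ubuntu/
done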
2.2 Apache Hadoop configuration – Setup environment variables.
Add the Hadoop environment variables to the .bashrc file. Open .bashrc in the vi editor and add the variables below.
vi ~/.bashrc
export HADOOP_HOME=/home/ubuntu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Now reload the environment variables into the current session, or close and reopen the shell.
source ~/.bashrc
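As a quick check that the variables are in effect (paths assume the layout above):
echo $HADOOP_HOME      # should print /home/ubuntu/hadoop
which hdfs             # should print /home/ubuntu/hadoop/bin/hdfs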
3. Configure Apache Hadoop Cluster
Once the Apache Hadoop installation completes, you need to configure it by editing a few configuration files. Make the configuration changes below on the name node and copy them to all 3 data nodes in the cluster (a sample command for copying them is shown at the end of section 3.5).
3.1 Update hadoop-env.sh
Edit the hadoop-env.sh file and set JAVA_HOME.
vi ~/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
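At this point, a quick sanity check is to print the Hadoop version; it should report 3.1.1 without complaining about JAVA_HOME.
hadoop version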
3.2 Update core-site.xml
Edit core-site.xml file
vi ~/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.100:9000</value>
</property>
</configuration>
3.3 Update hdfs-site.xml
vi ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
3.4 Update yarn-site.xml
vi ~/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.1.100</value>
</property>
</configuration>
3.5 Update mapred-site.xml
vi ~/hadoop/etc/hadoop/mapred-site.xml
[Note: This configuration is required only on the name node; however, it does no harm to configure it on the data nodes as well.]
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>192.168.1.100:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
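As mentioned at the start of section 3, these configuration files must also be present on the data nodes. A minimal sketch for copying them from the name node (assuming the same ubuntu user, the passwordless SSH from step 1.3, and the same ~/hadoop layout on every node):
for host in datanode1 datanode2 datanode3; do
  scp ~/hadoop/etc/hadoop/hadoop-env.sh ~/hadoop/etc/hadoop/core-site.xml ~/hadoop/etc/hadoop/hdfs-site.xml ~/hadoop/etc/hadoop/yarn-site.xml ~/hadoop/etc/hadoop/mapred-site.xml $host:/home/ubuntu/hadoop/etc/hadoop/
done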
3.6 Create data folder
Create a data folder and change its ownership to the login user. I've logged in as the ubuntu user, so you see ubuntu in the commands below.
sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown ubuntu:ubuntu -R /usr/local/hadoop/hdfs/data
chmod 700 /usr/local/hadoop/hdfs/data
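This folder backs dfs.namenode.name.dir and dfs.datanode.data.dir, so it must exist on every node. A sketch for creating it on the data nodes from the name node (assuming the ubuntu user can run sudo without a password prompt there; otherwise run the three commands above on each data node directly):
for host in datanode1 datanode2 datanode3; do
  ssh $host 'sudo mkdir -p /usr/local/hadoop/hdfs/data && sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data && chmod 700 /usr/local/hadoop/hdfs/data'
done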
4. Create master and workers files
4.1 Create master file
The masters file is used by the startup scripts to identify the name node. Open it with vi ~/hadoop/etc/hadoop/masters and add your name node IP.
192.168.1.100
4.2 Create workers file
The workers file is used by the startup scripts to identify the data nodes. Open it with vi ~/hadoop/etc/hadoop/workers and add all your data node IPs.
192.168.1.141
192.168.1.113
192.168.1.118
This completes the Apache Hadoop installation and Apache Hadoop cluster configuration.
5. Format HDFS and Start Hadoop Cluster
5.1 Format HDFS
HDFS needs to be formatted like any classical file system. On Name Node server (namenode), run the following command:
hdfs namenode -format
Your Hadoop installation is now configured and ready to run.
5.2 Start HDFS Cluster
Start HDFS by running the start-dfs.sh script from the Name Node server (namenode).
ubuntu@namenode:~$ start-dfs.sh
Starting namenodes on [namenode.socal.rr.com]
Starting datanodes
Starting secondary namenodes [namenode]
ubuntu@namenode:~$
Running the jps command on namenode should list the following:
ubuntu@namenode:~$ jps
18978 SecondaryNameNode
19092 Jps
18686 NameNode
Running the jps command on the data nodes should list the following:
ubuntu@datanode1:~$ jps
14012 Jps
11242 DataNode
And by accessing http://192.168.1.100:9870 you should see the NameNode web UI.
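You can also confirm from the command line that all three data nodes have joined the cluster; the report includes the number of live data nodes and the configured capacity.
hdfs dfsadmin -report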
5.3 Upload File to HDFS
Writing to and reading from HDFS is done with the hdfs dfs command. First, manually create your home directory; all other commands will use a path relative to this default home directory. (Note that ubuntu is my logged-in user. If you log in with a different user, use your user ID instead of ubuntu.)
hdfs dfs -mkdir -p /user/ubuntu/
Get a books file from the Gutenberg project
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
Create a books directory on HDFS and upload the downloaded file to it using the -put option.
hdfs dfs -mkdir books
hdfs dfs -put alice.txt books/alice.txt
List the uploaded file on HDFS.
hdfs dfs -ls books
There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation
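For example, you can read the uploaded file back or copy it to the local file system (paths follow the example above):
hdfs dfs -cat books/alice.txt | head     # print the first lines of the file stored in HDFS
hdfs dfs -get books/alice.txt alice-copy.txt     # copy it back to the local file system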
5.4 Stopping HDFS Cluster
Run the stop-dfs.sh script to stop HDFS.
stop-dfs.sh
Stopping namenodes on [namenode.socal.rr.com]
Stopping datanodes
Stopping secondary namenodes [namenode]