Apache Hadoop Installation on Ubuntu (multi-node cluster).

This document explains, step by step, how to install Apache Hadoop (version 3.1.1) on Ubuntu as a multi-node cluster with one name node (master node) and 3 data nodes (slave nodes). Below are the 4 nodes and their IP addresses that I will refer to in this article; my login user is ubuntu.…

Create a Spark RDD using Parallelize

Let's see how to create a Spark RDD with the sparkContext.parallelize() method, using the Spark shell and a Scala example. Before we start, let me explain what an RDD is: a Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark, an immutable distributed collection of objects. Each dataset in RDD is…
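
As a rough illustration of the idea in this excerpt, the sketch below builds an RDD from a local Scala collection with parallelize(). It is a minimal example, not the code from the full article: the object name ParallelizeSketch, the helper buildRdd, the local[2] master, and the sample numbers are all assumptions chosen for illustration, and it presumes Spark is on the classpath.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object ParallelizeSketch {

  // Hypothetical helper: distributes a small local collection as an RDD.
  // numSlices = 2 asks Spark to split the data into two partitions.
  def buildRdd(sc: SparkContext): RDD[Int] =
    sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads, no cluster required.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ParallelizeSketch")
      .getOrCreate()

    val rdd = buildRdd(spark.sparkContext)

    // RDDs are immutable: map returns a new RDD and leaves rdd unchanged.
    val doubled = rdd.map(_ * 2)

    println(s"Number of partitions: ${rdd.getNumPartitions}")
    println(s"Original elements: ${rdd.collect().mkString(", ")}")
    println(s"Doubled elements: ${doubled.collect().mkString(", ")}")

    spark.stop()
  }
}
```

In the Spark shell a SparkContext is already available as sc, so the single line sc.parallelize(Seq(1, 2, 3, 4, 5)) is enough there to try the same thing interactively.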
