Create Java RDD from List Collection

Let’s explore how to create a Java RDD object from a List collection using the JavaSparkContext.parallelize() method, both within the Spark shell and from an application program.

Before we delve into the procedure, let me explain the concept of RDD (Resilient Distributed Dataset). An RDD is a foundational data structure in Spark, representing an immutable distributed collection of objects. Each dataset in an RDD is logically partitioned, enabling computation on different nodes within the cluster.

Spark Java RDDs can be created from various data sources such as in-memory collections, external storage (e.g., HDFS, S3), or by transforming existing RDDs. In this article, I will explain these with examples.

Before proceeding with this article, make sure you have created a Spark Java project in IntelliJ and can build it with Maven without issues.

Create Java RDD from Shell

spark-shell is an interactive command-line shell provided by Apache Spark for running Spark applications in an interactive, REPL (Read-Eval-Print Loop) style. It is designed for exploratory data analysis and development, making it easy for developers and data scientists to prototype and test Spark code. Let’s use this to create a Java RDD.

Run spark-shell from the command line to launch it; this starts an interactive session where you can interact with Spark.

Note that spark-shell is commonly associated with the Scala programming language, but we can use it with Java APIs as well, with some exceptions.


// Import JavaSparkContext
import org.apache.spark.api.java.JavaSparkContext

// Create context
val jsc = new JavaSparkContext(sc)

// Create Java RDD
val data: java.util.List[Int] = java.util.Arrays.asList(1, 2, 3, 4, 5)
val rdd = jsc.parallelize(data)

// Print RDD to console
rdd.foreach(System.out.println)

When you execute these statements in spark-shell, the RDD contents are printed to the console. I will explain what each statement does in the next program.

Create Java RDD (JavaRDD) from a List

Use JavaSparkContext.parallelize() to create an RDD (JavaRDD) from a List collection in Spark Java by passing the collection as an argument to this method. The method also has another signature that additionally takes an integer argument to specify the number of partitions. Partitions are the basic units of parallelism in Apache Spark; an RDD is a collection of partitions that can be processed on different executors in parallel.


// Create Java RDD Example 
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkJavaExample {
    public static void main(String[] args) {

        // Create SparkSession
        SparkSession spark = SparkSession.builder()
                .appName("sparkbyexamples.com")
                .master("local[*]")
                .getOrCreate();

        // Create Java SparkContext
        JavaSparkContext jsc = new JavaSparkContext(
                spark.sparkContext());
        
        // Create RDD
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> rdd = jsc.parallelize(data);

        // Print rdd object
        System.out.println(rdd);
        
        // Print rdd contents to console
        rdd.collect().forEach(System.out::println);

        // Stop the SparkSession and JavaSparkContext
        spark.stop();
        jsc.stop();
    }
}

In this example, we first create a SparkSession and then create a JavaSparkContext using spark.sparkContext(). This allows you to work with both structured and unstructured data in a Spark application.

Then, we create a List collection of integers and use it to create a JavaRDD with JavaSparkContext.parallelize(data). This actually creates a JavaRDD of type Integer.

Next, we print the rdd object to check its type, and then print the contents of the Java RDD.

Finally, it’s a good practice to stop both the SparkSession and the JavaSparkContext when you are finished with your Spark Java application to release the allocated resources.
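As mentioned earlier, parallelize() also has a signature that takes the number of partitions as a second argument. Below is a minimal sketch of that overload, assuming the jsc and data variables from the example above; the partition count of 3 is just an illustrative value.


        // Create RDD with an explicit number of partitions
        JavaRDD<Integer> partitionedRdd = jsc.parallelize(data, 3);

        // Check how many partitions the RDD was split into
        System.out.println("Partitions: " + partitionedRdd.getNumPartitions());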

Let’s create another Java RDD, this time from a list of String arrays, converting each array into a Row.


        // Another RDD example
        List<String[]> dataList = new ArrayList<>();
        dataList.add(new String[] { "California", "CA" });
        dataList.add(new String[] { "New York", "NY" });
        
        // Create RDD of Rows by converting each String[] into a Row
        JavaRDD<Row> rdd2 = jsc.parallelize(dataList)
                .map((String[] row) -> RowFactory.create(row));
        
        // Print rdd object
        System.out.println(rdd2);
        
        // Print RDD contents to console
        rdd2.collect().forEach(System.out::println);
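Since each element of rdd2 is now a Row, you can also read individual fields by position. Here is a minimal sketch, assuming the rdd2 variable created above:


        // Access Row fields by position (0 = state name, 1 = state code)
        rdd2.collect().forEach(row ->
                System.out.println(row.getString(0) + " -> " + row.getString(1)));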

Java RDD Transformations & Actions

Transformations and actions are the operations you can perform on Resilient Distributed Datasets (RDDs) and DataFrames. Every operation falls into one of these two categories.

Transformations are operations that produce a new RDD from an existing one. Examples of transformations include map, filter, flatMap, and groupByKey. Transformations are lazily evaluated, meaning they don’t execute immediately; instead, they build up a logical execution plan. Here, I use the map() transformation.


        // Use map() transformation
        JavaRDD<Integer> rdd3 = rdd.map(val -> val * 2);
        rdd3.collect().forEach(System.out::println);
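Other transformations follow the same pattern. For example, here is a minimal sketch that uses filter() to keep only the even numbers from the same rdd:


        // Use filter() transformation to keep only even numbers
        JavaRDD<Integer> evenRdd = rdd.filter(val -> val % 2 == 0);
        evenRdd.collect().forEach(System.out::println);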

Actions are operations that trigger the execution of the logical execution plan and return a value to the driver program or write data to an external system. Examples of actions include count, collect, saveAsTextFile, and reduce. Let’s use the reduce() action on the RDD.


        // Use reduce()
        int result = rdd3.reduce(Integer::sum);
        System.out.println("Total Sum: "+result);

JavaRDD from a Text File

To create a JavaRDD from a text file in Apache Spark, you can use the JavaSparkContext and its textFile() method. Here’s an example of how to do it.

Make sure to adjust the "path/to/your/text/file.txt" to point to the location of your text file. This code will read the text file and create an RDD where each line of the file becomes an element in the RDD.


import org.apache.spark.sql.*;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaRDDFromTextFile {
    public static void main(String[] args) {
        // Create SparkSession
        SparkSession spark = SparkSession.builder()
                .appName("sparkbyexamples.com")
                .master("local[*]")
                .getOrCreate();

        // Create Java SparkContext
        JavaSparkContext jsc = new JavaSparkContext(
                spark.sparkContext());

        // Specify the path to the text file you want to create an RDD from
        String filePath = "path/to/your/text/file.txt";

        // Create a JavaRDD from the text file
        JavaRDD<String> textFileRDD = jsc.textFile(filePath);

        // Perform operations on the RDD (e.g., transformations, actions)
        long lineCount = textFileRDD.count();
        System.out.println("Total lines in the file: " + lineCount);

        // Stop the SparkSession and JavaSparkContext
        spark.stop();
        jsc.stop();
    }
}

Here,

  • We specify the path to the text file that we want to create an RDD from. Replace "path/to/your/text/file.txt" with the actual file path.
  • We use the textFile() method to create a JavaRDD of strings from the text file.
  • You can then perform various operations on the textFileRDD, such as counting the number of lines, applying transformations, or performing other actions, as sketched below.
  • Finally, we stop the SparkSession and JavaSparkContext to release resources.
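For instance, here is a minimal sketch that filters out blank lines and counts the rest, assuming the textFileRDD variable from the example above:


        // Filter out blank lines and count the remaining ones
        JavaRDD<String> nonEmptyLines = textFileRDD.filter(line -> !line.trim().isEmpty());
        System.out.println("Non-empty lines: " + nonEmptyLines.count());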

Conclusion

I hope now you have a better understanding of RDD and how to create Java RDD (JavaRDD) from a collection list and text file. Once you have an RDD, you can apply transformation and action operations.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium