

To create a Java DataFrame, you’ll need to use the SparkSession, which is the entry point for working with structured data in Spark, and call SparkSession.createDataFrame() to create one from a List or Map collection. Besides this, you can also use toDF() to derive a DataFrame from an RDD or from another DataFrame.
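As a quick illustration, toDF() can rename the columns of an existing DataFrame. The sketch below is a minimal, self-contained example; the class name and data are illustrative, not part of the article's project:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class ToDFExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("toDF-example")
                .master("local[*]")
                .getOrCreate();

        // A small DataFrame with generic column names
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("col1", DataTypes.StringType, true),
                DataTypes.createStructField("col2", DataTypes.StringType, true)));
        Dataset<Row> df = spark.createDataFrame(
                Arrays.asList(RowFactory.create("California", "CA")), schema);

        // toDF() returns a new DataFrame with the given column names
        Dataset<Row> renamed = df.toDF("state", "state_code");
        renamed.printSchema();

        spark.stop();
    }
}
```

Note that toDF() does not modify the original DataFrame; like all Dataset operations, it returns a new one.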

You can also create a Java DataFrame from different sources like Text, CSV, JSON, XML, Parquet, Avro, ORC, and binary files, as well as RDBMS tables, Hive, HBase, and many more.

Before proceeding with this article, make sure you have created a Spark Java project in IntelliJ and can build it with Maven without issues; then follow along to create a Java DataFrame in Spark.

Create Java DataFrame from JavaRDD

The simplest way to create a Java DataFrame is by using createDataFrame(), which takes a JavaRDD<Row> and a schema defining the column names as arguments. You can create the schema using StructType and StructField.

Here’s an example of how to create a simple DataFrame using Apache Spark’s Java API. First, you need to set up your SparkSession, SparkContext, define the schema, and then populate the DataFrame.


package com.sparkbyexamples;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.List;

public class CreateDF {
    public static void main(String[] args) {

        // Create SparkSession
        SparkSession spark = SparkSession.builder()
                .appName("sparkbyexamples.com")
                .master("local[*]")
                .getOrCreate();

        // Create Java SparkContext
        JavaSparkContext jsc = new JavaSparkContext(
                spark.sparkContext());

        // Create collection
        List<String[]> dataList = new ArrayList<>();
        dataList.add(new String[] { "California", "CA" });
        dataList.add(new String[] { "New York", "NY" });

        // Create RDD
        JavaRDD<Row> rdd = jsc.parallelize(dataList)
                .map((String[] row) -> RowFactory.create(row));

        // Create StructType schema
        List<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("col1", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("col2", DataTypes.StringType, true));
        StructType structType = DataTypes.createStructType(fields);

        // Create DataFrame
        Dataset<Row> df = spark.createDataFrame(rdd, structType);
        df.show();

        // Stop the SparkSession (this also stops the underlying SparkContext)
        spark.stop();
    }
}

This yields the output below. To run this in a distributed environment, you need to configure Spark for your specific cluster, which may involve replacing the "local[*]" master setting and adjusting other settings according to your cluster setup.


// Output:
+----------+----+
|      col1|col2|
+----------+----+
|California|  CA|
|  New York|  NY|
+----------+----+
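In practice, a cluster deployment usually does not hard-code the master URL at all; it is supplied by spark-submit instead. A minimal configuration sketch (the appName is illustrative):

```java
// When submitting with spark-submit, omit .master() in code and pass the
// master on the command line instead, for example:
//   spark-submit --master yarn --class com.sparkbyexamples.CreateDF app.jar
SparkSession spark = SparkSession.builder()
        .appName("sparkbyexamples.com")
        .getOrCreate();
```

Keeping the master out of the code lets the same jar run locally and on a cluster without recompilation.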

Create Java DataFrame from List Collection

Alternatively, you can create a Java DataFrame without creating a JavaRDD first. To do so, create a list of objects and pass the object's class as the schema argument during creation.


   // Imports
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import java.util.Arrays;
   import java.util.List;

   // Create a list of Java objects
   List<Student> data = Arrays.asList(
          new Student("Scott", 56),
          new Student("Mike", 45),
          new Student("Robert", 26)
   );

   // Create a DataFrame directly from the list of objects,
   // using the Student class as the schema
   Dataset<Row> df = spark.createDataFrame(data, Student.class);
   df.show();

Notice that in the above example we have created a Student class and used it as the schema while creating the Java DataFrame.


public static class Student {
        private String name;
        private int age;

        public Student(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() {
            return name;
        }

        public int getAge() {
            return age;
        }
}

In this example:

  1. We define a simple Student class to represent the data structure.
  2. We create a list of Student objects directly with the desired data.
  3. We create a DataFrame directly from the list of objects using spark.createDataFrame(data, Student.class).
  4. Finally, we use df.show() to display the contents of the DataFrame.

This approach is more convenient and eliminates the need to create an RDD explicitly.

This example yields the below output.


// Output:
+------+---+
|  name|age|
+------+---+
| Scott| 56|
|  Mike| 45|
|Robert| 26|
+------+---+
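If you prefer explicit column types over a bean class, createDataFrame() also accepts a List<Row> together with a StructType, skipping both the RDD and the bean. The sketch below is an assumption-light variant built on the same student data and an existing SparkSession named spark:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

// Rows holding the same student data
List<Row> rows = Arrays.asList(
        RowFactory.create("Scott", 56),
        RowFactory.create("Mike", 45),
        RowFactory.create("Robert", 26));

// Explicit schema: column names and types
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.IntegerType, true)));

// createDataFrame also accepts a List<Row> plus a schema
Dataset<Row> df2 = spark.createDataFrame(rows, schema);
df2.show();
```

This variant gives you full control over column names and nullability, which the bean-based approach infers from getters.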

Create Java DataFrame from Existing DataFrame

You can create a new Java DataFrame from an existing DataFrame by applying transformations, filters, and other operations to the original DataFrame. Here’s an example of how to create a new DataFrame from an existing one by selecting specific columns.


    // Create a new DataFrame by selecting specific columns
    Dataset<Row> newDataFrame = df.select("name");

    // Show the new DataFrame
    newDataFrame.show();
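Beyond selecting columns, you can derive new DataFrames with filters and computed columns; DataFrames are immutable, so each operation returns a new Dataset and leaves the original unchanged. A minimal sketch, assuming df is the Student DataFrame from the earlier example:

```java
import static org.apache.spark.sql.functions.col;

// Keep only students older than 30; the original df is unchanged
Dataset<Row> adults = df.filter(col("age").gt(30));

// Add a derived boolean column computed from an existing one
Dataset<Row> withFlag = adults.withColumn("senior", col("age").gt(50));
withFlag.show();
```

Chaining such transformations is the idiomatic way to build new DataFrames from existing ones.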

Conclusion

In Java, you can create a DataFrame using libraries like Apache Spark’s DataFrame API, Apache Hadoop’s Hive, or third-party libraries like Apache Arrow and Apache Cassandra. One of the most popular libraries for working with DataFrames in Java is Apache Spark. I hope you learned how to create a Java DataFrame from an RDD and from a collection list with the examples above.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium