Create Spark Java Project in IntelliJ with Maven

How do you create a Spark Java project in IntelliJ and run a Maven build? Running Apache Spark in Java is a viable option, and it can be a good choice depending on your project’s requirements and your team’s familiarity with Java. Apache Spark supports multiple programming languages, including Scala, Python, and Java. This article is a step-by-step guide to running Spark in Java with IntelliJ and Maven.

To create a Spark Java project in IntelliJ IDEA and build it with Maven, follow these steps:

Step 1: Install IntelliJ IDEA: If you haven’t already, download and install IntelliJ IDEA from the official website. You can use the free Community edition or the Ultimate edition for more advanced features.

Step 2: Install Java: Make sure you have Java Development Kit (JDK) installed on your system. You can download it from the Oracle website or use OpenJDK.
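
To confirm which JDK your code actually runs on, a small check like the one below can help. It is only a sketch: the class name JavaVersionCheck and the helper majorVersion are made up for illustration, and it assumes the java.version system property follows either the legacy form (for example 1.8.0_292) or the newer form (for example 11.0.2).

```java
// Hypothetical helper (not part of Spark or IntelliJ): reports the JDK major version.
class JavaVersionCheck {

    // Extracts the major version from a java.version string.
    // Assumes the legacy "1.x" scheme (Java 8 and older) or the newer "N.x" scheme.
    static int majorVersion(String version) {
        String[] parts = version.split("[._+-]");
        int first = Integer.parseInt(parts[0]);
        if (first == 1 && parts.length > 1) {
            // Legacy scheme: "1.8.0_292" means Java 8
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("Major version = " + majorVersion(version));
    }
}
```

If the major version printed is not the JDK you intended to use (JDK 11 in this guide), check the SDK selected under “File” -> “Project Structure.”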

Step 3: Create a New Project: Open IntelliJ IDEA and create a new Java project:

  • Click on “File” -> “New” -> “Project.”
  • In the New Project window, fill in the Name, Location, Language, Build system, and JDK (choose JDK 11).
  • Make sure you select Java for the Language and Maven for the Build system.
  • From the Advanced Settings, fill out the GroupId and ArtifactId.

Step 4: Add Spark Dependency: In your pom.xml (Maven project file), add the Apache Spark dependencies.


<!-- Add the following to your pom.xml file -->

    <dependencies>
        <!-- Spark dependencies -->
        <!-- Use the appropriate version for your setup -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.5.0</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.5.0</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>

IntelliJ IDEA should automatically detect the changes and offer to import the Maven changes. If not, you can right-click on the pom.xml file and select Maven -> Reload project.
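
One way to verify that the reload actually put Spark on the classpath is to try loading a class from each dependency. The following is a minimal sketch, not an official Spark or IntelliJ utility: SparkClasspathCheck and isOnClasspath are hypothetical names, and the two class names checked are the entry points used later in this article.

```java
// Hypothetical helper: checks whether the Spark classes resolved by Maven are visible at runtime.
class SparkClasspathCheck {

    // Returns true if the named class can be loaded (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SparkClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Entry-point classes from spark-core and spark-sql, respectively
        String[] classes = {
            "org.apache.spark.api.java.JavaSparkContext",
            "org.apache.spark.sql.SparkSession"
        };
        for (String name : classes) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If either class is reported as missing, the dependencies were probably not imported; reloading the Maven project as described above is the first thing to try.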

Step 5: Create a Spark Java Class: Create a new Java class that will serve as your Spark application. For example, you can create a class named SparkJavaExample.

Step 6: Write Your Spark Code: Write your Spark code in the SparkJavaExample class. Make sure to import the necessary Spark classes and set up your SparkSession and JavaSparkContext as needed. Below is an example that shows how to create a Java RDD in Spark.


// Create Java RDD Example 
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkJavaExample {
    public static void main(String[] args) {

        // Create SparkSession
        SparkSession spark = SparkSession.builder()
                .appName("sparkbyexamples.com")
                .master("local[*]")
                .getOrCreate();

        // Create Java SparkContext
        JavaSparkContext jsc = new JavaSparkContext(
                spark.sparkContext());
        
        // Create RDD
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> rdd = jsc.parallelize(data);

        // Print rdd object
        System.out.println(rdd);
        
        // Print rdd contents to console
        rdd.collect().forEach(System.out::println);
        
        // Another RDD example
        List<String[]> dataList = new ArrayList<>();
        dataList.add(new String[] { "California", "CA" });
        dataList.add(new String[] { "New York", "NY" });
        
        // Create RDD of Rows; the cast makes each String[] supply the Row's fields
        JavaRDD<Row> rdd2 = jsc.parallelize(dataList)
                .map((String[] row) -> RowFactory.create((Object[]) row));
        
        // Print rdd object
        System.out.println(rdd2);
        
        // Print RDD contents to console
        rdd2.collect().forEach(System.out::println);

        // Stop the SparkSession; this also stops the underlying SparkContext
        spark.stop();
    }
}

Step 7: Configure Run/Debug Configuration: Configure the run/debug settings in IntelliJ IDEA:

  • Click “Run” -> “Edit Configurations…”
  • Click the “+” button to add a new configuration and select “Application.”
  • Set the main class to your Spark application class (SparkJavaExample in this case).
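
The example in Step 6 hard-codes .master("local[*]"), which runs Spark inside the same JVM with as many worker threads as there are logical cores. If you would rather control this from the run configuration, one possible approach is to read the value from the “Program arguments” field. The sketch below assumes a hypothetical helper, resolveMaster, which is not part of the Spark API; it only picks the string you would then pass to SparkSession.builder().master(...).

```java
// Hypothetical helper: chooses the Spark master URL from the program arguments.
class MasterFromArgs {

    // Default used by the example in this article
    static final String DEFAULT_MASTER = "local[*]";

    // Returns the first non-blank program argument, or the default if none was given.
    static String resolveMaster(String[] args) {
        if (args != null && args.length > 0 && !args[0].trim().isEmpty()) {
            return args[0].trim();
        }
        return DEFAULT_MASTER;
    }

    public static void main(String[] args) {
        String master = resolveMaster(args);
        // In the Spark application, this value would go to SparkSession.builder().master(master)
        System.out.println("Spark master = " + master);
        System.out.println("Logical cores reported by the JVM = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

With local[2] entered in the program arguments, the helper returns local[2]; with the field left empty, it falls back to local[*].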

Step 8: Run Your Spark Application: Click the green “Run” button to execute your Spark application. It will build the Maven project and run your Spark code.

Step 9: View Output: You can view the output of your Spark application in the IntelliJ IDEA console.


Conclusion

That’s it! You’ve created a Spark Java project in IntelliJ IDEA and successfully run a Maven build. Make sure to adjust the Spark version, Java version, and other dependencies in your pom.xml and Spark code as needed for your specific project requirements.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen on LinkedIn and Medium.