
How can I remove the header from a Spark DataFrame/Dataset? We often need to remove the header (column names) when writing a DataFrame/Dataset to a CSV file. The header row contains the column names of the DataFrame and is typically the first row in the output file. Whether to include the header depends on the use case, but in some cases removing it makes the output more suitable for further processing or analysis.

In this article, we discuss how to remove headers from a Spark DataFrame, along with high-level details about the Spark DataFrame header.

1. Header in Spark DataFrame

In Spark Scala, the header of a DataFrame/Dataset refers to its column names, which appear as the first row when the data is displayed or written out. The header provides descriptive labels for the data in each column and makes the DataFrame more readable and easier to work with.

For example, consider the following DataFrame:


// Output:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

In this DataFrame, the header row contains the column names first_name, last_name, and age. These names are used to identify and access the data in each column.

Note that not all data sources carry headers. For example, if you load a CSV file that does not contain a header row, Spark assigns default column names (e.g. _c0, _c1, etc.) to the DataFrame. In this case, you can assign meaningful column names using the toDF method with a list of column names as arguments.
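As a sketch of that scenario, the snippet below reads a headerless CSV file (the path data/people.csv is a hypothetical placeholder) and then assigns descriptive column names with toDF:

```scala
// Imports
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
                        .appName("AddHeaderExample")
                        .master("local[*]")
                        .getOrCreate()

// Read a CSV file that has no header row; Spark names the columns _c0, _c1, ...
val raw: DataFrame = spark.read
  .option("header", false)
  .csv("data/people.csv") // hypothetical input path

// Assign descriptive column names with toDF
val people: DataFrame = raw.toDF("first_name", "last_name", "age")
people.printSchema()
```

After toDF, the data is unchanged; only the column names in the schema are replaced.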

2. Remove the Header while writing to a CSV file

In Spark, you can control whether to write the header row when writing a DataFrame to a CSV file by using the header option. When header is set to true, Spark includes the header row in the output file. When header is set to false (the default when writing), Spark excludes the header row from the CSV output.

Related: Spark Write CSV File with Different Options

Here is an example code snippet:


// Imports
import org.apache.spark.sql.{DataFrame, SparkSession}

// Create Spark Session
val spark = SparkSession.builder()
                        .appName("RemoveHeaderExample")
                        .master("local[*]")
                        .getOrCreate()

// Create a DataFrame with name and age columns
val data: Seq[(String, Int)] = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val df: DataFrame = spark.createDataFrame(data).toDF("name", "age")

// Write the Spark DataFrame to a CSV file without the header
df.write
  .option("header", false) // disable writing header
  .csv("output.csv")

In this example,

  • The DataFrame is created from a sequence of tuples and has two columns: name and age.
  • The csv method is called on the DataFrameWriter returned by df.write to write the DataFrame to a CSV file.
  • The option method is used to pass options to the CSV writer.
  • Here, we set the header option to false to disable writing the header row.

Note that when you read the CSV file later, you may need to specify the schema manually, as Spark may infer the schema from the data and assume that the first row is a header. You can do this by using the schema method of the DataFrameReader object and passing the schema as a StructType object.
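For instance, here is a minimal sketch of reading the headerless output back with an explicit schema, assuming the spark session and the output.csv path from the example above, with field names and types taken from that example DataFrame:

```scala
// Imports
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Define the schema explicitly, since the file no longer carries column names
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Read the headerless CSV back with the explicit schema;
// header=false tells Spark to treat the first row as data
val dfBack = spark.read
  .option("header", false)
  .schema(schema)
  .csv("output.csv")
dfBack.printSchema()
```

Passing a schema also avoids a second pass over the data that schema inference would otherwise require.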

3. Remove Header from Spark DataFrame

There are different ways to remove headers from a Spark DataFrame, depending on the use case and the specific requirements of the task at hand. The examples below are not frequently used; however, I cover them here as they are among the available options.

3.1 Using withColumnRenamed() method with an empty string

Let’s use withColumnRenamed() to rename all columns to empty strings; this approach also shows how to rename multiple columns in Spark.


// Imports
import org.apache.spark.sql.{SparkSession, DataFrame}

// Creating Spark Session
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

// Create DataFrame
val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()

// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

val dfWithoutColumns = columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, ""))
dfWithoutColumns.show()
//Results
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the foldLeft method over the sequence of column names (i.e., the header), applying withColumnRenamed to rename each column of the DataFrame to an empty string.

3.2 Using select() method with empty column aliases


// Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()

// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

// Rename all columns to empty strings
val dfWithoutColumns2 = df.select(df.columns.map(c => col(c).alias("")): _*)
dfWithoutColumns2.show()

// Results:
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the select method with the full list of columns, aliasing each one to an empty string. We iterate over df.columns to get the column names, map each name to the corresponding Column using the col function, rename it with alias(""), and finally use the : _* notation to pass the sequence as individual arguments to select.

3.3 Using toDF() method with empty column names


// Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}

// Create Session
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()

val data = Seq(
  ("John", "Doe", 25),
  ("Jane", "Doe", 30),
  ("Bob", "Smith", 45)
)

val columns = Seq("first_name", "last_name", "age")

// Create DataFrame
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()

// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      John|      Doe| 25|
|      Jane|      Doe| 30|
|       Bob|    Smith| 45|
+----------+---------+---+

// Rename all columns to empty strings
val dfWithoutColumns3 = df.toDF(Seq.fill(df.columns.length)(""): _*)
dfWithoutColumns3.show()

// Results:
+----+-----+---+
|    |     |   |
+----+-----+---+
|John|  Doe| 25|
|Jane|  Doe| 30|
| Bob|Smith| 45|
+----+-----+---+

In this example, we use the toDF method with a sequence of empty strings, one per column. This creates a new DataFrame with the same data as the original but with blank column names.

4. Conclusion

Removing the header or column names from a Spark DataFrame is not a common operation, as it can make the data harder to work with. In practice, you usually want to keep at least some descriptive information about the columns. However, sometimes you may be required to drop the column names while writing to a CSV file.
