How can I remove the header from a Spark DataFrame/Dataset? We often need to remove the header or column names when writing a DataFrame/Dataset output to a CSV file. The header row contains the column names of the DataFrame, and it is typically the first row in the output file. Whether to include the header row depends on the specific use case, but in some cases removing it makes the output more suitable for further processing or analysis.
In this article, we shall discuss how to remove headers from a Spark DataFrame, along with high-level details on the Spark DataFrame header.
1. Header in Spark DataFrame
In Spark Scala, a header in a DataFrame/Dataset refers to the row of column names shown above the data. The header provides descriptive labels for the data in each column and helps make the DataFrame more readable and easier to work with.
For example, consider the following DataFrame:
// Output:
+-----------+----------+---+
|first_name | last_name|age|
+-----------+----------+---+
| John | Doe| 25|
| Jane | Doe| 30|
| Bob | Smith| 45|
+-----------+----------+---+
In this DataFrame, the header row contains the column names first_name, last_name, and age. These names are used to identify and access the data in each column.
Note that not all DataFrames have headers. For example, if you load a CSV file that does not contain a header row, Spark creates default column names (e.g., _c0, _c1, etc.) for the DataFrame. In this case, you may need to manually assign column names to the DataFrame using the toDF method with a list of column names as arguments.
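To sketch that round trip (the temporary path here is illustrative, not from the article), you can write a small header-less CSV, read it back to see the default _c0-style names, and then attach descriptive names with toDF:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AddHeaderExample")
  .master("local[*]")
  .getOrCreate()

// Write a small header-less CSV so the example is self-contained
val outDir = Files.createTempDirectory("no-header-csv").toString + "/people"
spark.createDataFrame(Seq(("John", "Doe", 25), ("Jane", "Doe", 30)))
  .write.option("header", "false").csv(outDir)

// Reading it back without a header: Spark assigns _c0, _c1, _c2
val raw = spark.read.option("header", "false").csv(outDir)

// Assign descriptive column names with toDF
val named = raw.toDF("first_name", "last_name", "age")
```

Here toDF does not touch the data; it only returns a new DataFrame with the given column names in place of the defaults.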
2. Remove the Header while writing to a CSV file
In Spark, you can control whether or not to write the header row when writing a DataFrame to a CSV file by using the header option. When the header option is set to true, Spark includes the header row in the output file. When it is set to false (the default when writing), Spark excludes the header row from the CSV output.
Related: Spark Write CSV File with Different Options
Here is an example code snippet:
// Imports
import org.apache.spark.sql.{DataFrame, SparkSession}
// Create Spark Session
val spark = SparkSession.builder()
.appName("RemoveHeaderExample")
.master("local[*]")
.getOrCreate()
// Create a DataFrame with name and age columns
val data: Seq[(String, Int)] = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val df: DataFrame = spark.createDataFrame(data).toDF("name", "age")
// Write the Spark DataFrame to a CSV file without a header
df.write
.option("header", false) // disable writing header
.csv("output.csv")
In this example,
- The DataFrame is created from a sequence of tuples and has two columns: name and age.
- The write method on df returns a DataFrameWriter object, which is used to write the DataFrame to a CSV file.
- The option method is used to pass options to the CSV writer; here, we set the header option to false to disable writing the header row.
Note that when you read the CSV file later, you may need to specify the schema manually, as Spark may otherwise infer the schema from the data or assume that the first row is a header. You can do this by using the schema method of the DataFrameReader object and passing the schema as a StructType object.
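As a hedged sketch of that read-back step (the temporary path is illustrative, and the sketch writes its own header-less file first so it is self-contained), specifying an explicit schema might look like:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("ReadWithoutHeaderExample")
  .master("local[*]")
  .getOrCreate()

// Write a header-less CSV to a temp location (keeps the sketch self-contained)
val path = Files.createTempDirectory("csv-demo").toString + "/people"
spark.createDataFrame(Seq(("Alice", 25), ("Bob", 30))).toDF("name", "age")
  .write.option("header", "false").csv(path)

// Read it back with an explicit schema instead of relying on inference
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val dfBack = spark.read
  .schema(schema)
  .option("header", "false")
  .csv(path)
```

With an explicit schema, every row in the file is treated as data, and the column names and types come from the StructType rather than from the file.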
3. Remove Header from Spark DataFrame
There are different ways to remove headers from a Spark DataFrame, depending on the use case and the specific requirements of the task at hand. Here are some of the ways to remove Headers from Spark DataFrame.
The examples below are not frequently used; however, I would like to cover them, as they are options worth knowing.
3.1 Using withColumnRenamed() method with an empty string
Let’s use the withColumnRenamed() to rename all columns to empty, by using this you can also rename multiple columns in Spark.
// Imports
import org.apache.spark.sql.{SparkSession, DataFrame}
// Creating Spark Session
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()
// Create DataFrame
val data = Seq(
("John", "Doe", 25),
("Jane", "Doe", 30),
("Bob", "Smith", 45)
)
val columns = Seq("first_name", "last_name", "age")
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()
// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
| John| Doe| 25|
| Jane| Doe| 30|
| Bob| Smith| 45|
+----------+---------+---+
val dfWithoutColumns = columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, ""))
dfWithoutColumns.show()
//Results
+----+-----+---+
| | | |
+----+-----+---+
|John| Doe| 25|
|Jane| Doe| 30|
| Bob|Smith| 45|
+----+-----+---+
In this example, we use the foldLeft method to iterate over the sequence of column names (i.e., the header columns), starting with the original DataFrame as the accumulator, and call withColumnRenamed at each step to rename every column to an empty string.
3.2 Using select() method with empty column aliases
// Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()
val data = Seq(
("John", "Doe", 25),
("Jane", "Doe", 30),
("Bob", "Smith", 45)
)
val columns = Seq("first_name", "last_name", "age")
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()
// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
| John| Doe| 25|
| Jane| Doe| 30|
| Bob| Smith| 45|
+----------+---------+---+
// Remove all column names by aliasing each column to an empty string
import org.apache.spark.sql.functions.col
val dfWithoutColumns2 = df.select(df.columns.map(c => col(c).as("")): _*)
dfWithoutColumns2.show()
// Results:
+----+-----+---+
| | | |
+----+-----+---+
|John| Doe| 25|
|Jane| Doe| 30|
| Bob|Smith| 45|
+----+-----+---+
In this example, we use the select method with a sequence of columns, each aliased to an empty string. We build this sequence by mapping over df.columns, wrapping each name with the col function and calling as("") on it, and then use the : _* notation to pass the sequence as individual arguments to the select method.
3.3. Using toDF() method with empty column names
// Spark Imports
import org.apache.spark.sql.{SparkSession, DataFrame}
// Create Session
val spark = SparkSession.builder().appName("Remove All Column Names Example").getOrCreate()
val data = Seq(
("John", "Doe", 25),
("Jane", "Doe", 30),
("Bob", "Smith", 45)
)
val columns = Seq("first_name", "last_name", "age")
// Create DataFrame
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)
df.show()
// Results:
+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
| John| Doe| 25|
| Jane| Doe| 30|
| Bob| Smith| 45|
+----------+---------+---+
// Remove all column names by replacing each with an empty string
val dfWithoutColumns3 = df.toDF(df.columns.map(_ => ""): _*)
dfWithoutColumns3.show()
// Results:
+----+-----+---+
| | | |
+----+-----+---+
|John| Doe| 25|
|Jane| Doe| 30|
| Bob|Smith| 45|
+----+-----+---+
In this example, we use the toDF method with a list that contains an empty string for each existing column, produced by mapping over df.columns. This creates a new DataFrame with the same data as the original but with blank column names.
4. Conclusion
Removing header or column names from a Spark DataFrame may not be a common operation, as it can make it difficult to work with the data. In practice, you may want to keep at least some descriptive information about the columns. However, sometimes you would be required to remove the header column names while writing to a CSV file.