How do you rename multiple columns in a Spark DataFrame? In an Apache Spark DataFrame, a column represents a named expression that produces a value of a specific data type. You can think of a column as a logical representation of a data field in a table.
In this article, we shall discuss how to rename multiple columns or all columns, with examples. So, let’s first create a Spark DataFrame with a few columns and use this DataFrame to rename multiple columns.
// Import
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark:SparkSession = SparkSession.builder()
.master("local[1]").appName("SparkByExamples.com")
.getOrCreate()
// Create DataFrame
import spark.implicits._
val data = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
val df = data.toDF("id", "name", "age")
// Show DataFrame
df.show()
Yields the below output.
// Output:
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Jane| 25|
|  3| Jim| 30|
+---+----+---+
1. Spark Rename Multiple Columns
To rename multiple columns in Spark, you can use the withColumnRenamed() method on the DataFrame. This method takes the old column name and the new column name as arguments and returns a new DataFrame with the column renamed, so to rename multiple columns you can chain calls as shown below.
// Rename multiple columns
val df2 = df.withColumnRenamed("id","student_id")
.withColumnRenamed("name","student_name")
// Show DataFrame
df2.show()
This example yields the below output. Note that the column id was renamed to student_id and name was renamed to student_name.
// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+
2. Rename Multiple Column Names from a Map
If you have many columns to rename, chaining withColumnRenamed() doesn’t read well; alternatively, you can rename multiple columns by creating a Map object with old and new column names as pairs.
// map() with old -> new column names to rename
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")
// Rename multiple columns
val renamedDF = columnsToRename.foldLeft(df) {
  case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
// Show DataFrame
renamedDF.show()
In this example,
- We define a Map called columnsToRename, where the keys represent the old column names and the values represent the new column names.
- We then use the foldLeft operation to iterate over the columnsToRename map and rename the columns one by one.
- The withColumnRenamed function is used to rename each column in the tempDF DataFrame.
- Finally, we assign the renamed DataFrame to a new variable renamedDF and display it using the show function.
The output of the code for renaming multiple columns above should be:
// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+
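If you are on Spark 3.4 or later, the DataFrame API also provides withColumnsRenamed(), which accepts the whole Map in a single call, so the foldLeft is no longer necessary. A minimal sketch, assuming the df and columnsToRename values defined above:

```scala
// Spark 3.4+ only: rename several columns in one call by passing a Map
// of old name -> new name (names that don't exist are silently ignored)
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")
val renamedDF2 = df.withColumnsRenamed(columnsToRename)
renamedDF2.show()
```

On earlier Spark versions, the foldLeft approach above remains the idiomatic way to apply a Map of renames.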
3. Rename All Columns from the List
If you want to rename all columns, you can easily do so by creating a list of new column names and passing it as an argument to the toDF() function.
// List with new column names
val newColumnNames = Seq("new_id", "new_name", "new_age")
// Rename all columns
val df3 = df.toDF(newColumnNames:_*)
df3.show()
Alternatively, you can use foldLeft as in the below example.
// List with new column names
val newColumnNames = Seq("new_id", "new_name", "new_age")
// Rename all columns
val df4 = newColumnNames.foldLeft(df)((tempDF, newName) =>
  tempDF.withColumnRenamed(tempDF.columns(newColumnNames.indexOf(newName)), newName))
df4.show()
In this example,
- We define a list called newColumnNames, which contains the new column names in the order we want them to appear in the DataFrame.
- We then use the foldLeft operation to iterate over the newColumnNames list and rename the columns one by one.
- The withColumnRenamed function is used to rename each column in the tempDF DataFrame.
- We use the columns function to get an array of the current column names and the indexOf function to find the position of the old column name in the array.
- Finally, we assign the renamed DataFrame to a new variable df4 and display it using the show function.
The output of the code for the Spark Rename multiple columns above should be:
// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
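A slightly safer variant of the foldLeft above, sketched here as an addition and assuming the same df, zips the old and new names into pairs instead of calling indexOf inside the fold; this avoids the linear lookup and any ambiguity if two new names were equal:

```scala
// Pair each existing column with its new name, then fold the renames
val newColumnNames = Seq("new_id", "new_name", "new_age")
val df5 = df.columns.zip(newColumnNames).foldLeft(df) {
  case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
df5.show()
```

Note that zip simply truncates to the shorter of the two sequences, so the list of new names should have one entry per column.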
4. Using a for loop and dynamic column names
Finally, you can also build the new column names dynamically and use a for loop with withColumnRenamed() to rename the columns one by one.
// Get old column names
val oldColumnNames = df.columns
// New column names: each old name with a "new_" prefix
val newColumnNames = oldColumnNames.map(name => s"new_$name")
// df is a val, so accumulate the renames in a var
var renamedDf = df
for (i <- 0 until oldColumnNames.length) {
  renamedDf = renamedDf.withColumnRenamed(oldColumnNames(i), newColumnNames(i))
}
// Show DataFrame
renamedDf.show()
In this example,
- We define the DataFrame df with columns “id”, “name”, and “age”.
- We then define an array oldColumnNames that contains the current column names of df.
- We then use the map function to create a new array newColumnNames that contains the new column names, where each name is the old name with the prefix “new_” added to it.
- We then use a for loop to iterate over the oldColumnNames array and rename each column using the withColumnRenamed function.
- The withColumnRenamed function takes two arguments: the old column name and the new column name.
- Finally, we display the renamed DataFrame using the show function.
The output of the code above should be:
// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
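Another common idiom for the same dynamic rename, sketched here as an alternative to the loop, is a single select with aliased columns; this produces the result in one transformation instead of one withColumnRenamed call per column (it assumes the same df as above):

```scala
import org.apache.spark.sql.functions.col

// Alias every column with a "new_" prefix in a single select
val prefixedDF = df.select(df.columns.map(c => col(c).as(s"new_$c")): _*)
prefixedDF.show()
```

The `: _*` splat passes the mapped array of Column expressions to select as varargs.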
5. Other Spark Column Operations
In Spark, a column refers to a logical data structure representing a named expression that produces a value for each record in a DataFrame. Columns are the building blocks for constructing DataFrame transformations and manipulations in Spark.
To work with columns in Spark Scala, you can use the org.apache.spark.sql.functions package. This package provides many built-in functions for manipulating and transforming columns in a DataFrame.
Here are some common operations you can perform on columns in Spark Scala:
- Selecting Columns: To select one or more columns from a DataFrame, you can use the select function. For example, to select columns col1 and col2 from a DataFrame df, you can write df.select("col1", "col2").
- Filtering Rows: To filter rows based on a condition, you can use the filter() or where() function. For example, to filter rows where the value in the col1 column is greater than 10, you can write df.filter(col("col1") > 10).
- Adding Columns: To add a new column to a DataFrame, you can use the withColumn() function. For example, to add a new column new_col that is the sum of col1 and col2, you can write df.withColumn("new_col", col("col1") + col("col2")).
- Renaming Columns: To rename a column in a DataFrame, you can use the withColumnRenamed() function. For example, to rename a column col1 to new_col1, you can write df.withColumnRenamed("col1", "new_col1").
- Aggregating Data: To aggregate data based on one or more columns, you can use the groupBy() function. For example, to group data by the col1 column and compute the sum of the col2 column for each group, you can write df.groupBy("col1").agg(sum("col2")).
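Putting a few of these operations together, the sketch below assumes a small DataFrame with numeric columns named col1 and col2 (these names are illustrative, not from a real dataset) and the spark.implicits._ import from the setup above:

```scala
import org.apache.spark.sql.functions.{col, sum}

// Illustrative DataFrame using the col1/col2 names from the list above
val opsDF = Seq((1, 10), (1, 20), (2, 30)).toDF("col1", "col2")

opsDF.select("col1", "col2").show()                           // select columns
opsDF.filter(col("col1") > 1).show()                          // filter rows
opsDF.withColumn("new_col", col("col1") + col("col2")).show() // add a column
opsDF.withColumnRenamed("col1", "new_col1").show()            // rename a column
opsDF.groupBy("col1").agg(sum("col2")).show()                 // aggregate per group
```

Each call returns a new DataFrame; the original opsDF is never modified in place.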
These are just a few examples of what you can do with columns in Spark Scala. The org.apache.spark.sql.functions
package provides many more functions for manipulating and transforming columns, so it’s worth exploring the documentation to learn more.
6. Conclusion
In this article, you have learned different ways of renaming multiple columns in Spark. Some approaches explicitly specify the new name for each column using the withColumnRenamed() function, while others pass a list of new column names to the toDF() method. Which one to use depends on your requirements.
Related Articles
- Spark Merge Two DataFrames with Different Columns or Schema
- Spark withColumnRenamed to Rename Column
- Spark RDD fold() function example
- Spark map() vs flatMap() with Examples
- Spark Internal Execution plan
- Get Other Columns when using GroupBy or Select All Columns with the GroupBy?
- Spark cannot resolve given input columns