Spark Rename Multiple Columns Examples

How do you rename multiple columns in a Spark DataFrame? In an Apache Spark DataFrame, a column represents a named expression that produces a value of a specific data type. You can think of a column as a logical representation of a data field in a table.

In this article, we shall discuss how to rename multiple columns or all columns, with examples. So, let’s first create a Spark DataFrame with a few columns and use this DataFrame to rename multiple columns.


// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark:SparkSession = SparkSession.builder()
    .master("local[1]").appName("SparkByExamples.com")
    .getOrCreate()

// Create DataFrame
import spark.implicits._
val data = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
val df = data.toDF("id", "name", "age")

// Show DataFrame
df.show()

This yields the output below.


// Output:
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Jane| 25|
|  3| Jim| 30|
+---+----+---+

1. Spark Rename Multiple Columns

To rename multiple columns in Spark, you can use the withColumnRenamed() method of the DataFrame. This method takes the old column name and the new column name as arguments and returns a new DataFrame with that column renamed; if the old name does not exist, the DataFrame is returned unchanged. To rename multiple columns, chain the calls as shown below.


// Rename multiple columns
val df2 = df.withColumnRenamed("id","student_id")
            .withColumnRenamed("name","student_name")

// Show DataFrame
df2.show()

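This example yields the output below. Note that the column id was renamed to student_id and name was renamed to student_name.


// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+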
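Because withColumnRenamed() does nothing when the old column name is missing, a typo in a chained rename goes unnoticed. The following is a minimal sketch of one way to guard against that. The helper renameStrict is hypothetical (it is not part of Spark), and it compares names case-sensitively, which is an assumption that may not match your session's case-sensitivity setting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
    .master("local[1]").appName("RenameStrictSketch")
    .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

// Hypothetical helper: fail fast when the old column name is not present
def renameStrict(input: DataFrame, oldName: String, newName: String): DataFrame = {
  require(input.columns.contains(oldName),
    s"Column '$oldName' not found in: ${input.columns.mkString(", ")}")
  input.withColumnRenamed(oldName, newName)
}

// Typo in the old name: plain withColumnRenamed leaves the DataFrame as it was
val unchanged = df.withColumnRenamed("nmae", "student_name")

// The guarded version renames when the column exists and throws otherwise
val renamed = renameStrict(df, "name", "student_name")
```

Calling renameStrict(df, "nmae", "student_name") would throw an IllegalArgumentException instead of silently returning df.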

2. Rename Multiple Columns Using a Map

If you have many columns to rename, chaining withColumnRenamed() calls becomes hard to read. Alternatively, you can create a Map with old and new column names as pairs and apply the renames by iterating over it.


// Map of old column names to new column names
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")

// Rename multiple columns
val renamedDF = columnsToRename.foldLeft(df){
    case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
 
// Show DataFrame
renamedDF.show()

In this example,

  • We define a map called columnsToRename, where the keys represent the old column names and the values represent the new column names.
  • We then use the foldLeft operation to iterate over the columnsToRename map and rename the columns one by one.
  • The withColumnRenamed function is used to rename each column in the tempDF DataFrame.
  • Finally, we assign the renamed DataFrame to a new variable renamedDF and display it using the show function.
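To see how foldLeft threads the intermediate result through each rename, here is a small sketch that needs no Spark at all. It uses a plain Vector of column names as a stand-in for the DataFrame, and renameOne is a hypothetical helper that mimics withColumnRenamed() on the names only.

```scala
// Stand-in for a DataFrame schema: just the ordered column names
val columns = Vector("id", "name", "age")

// Same mapping as in the example above
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")

// Hypothetical helper: replace oldName with newName, leave other names alone
def renameOne(cols: Vector[String], oldName: String, newName: String): Vector[String] =
  cols.map(c => if (c == oldName) newName else c)

// foldLeft starts from `columns` and feeds each intermediate result
// into the next rename, exactly like tempDF in the Spark version
val renamedColumns = columnsToRename.foldLeft(columns) {
  case (acc, (oldName, newName)) => renameOne(acc, oldName, newName)
}

println(renamedColumns.mkString(", ")) // student_id, student_name, age
```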

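renamedDF.show() prints the following output:


// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+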
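If you are on a recent Spark release, the same Map can be passed to the DataFrame in a single call. This is a sketch that assumes Spark 3.4.0 or later, where withColumnsRenamed() (note the plural) is available; on older versions the foldLeft approach above is the way to go.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[1]").appName("WithColumnsRenamedSketch")
    .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

// Map of old column names to new column names
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")

// Assumption: Spark 3.4.0+, where Dataset.withColumnsRenamed accepts a Map
val renamedAtOnce = df.withColumnsRenamed(columnsToRename)
renamedAtOnce.show()
```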

3. Rename All Columns from the List

If you want to rename all columns, you can do so by creating a list with the new column names, in order, and passing it as arguments to the toDF() function.


// List with new column names 
val newColumnNames = Seq("new_id", "new_name", "new_age")

// Rename all columns
val df3 = df.toDF(newColumnNames:_*)
df3.show()

Alternatively, you can rename the columns one by one with foldLeft, as in the example below.


// List with new column names 
val newColumnNames = Seq("new_id", "new_name", "new_age")

// Rename all columns by position
val df4 = newColumnNames.foldLeft(df)((tempDF, newName) =>
      tempDF.withColumnRenamed(tempDF.columns(newColumnNames.indexOf(newName)), newName))
df4.show()

In this example,

  • We define a list called newColumnNames, which contains the new column names in the order we want them to appear in the DataFrame.
  • We then use the foldLeft operation to iterate over the newColumnNames list and rename the columns one by one.
  • The withColumnRenamed function is used to rename each column in the tempDF DataFrame.
  • We use the columns function to get an array of the current column names, and indexOf to find the position of the new name in newColumnNames; the column at that same position in the array is the one that gets renamed.
  • Finally, we assign the renamed DataFrame to a new variable df4 and display it using the show function.

Both approaches produce the following output:


// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
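The indexOf lookup above can be avoided by pairing old and new names up front. The sketch below zips df.columns with the new names and adds a length check; the variable name dfZipped and the error message are illustrative choices, not something from the original example. Note that sequential renames like this assume no new name collides with a not-yet-renamed old name (for example when swapping two names); toDF() does not have that limitation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[1]").appName("ZipRenameSketch")
    .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")
val newColumnNames = Seq("new_id", "new_name", "new_age")

// Fail early if the list of new names does not cover every column
require(df.columns.length == newColumnNames.length,
  s"Expected ${df.columns.length} names but got ${newColumnNames.length}")

// Pair each current name with its replacement by position, then fold
val dfZipped = df.columns.zip(newColumnNames).foldLeft(df) {
  case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
dfZipped.show()
```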

4. Using a for loop and dynamic column names

Finally, you can build the new column names dynamically and use a for loop with withColumnRenamed() to rename the columns one at a time.


// Get old column names
val oldColumnNames = df.columns

// New column names: each old name with the prefix "new_"
val newColumnNames = oldColumnNames.map(name => s"new_$name")

// df is a val and cannot be reassigned, so start from a var copy
var df5 = df

// Use for loop to rename
for (i <- oldColumnNames.indices) {
  df5 = df5.withColumnRenamed(oldColumnNames(i), newColumnNames(i))
}

// Show DataFrame
df5.show()

In this example,

  • We start from the DataFrame df with columns “id”, “name”, and “age”.
  • We then define an array oldColumnNames that contains the current column names of df.
  • We then use the map function to create a new array newColumnNames that contains the new column names, where each name is the old name with the prefix “new_” added to it.
  • We then use a for loop to iterate over the oldColumnNames array and rename each column using the withColumnRenamed function.
  • The withColumnRenamed function takes two arguments: the old column name and the new column name.
  • Finally, we display the renamed DataFrame using the show function.

The output of the code above should be:


// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
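The same prefixing can also be written without a mutable variable or a loop. As a sketch, the version below builds one aliased column per existing column and applies them all in a single select; the name prefixed is just an illustrative choice, and col(c) assumes column names without dots or other special characters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
    .master("local[1]").appName("SelectAliasSketch")
    .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

// Alias every column with the "new_" prefix and select them in one projection
val prefixed = df.select(df.columns.map(c => col(c).as(s"new_$c")): _*)
prefixed.show()
```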

5. Other Spark Column Operations

In Spark, a column refers to a logical data structure representing a named expression that produces a value for each record in a DataFrame. Columns are the building blocks for constructing DataFrame transformations and manipulations in Spark.

To work with columns in Spark Scala, you can use the org.apache.spark.sql.functions object. It provides many built-in functions for manipulating and transforming columns in a DataFrame.

Here are some common operations you can perform on columns in Spark Scala:

  1. Selecting Columns: To select one or more columns from a DataFrame, you can use the select function. For example, to select columns col1 and col2 from a DataFrame df, you can write df.select("col1", "col2").
  2. Filtering Rows: To filter rows based on a condition, you can use the filter() or where() function. For example, to filter rows where the value in the col1 column is greater than 10, you can write df.filter(col("col1") > 10).
  3. Adding Columns: To add a new column to a DataFrame, you can use the withColumn() function. For example, to add a new column new_col that is the sum of col1 and col2, you can write df.withColumn("new_col", col("col1") + col("col2")).
  4. Renaming Columns: To rename a column in a DataFrame, you can use the withColumnRenamed() function. For example, to rename a column col1 to new_col1, you can write df.withColumnRenamed("col1", "new_col1").
  5. Aggregating Data: To aggregate data based on one or more columns, you can use the groupBy() function. For example, to group data by col1 column and compute the sum of the col2 column for each group, you can write df.groupBy("col1").agg(sum("col2")).
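The operations listed above can be tried together on the sample data from this article. This is only a sketch: the column names age_next_year and total_age and the threshold 21 are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder()
    .master("local[1]").appName("ColumnOperationsSketch")
    .getOrCreate()
import spark.implicits._

val df = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30)).toDF("id", "name", "age")

// 1. Selecting columns
val selected = df.select("id", "name")

// 2. Filtering rows (keeps Jane and Jim)
val olderThan21 = df.filter(col("age") > 21)

// 3. Adding a column derived from an existing one
val withNextYear = df.withColumn("age_next_year", col("age") + 1)

// 4. Renaming a column
val renamedOne = df.withColumnRenamed("name", "student_name")

// 5. Aggregating: sum of age per name
val totals = df.groupBy("name").agg(sum("age").as("total_age"))
totals.show()
```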

These are just a few examples of what you can do with columns in Spark Scala. The org.apache.spark.sql.functions object provides many more functions for manipulating and transforming columns, so it’s worth exploring the documentation to learn more.

6. Conclusion

In this article, you have learned different ways of renaming multiple columns in Spark. Some approaches explicitly specify the old and new name of each column with withColumnRenamed(), either chained directly or driven by a Map or a loop, while toDF() takes only the list of new names and renames all columns by position. Which one to use depends on whether you need to rename a few columns or all of them.

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.