
Spark Rename Multiple Columns Examples


How do you rename multiple columns in a Spark DataFrame? In an Apache Spark DataFrame, a column represents a named expression that produces a value of a specific data type. You can think of a column as a logical representation of a data field in a table.

In this article, we shall discuss how to rename multiple columns, or all columns, with examples. So, let’s first create a Spark DataFrame with a few columns and use this DataFrame to rename multiple columns.


// Import
import org.apache.spark.sql.SparkSession

// Create SparkSession
val spark:SparkSession = SparkSession.builder()
    .master("local[1]").appName("SparkByExamples.com")
    .getOrCreate()

// Create DataFrame
import spark.implicits._
val data = Seq((1, "John", 20), (2, "Jane", 25), (3, "Jim", 30))
val df = data.toDF("id", "name", "age")

// Show DataFrame
df.show()

Yields the below output.


// Output:
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Jane| 25|
|  3| Jim| 30|
+---+----+---+

1. Spark Rename Multiple Columns

To rename multiple columns in Spark, you can use the withColumnRenamed() method on the DataFrame. This method takes the old column name and the new column name as arguments and returns a new DataFrame with the column renamed, so to rename multiple columns you can chain this method as shown below.


// Rename multiple columns
val df2 = df.withColumnRenamed("id","student_id")
            .withColumnRenamed("name","student_name")

// Show DataFrame
df2.show()

This example yields the below output. Note that the column id was renamed to student_id and name was renamed to student_name.


// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+

2. Rename Multiple Columns Using a Map

If you have many columns to rename, chaining withColumnRenamed() quickly becomes verbose. Alternatively, you can rename multiple columns by creating a Map with old and new column names as pairs and folding over it with foldLeft().


// Map with old and new column names as pairs
val columnsToRename = Map("id" -> "student_id", "name" -> "student_name")

// Rename multiple columns
val renamedDF = columnsToRename.foldLeft(df){
    case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}
 
// Show DataFrame
renamedDF.show()

In this example, foldLeft() starts with the original DataFrame and applies withColumnRenamed() once for every old-to-new name pair in the Map.

The output of the code for renaming multiple columns above should be:


// Output:
+----------+------------+---+
|student_id|student_name|age|
+----------+------------+---+
|         1|        John| 20|
|         2|        Jane| 25|
|         3|         Jim| 30|
+----------+------------+---+
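
If you rename columns this way in several places, you can wrap the pattern in a small helper function. The sketch below is not part of the original example; the renameColumns name and the inputDF parameter are illustrative choices that simply reuse the foldLeft() pattern shown above.


// Import the DataFrame type
import org.apache.spark.sql.DataFrame

// Hypothetical helper that wraps the foldLeft() rename pattern
def renameColumns(inputDF: DataFrame, renames: Map[String, String]): DataFrame =
  renames.foldLeft(inputDF) {
    case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
  }

// Usage with the same DataFrame and Map as above
val renamedDF2 = renameColumns(df, Map("id" -> "student_id", "name" -> "student_name"))
renamedDF2.show()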

3. Rename All Columns from a List

If you want to rename all columns, you can easily do so by creating a list with the new column names and passing it as an argument to the toDF() function.


// List with new column names 
val newColumnNames = Seq("new_id", "new_name", "new_age")

// Rename all columns
val df3 = df.toDF(newColumnNames:_*)
df3.show()

Alternatively, you can achieve the same result with foldLeft(), as shown in the example below.


// List with new column names 
val newColumnNames = Seq("new_id", "new_name", "new_age")

// Rename all columns by pairing each existing name with its new name
val df4 = df.columns.zip(newColumnNames).foldLeft(df) {
  case (tempDF, (oldName, newName)) => tempDF.withColumnRenamed(oldName, newName)
}

// Show DataFrame
df4.show()

In this example, foldLeft() pairs each existing column name with its new name and renames the columns one at a time.

The output of the code for renaming all columns above should be:


// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+
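
As a related sketch that is not shown in the original examples, you can combine the Map-based approach from section 2 with toDF() by mapping every existing column name through the rename map; any column that is not in the map keeps its current name. The renameMap and dfMapped names here are illustrative only.


// Hypothetical variant: rename only the columns present in the map
val renameMap = Map("id" -> "student_id", "name" -> "student_name")
val dfMapped = df.toDF(df.columns.map(c => renameMap.getOrElse(c, c)): _*)
dfMapped.show()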

4. Using a for loop and dynamic column names

Finally, you can also build the new column names dynamically, iterate over them with a for loop, and use withColumnRenamed() to rename the columns one by one.


// Get old column names
val oldColumnNames = df.columns

// New column names
val newColumnNames = oldColumnNames.map(name => s"new_$name")

// Use a for loop to rename each column
// (withColumnRenamed() returns a new DataFrame, so keep the result in a var)
var dfRenamed = df
for (i <- 0 until oldColumnNames.length) {
  dfRenamed = dfRenamed.withColumnRenamed(oldColumnNames(i), newColumnNames(i))
}

// Show DataFrame
dfRenamed.show()

In this example, each existing column name is prefixed with new_ to build its new name, and the loop calls withColumnRenamed() for every pair.

The output of the code above should be:


// Output:
+------+--------+-------+
|new_id|new_name|new_age|
+------+--------+-------+
|     1|    John|     20|
|     2|    Jane|     25|
|     3|     Jim|     30|
+------+--------+-------+

5. Other Spark Column Operations

In Spark, a column refers to a logical data structure representing a named expression that produces a value for each record in a DataFrame. Columns are the building blocks for constructing DataFrame transformations and manipulations in Spark.

To work with columns in Spark Scala, you can use the org.apache.spark.sql.functions package. This package provides many built-in functions for manipulating and transforming columns in a DataFrame.

Here are some common operations you can perform on columns in Spark Scala; a short combined sketch follows the list:

  1. Selecting Columns: To select one or more columns from a DataFrame, you can use the select function. For example, to select columns col1 and col2 from a DataFrame df, you can write df.select("col1", "col2").
  2. Filtering Rows: To filter rows based on a condition, you can use the filter() or where() function. For example, to filter rows where the value in the col1 column is greater than 10, you can write df.filter(col("col1") > 10).
  3. Adding Columns: To add a new column to a DataFrame, you can use the withColumn() function. For example, to add a new column new_col that is the sum of col1 and col2, you can write df.withColumn("new_col", col("col1") + col("col2")).
  4. Renaming Columns: To rename a column in a DataFrame, you can use the withColumnRenamed() function. For example, to rename a column col1 to new_col1, you can write df.withColumnRenamed("col1", "new_col1").
  5. Aggregating Data: To aggregate data based on one or more columns, you can use the groupBy() function. For example, to group data by col1 column and compute the sum of the col2 column for each group, you can write df.groupBy("col1").agg(sum("col2")).
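
The following is a minimal sketch that ties these operations together using the df DataFrame created earlier (columns id, name, and age); the derived column name age_plus_one and the choice of grouping are illustrative only.


// Import built-in column functions
import org.apache.spark.sql.functions.{col, sum}

// Selecting columns
df.select("id", "name").show()

// Filtering rows where age is greater than 21
df.filter(col("age") > 21).show()

// Adding a new column derived from an existing one
df.withColumn("age_plus_one", col("age") + 1).show()

// Renaming a column
df.withColumnRenamed("name", "student_name").show()

// Aggregating: sum of age per name
df.groupBy("name").agg(sum("age")).show()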

These are just a few examples of what you can do with columns in Spark Scala. The org.apache.spark.sql.functions package provides many more functions for manipulating and transforming columns, so it’s worth exploring the documentation to learn more.

6. Conclusion

In this article, you have learned different ways of renaming multiple columns in Spark. Some approaches explicitly specify the new name for each column using the withColumnRenamed() function, while others pass a list of new column names to the toDF() method. Which approach to use depends on your requirements.

