The select() or selectExpr() transformations can be used to rearrange or change the column position in a Spark DataFrame. In Apache Spark, a DataFrame is a distributed collection of data organized into named columns. To access a specific column in a Spark DataFrame, you can use the col() function or the $ operator. Sometimes you need to change the order of the columns in a DataFrame, for example right before you write it to a CSV file.

In this article, I will explain different ways to change the column position in a Spark DataFrame. Let’s first create a sample DataFrame.

Note that a DataFrame is immutable and cannot be changed in place; hence, the methods explained here create a new DataFrame with the columns in the desired order.

1. Creating Sample DataFrame

To create a DataFrame in Spark Scala with order details, you can use the createDataFrame() function from the SparkSession object. Here’s an example:


// Imports
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Define the schema for the orders DataFrame
val schema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("customer_id", IntegerType),
  StructField("product_name", StringType),
  StructField("price", DoubleType),
  StructField("quantity", IntegerType)
))

// Create a list of Row objects with order details
val orders = Seq(
  Row(1, 100, "Product A", 10.0, 2),
  Row(2, 200, "Product B", 20.0, 1),
  Row(3, 100, "Product C", 30.0, 3),
  Row(4, 300, "Product D", 15.0, 2)
)

// Create a SparkSession object
val spark = SparkSession.builder()
                        .appName("sparkbyexamples.com")
                        .master("local[*]")
                        .getOrCreate()

// Create a DataFrame from the schema and the list of Row objects
val rdd = spark.sparkContext.parallelize(orders)
val ordersDF = spark.createDataFrame(rdd, schema)

// Display the orders DataFrame
ordersDF.show()

Here, we created the orders DataFrame, which consists of five columns: order_id, customer_id, product_name, price, and quantity. The output of the DataFrame looks like this:


// Output:
+--------+-----------+------------+-----+--------+
|order_id|customer_id|product_name|price|quantity|
+--------+-----------+------------+-----+--------+
|       1|        100|   Product A| 10.0|       2|
|       2|        200|   Product B| 20.0|       1|
|       3|        100|   Product C| 30.0|       3|
|       4|        300|   Product D| 15.0|       2|
+--------+-----------+------------+-----+--------+
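Before reordering anything, note that individual columns of this DataFrame can be referenced in two equivalent ways, as mentioned in the introduction: with the col() function or with the $ operator (the latter requires import spark.implicits._ to be in scope). A quick sketch:

```scala
// Two equivalent ways to reference a single column
import org.apache.spark.sql.functions.col
import spark.implicits._ // enables the $"colName" syntax

ordersDF.select(col("price")).show() // using the col() function
ordersDF.select($"price").show()     // using the $ operator
```

Both expressions produce the same single-column result; which one you use is a matter of style.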

Changing the column position of a Spark DataFrame in Scala can be done in several ways. For example, you can use the select() method with the names of the columns in a specific order. Let’s use the orders DataFrame that we created above and try out different ways to change the positions. Here are some examples:

2. Using select() function to change column position in Spark

In Spark, the select() function is used to select specific columns from a DataFrame. We can use it to reorder or change the position of the columns in a DataFrame, simply by selecting the columns in the order we want.


// Imports
import org.apache.spark.sql.functions.col

// Define the new order of columns
val newOrder = Seq("product_name","price","quantity", 
                  "order_id","customer_id")  

// Select columns in the new order
val dfNewOrder = ordersDF.select(newOrder.map(c => col(c)): _*) 

// Display the orders DataFrame with re-ordered columns
dfNewOrder.show()

In this example, we define the new order of columns as a sequence of strings. Then, we use the select() function to select the columns in that order on ordersDF, passing each column name to the col() function and unpacking the resulting sequence with the : _* operator. Finally, we get a new DataFrame with the columns in the order stated in the newOrder list.

The result after changing the column positions of the Spark DataFrame looks like this:


// Output:
+------------+-----+--------+--------+-----------+
|product_name|price|quantity|order_id|customer_id|
+------------+-----+--------+--------+-----------+
|   Product A| 10.0|       2|       1|        100|
|   Product B| 20.0|       1|       2|        200|
|   Product C| 30.0|       3|       3|        100|
|   Product D| 15.0|       2|       4|        300|
+------------+-----+--------+--------+-----------+
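If you only need to move a single column to the front, you don’t have to type every column name; you can build the column list dynamically from df.columns. A small sketch (the choice of quantity and the firstCol name are mine for illustration):

```scala
// Move a single column ("quantity") to the front without listing every column
import org.apache.spark.sql.functions.col

val firstCol = "quantity" // column to move to the front
val dfQuantityFirst =
  ordersDF.select((firstCol +: ordersDF.columns.filter(_ != firstCol)).map(col): _*)
dfQuantityFirst.show()
```

This keeps the remaining columns in their original relative order, which is handy on wide DataFrames.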

3. Using selectExpr() function to change column position in Spark

Alternatively, you can also use the selectExpr() function to change the order of the columns in a DataFrame. This function selects one or more columns (or SQL expressions) from a DataFrame in a specific order.


// Using selectExpr and rename column quantity into product_quantity
val dfNewOrder = ordersDF.selectExpr("product_name", 
                    "price", "quantity", 
                    "order_id", "customer_id")
          .withColumnRenamed("quantity", "product_quantity")

// Display the orders DataFrame with re-ordered columns
dfNewOrder.show()

In this example, we use the selectExpr() function to select the columns in the desired order, by passing the column names as strings. Then, we use the withColumnRenamed() function to rename the column “quantity” to “product_quantity”.

Though withColumnRenamed() has nothing to do with changing column position, we used it here to rename the column.

The result after changing the column positions of the Spark DataFrame looks like this:


// Output:
+------------+-----+----------------+--------+-----------+
|product_name|price|product_quantity|order_id|customer_id|
+------------+-----+----------------+--------+-----------+
|   Product A| 10.0|               2|       1|        100|
|   Product B| 20.0|               1|       2|        200|
|   Product C| 30.0|               3|       3|        100|
|   Product D| 15.0|               2|       4|        300|
+------------+-----+----------------+--------+-----------+
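Because selectExpr() accepts full SQL expressions and not just column names, you can reorder columns and derive new ones in the same call. The total column below is an illustrative addition of mine, not part of the original data:

```scala
// Reorder columns and compute a derived column in the same selectExpr() call
val dfWithTotal = ordersDF.selectExpr(
  "product_name",
  "price * quantity AS total", // a SQL expression, not just a column name
  "order_id",
  "customer_id")
dfWithTotal.show()
```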

4. Using select() and alias() functions

You can achieve the same result with the select() and alias() functions, reordering the columns and renaming one of them in a single statement.


// Using select and alias to rename column quantity into product_quantity
val dfNewOrder = ordersDF.select(col("product_name"), col("price"),
                   col("quantity").alias("product_quantity"),
                   col("order_id"), col("customer_id"))

// display the orders DataFrame with re-ordered columns
dfNewOrder.show()

In this example, we use the select() function to select the columns in the desired order, by passing each column to the col() function. Then, we use the alias() function to rename the column “quantity” to “product_quantity” as part of the same select().

The result after changing the column positions of the Spark DataFrame looks like this:


// Output
+------------+-----+----------------+--------+-----------+
|product_name|price|product_quantity|order_id|customer_id|
+------------+-----+----------------+--------+-----------+
|   Product A| 10.0|               2|       1|        100|
|   Product B| 20.0|               1|       2|        200|
|   Product C| 30.0|               3|       3|        100|
|   Product D| 15.0|               2|       4|        300|
+------------+-----+----------------+--------+-----------+

5. Rearrange the columns in ascending order

Sometimes you may need to rearrange the columns in ascending (alphabetical) order. You can achieve this by combining Spark’s select() with Scala’s sorted method, which sorts the elements of a collection.


// Rearrange columns in ascending (alphabetical) order
val sortedColDF = ordersDF.select(ordersDF.columns.sorted.map(col): _*)
sortedColDF.show()

6. Rearrange columns in reverse order

Similarly, you can reverse the column order of the DataFrame by using the Scala map() and reverse functions.


// Change columns to reverse order
val cols = ordersDF.columns.map(ordersDF(_)).reverse
val reversedColDF = ordersDF.select(cols: _*)
reversedColDF.show()
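Building on the same idea, you can move a column to an arbitrary position by splitting the remaining column names at the target index. A sketch, where the column customer_id and the index 2 are my own choices for illustration:

```scala
// Move "customer_id" to index 2 (0-based) by splitting the remaining names
import org.apache.spark.sql.functions.col

val colName = "customer_id" // column to move
val rest = ordersDF.columns.filter(_ != colName)
val (before, after) = rest.splitAt(2) // target position
val dfInserted = ordersDF.select((before ++ Seq(colName) ++ after).map(col): _*)
dfInserted.show()
```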

7. Conclusion

In summary, we can change the column position of a Spark DataFrame using the select() and selectExpr() functions. To move a column to a specific position, we can use the select() function with the columns in the desired order as input. With this approach, you can change one or multiple column positions at a time.

Alternatively, we can use the selectExpr() function to select the desired columns and apply SQL expressions to them when needed.


rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.