PySpark Column alias after groupBy() Example

Problem: In PySpark, I would like to give an alias to (rename) a DataFrame column after a groupBy() aggregation. I have the following DataFrame and have performed a group by operation, but I don't see an option to rename the aggregated column. By default, the result column is named after the aggregate function and the column it was applied to, for example sum(salary).


from pyspark.sql import SparkSession
spark = SparkSession.builder \
   .appName('SparkByExamples.com') \
   .getOrCreate()
simpleData = [("James","Sales","NY",90000,34,10000),
  ("Michael","Sales","NV",86000,56,20000),
  ("Robert","Sales","CA",81000,30,23000),
  ("Maria","Finance","CA",90000,24,23000),
  ("Raman","Finance","DE",99000,40,24000),
  ("Scott","Finance","NY",83000,36,19000),
  ("Jen","Finance","NY",79000,53,15000),
  ("Jeff","Marketing","NV",80000,25,18000),
  ("Kumar","Marketing","NJ",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.groupBy("state").sum("salary").show()

This outputs


+-----+-----------+
|state|sum(salary)|
+-----+-----------+
|   NJ|      91000|
|   NV|     166000|
|   CA|     171000|
|   DE|      99000|
|   NY|     252000|
+-----+-----------+

I would like to rename sum(salary) to sum_salary.

Solution – PySpark Column alias after groupBy()

In PySpark, the approach used above doesn't offer an option to rename/alias a column after the groupBy() aggregation, but there are several other ways to give a column alias to a groupBy() agg column. Let's look at them with examples (the same approaches work for Spark with Scala). Use the one that fits your need.

Use sum() Function and alias()

Use the sum() SQL function to perform the summary aggregation; it returns a Column type, and you can call alias() on that Column to rename the resulting DataFrame column. alias() takes a string argument with the column name you want. The example below renames the column to sum_salary.


# Note: this import shadows Python's built-in sum() in the current scope
from pyspark.sql.functions import sum
df.groupBy("state") \
  .agg(sum("salary").alias("sum_salary")) \
  .show()

Use withColumnRenamed() to Rename groupBy()

Another good approach is to use the PySpark DataFrame withColumnRenamed() operation to alias/rename a column of the groupBy() result. Pass the existing column name as the first argument and the column name you want as the second.


df.groupBy("state") \
  .sum("salary") \
  .withColumnRenamed("sum(salary)", "sum_salary") \
  .show()

Use select() Transformation

You can also use select(). One disadvantage of this approach is that you must list every column you want on the resulting DataFrame; if the new DataFrame has many columns, this quickly becomes tedious.


from pyspark.sql.functions import col
df.groupBy("state") \
  .sum("salary") \
  .select(col("state"),col("sum(salary)").alias("sum_salary")) \
  .show()

Use SQL Expression for groupBy()

Another approach is to use Spark SQL after creating a temporary view; with this you can give an alias to the groupBy() aggregation column using a standard SQL expression.


df.createOrReplaceTempView("EMP")
spark.sql("select state, sum(salary) as sum_salary from EMP " +
          "group by state").show()

Conclusion

In this article, I have covered how to rename/alias a groupBy() aggregation column using the Column alias() function, withColumnRenamed(), select(), and finally a Spark SQL expression.

Happy Learning !!
