PySpark Column alias after groupBy() Example


Problem: In PySpark, I would like to give a DataFrame column an alias/rename a column after groupBy(). I have the following DataFrame and have performed a group by operation, but I am not seeing an option to rename the aggregated column. By default, the result column is named after the aggregate function and the column it was applied to, for example sum(salary).


from pyspark.sql import SparkSession
spark = SparkSession.builder \
   .appName('SparkByExamples.com') \
   .getOrCreate()
simpleData = [("James","Sales","NY",90000,34,10000),
  ("Michael","Sales","NV",86000,56,20000),
  ("Robert","Sales","CA",81000,30,23000),
  ("Maria","Finance","CA",90000,24,23000),
  ("Raman","Finance","DE",99000,40,24000),
  ("Scott","Finance","NY",83000,36,19000),
  ("Jen","Finance","NY",79000,53,15000),
  ("Jeff","Marketing","NV",80000,25,18000),
  ("Kumar","Marketing","NJ",91000,50,21000)
]
schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
df.groupBy("state").sum("salary").show()

This outputs


+-----+-----------+
|state|sum(salary)|
+-----+-----------+
|   NJ|      91000|
|   NV|     166000|
|   CA|     171000|
|   DE|      99000|
|   NY|     252000|
+-----+-----------+

I would like to rename sum(salary) to sum_salary.

Solution – PySpark Column alias after groupBy()

In PySpark, the approach used above doesn't provide an option to rename/alias a column after the groupBy() aggregation, but there are several other ways to give an alias to a groupBy() aggregation column. Let's see them with examples (the same approaches work for Spark with Scala). Use the one that fits your need.

1. Use alias()

Use the sum() SQL function to perform a summary aggregation, which returns a Column type, and then use alias() of the Column type to rename the DataFrame column. alias() takes a string argument with the column name you want. The example below renames the aggregated column to sum_salary.


from pyspark.sql.functions import sum
df.groupBy("state") \
  .agg(sum("salary").alias("sum_salary")) \
  .show()

2. Use withColumnRenamed() to Rename a groupBy() Column

Another approach is to use the PySpark DataFrame withColumnRenamed() operation to alias/rename a column of the groupBy() result. Pass the existing column name as the first argument and the column name you want as the second argument.


df.groupBy("state") \
  .sum("salary") \
  .withColumnRenamed("sum(salary)", "sum_salary") \
  .show()
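
withColumnRenamed() calls can also be chained when the group by produces more than one aggregated column. Below is a small sketch of that idea; the avg(bonus) aggregate is an illustrative addition and not part of the original example.


# Aggregate two columns at once using a dict, then rename both results
# (the bonus average is an illustrative extra)
df.groupBy("state") \
  .agg({"salary": "sum", "bonus": "avg"}) \
  .withColumnRenamed("sum(salary)", "sum_salary") \
  .withColumnRenamed("avg(bonus)", "avg_bonus") \
  .show()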

3. Use select() Transformation

You can also use select(). One disadvantage of this approach is that you need to list every column you want in the resulting DataFrame; if the new DataFrame needs many columns, this becomes tedious because you have to name them all.


from pyspark.sql.functions import col
df.groupBy("state") \
  .sum("salary") \
  .select(col("state"),col("sum(salary)").alias("sum_salary")) \
  .show()
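
A closely related variant is selectExpr(), which lets you write the alias as a SQL snippet. This is a minimal sketch rather than part of the original example; note the backticks, which are needed because the generated column name sum(salary) contains parentheses.


# Rename the aggregated column with a SQL expression inside selectExpr()
df.groupBy("state") \
  .sum("salary") \
  .selectExpr("state", "`sum(salary)` as sum_salary") \
  .show()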

4. Use SQL Expression for groupBy()

Another approach is to use Spark SQL after creating a temporary view; with this you can provide an alias to the groupBy() aggregation column just as you would in a SQL expression.


df.createOrReplaceTempView("EMP")
spark.sql("select state, sum(salary) as sum_salary from EMP " +
          "group by state").show()

5. Conclusion

In this article, I have covered how to rename/alias a groupBy() aggregation column using the Column alias() function, withColumnRenamed(), select(), and finally a Spark SQL expression.

Happy Learning !!
