PySpark DataFrame groupBy and Sort by Descending Order

PySpark DataFrame groupBy(), filter(), and sort() – In this PySpark example, let’s see how to do the following operations in sequence: 1) group the DataFrame using groupBy() with the aggregate function sum(), 2) filter() the grouped result, and 3) use sort() or orderBy() to arrange it in descending or ascending order.

In order to demonstrate all these operations together, let’s create a PySpark DataFrame.


# Create a SparkSession (the entry point for DataFrame operations)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NV",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","DE",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","NV",80000,25,18000),
    ("Kumar","Marketing","NJ",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)

Using DataFrame groupBy(), filter() and sort()

Below is a complete PySpark DataFrame example of how to do group by, filter, and sort by descending order.


from pyspark.sql.functions import sum, col, desc
df.groupBy("state") \
  .agg(sum("salary").alias("sum_salary")) \
  .filter(col("sum_salary") > 100000) \
  .sort(desc("sum_salary")) \
  .show()

#+-----+----------+
#|state|sum_salary|
#+-----+----------+
#|   NY|    252000|
#|   CA|    171000|
#|   NV|    166000|
#+-----+----------+
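Since sort() and orderBy() are interchangeable on a DataFrame, the same chain can also be written with orderBy() and the Column.desc() method. A minimal equivalent sketch:


from pyspark.sql.functions import sum, col

# Same result using orderBy() and Column.desc()
df.groupBy("state") \
  .agg(sum("salary").alias("sum_salary")) \
  .filter(col("sum_salary") > 100000) \
  .orderBy(col("sum_salary").desc()) \
  .show()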

Alternatively, you can use the following SQL expression to achieve the same result.


df.createOrReplaceTempView("EMP")
spark.sql("select state, sum(salary) as sum_salary from EMP " +
          "group by state having sum_salary > 100000 " + 
          "order by sum_salary desc").show()

Below is an explanation of the first DataFrame example, step by step.

First, let’s do a PySpark groupBy() on the DataFrame with the aggregate function sum("salary"). groupBy() returns a GroupedData object, which provides aggregate functions such as sum(), max(), min(), avg(), mean(), and count().


df.groupBy("state").sum("salary").show()
#+-----+-----------+
#|state|sum(salary)|
#+-----+-----------+
#|   NJ|      91000|
#|   NV|     166000|
#|   CA|     171000|
#|   DE|      99000|
#|   NY|     252000|
#+-----+-----------+

In the above example, you cannot give an alias to the aggregate column; the column name defaults to the aggregate function applied to the column name (sum(salary)). This default name is not user friendly, so let’s see how to provide an alias by using the agg() function. Knowing the exact column name also makes the subsequent filter and sort steps easier.
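As an aside, you could also rename the default column with withColumnRenamed(); this works, but the agg() with alias() approach shown next is cleaner:


# Rename the default aggregate column after group by
df.groupBy("state").sum("salary") \
  .withColumnRenamed("sum(salary)", "sum_salary") \
  .show()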


# Group by with an alias for the aggregate column
from pyspark.sql.functions import sum
dfGroup=df.groupBy("state") \
          .agg(sum("salary").alias("sum_salary"))
dfGroup.show(truncate=False)

#+-----+----------+
#|state|sum_salary|
#+-----+----------+
#|NJ   |91000     |
#|NV   |166000    |
#|CA   |171000    |
#|DE   |99000     |
#|NY   |252000    |
#+-----+----------+
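Note that agg() can also compute several aggregates at once, each with its own alias. A minimal sketch, assuming you also want the average salary and maximum bonus per state:


from pyspark.sql.functions import sum, avg, max

# Multiple aggregates with aliases in a single agg()
df.groupBy("state") \
  .agg(sum("salary").alias("sum_salary"),
       avg("salary").alias("avg_salary"),
       max("bonus").alias("max_bonus")) \
  .show(truncate=False)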

In the alias example above, I used the PySpark SQL function sum(), which returns a Column type, and applied alias() on it. Now let’s see how to filter the grouped data. The example below selects the rows whose total salary is greater than 100,000.


# Filter after group by
dfFilter=dfGroup.filter(dfGroup.sum_salary > 100000)
dfFilter.show()

#+-----+----------+
#|state|sum_salary|
#+-----+----------+
#|   NV|    166000|
#|   CA|    171000|
#|   NY|    252000|
#+-----+----------+
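filter() also accepts a SQL expression string, so the same condition can be written without a column reference:


# Equivalent filter using a SQL expression string
dfGroup.filter("sum_salary > 100000").show()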

Now, let’s use the DataFrame sort() transformation to sort on the aggregated column; by default, it sorts in ascending order.


# Sort by the aggregated column (ascending by default)
dfFilter.sort("sum_salary").show()

#+-----+----------+
#|state|sum_salary|
#+-----+----------+
#|   NV|    166000|
#|   CA|    171000|
#|   NY|    252000|
#+-----+----------+
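To make the ascending order explicit, you can pass asc() instead of the bare column name; the result is identical:


# Explicit ascending sort
from pyspark.sql.functions import asc
dfFilter.sort(asc("sum_salary")).show()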

To sort in descending order, use the Spark SQL function desc().


# Sort by descending order.
from pyspark.sql.functions import desc
dfFilter.sort(desc("sum_salary")).show()

#+-----+----------+
#|state|sum_salary|
#+-----+----------+
#|   NY|    252000|
#|   CA|    171000|
#|   NV|    166000|
#+-----+----------+
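Equivalently, you can call the Column.desc() method on the column itself:


# Descending sort using the Column.desc() method
dfFilter.sort(dfFilter.sum_salary.desc()).show()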

Conclusion

In this PySpark example, I have explained how to do DataFrame groupBy(), filter(), and sort() in descending order. I hope you like it.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen on LinkedIn and Medium.
