PySpark show() – Display DataFrame Contents in Table

PySpark DataFrame show() is used to display the contents of the DataFrame in a table row-and-column format. By default, it shows only 20 rows, and column values are truncated at 20 characters. 1. PySpark DataFrame show() Syntax & Example 1.1 Syntax def show(self, n=20, truncate=True, vertical=False): 1.2…

Continue Reading PySpark show() – Display DataFrame Contents in Table

Spark show() – Display DataFrame Contents in Table

Spark/PySpark DataFrame show() is used to display the contents of the DataFrame in a table row-and-column format. By default, it shows only 20 rows, and column values are truncated at 20 characters. 1. Spark DataFrame show() Syntax & Example 1.1 Syntax def show() def show(numRows : scala.Int)…

Continue Reading Spark show() – Display DataFrame Contents in Table

Find Maximum Row per Group in Spark DataFrame

In Spark, the maximum (max) row per group can be found by using the window partitionBy() function and running the row_number() function over the window partition. Let's see this with a DataFrame example. 1. Prepare Data & DataFrame First, let's create a Spark DataFrame with 3 columns: employee_name, department, and salary. Column department contains different departments…

Continue Reading Find Maximum Row per Group in Spark DataFrame

Spark Get Current Number of Partitions of DataFrame

While working with Spark/PySpark, we often need to know the current number of partitions of a DataFrame/RDD, as tuning the size/length of partitions is one of the key factors in improving Spark/PySpark job performance. In this article, let's learn how to get the current partition count/size with examples. Related: How…

Continue Reading Spark Get Current Number of Partitions of DataFrame

Spark Check Column Present in DataFrame

You can get all columns of a DataFrame as an Array[String] by using the columns attribute of a Spark DataFrame, and use it with Scala Array functions to check if a column/field is present in the DataFrame. In this article, I will also cover how to check if a column is present/exists in a nested column…

Continue Reading Spark Check Column Present in DataFrame

PySpark Select Top N Rows From Each Group

In PySpark, the top N rows from each group can be selected by partitioning the data using the Window.partitionBy() function, running the row_number() function over the grouped partition, and finally filtering the rows to get the top N rows. Let's see this with a DataFrame example. Below is a quick snippet that…

Continue Reading PySpark Select Top N Rows From Each Group

PySpark Find Maximum Row per Group in DataFrame

In PySpark, the maximum (max) row per group can be found using the Window.partitionBy() function and running the row_number() function over the window partition. Let's see this with a DataFrame example. 1. Prepare Data & DataFrame First, let's create a PySpark DataFrame with 3 columns: employee_name, department, and salary. Column department contains different departments…

Continue Reading PySpark Find Maximum Row per Group in DataFrame

PySpark Select First Row of Each Group?

In PySpark, the first row of each group within a DataFrame can be retrieved by partitioning the data using the window partitionBy() function and running the row_number() function over the window partition. Let's see this with an example. 1. Prepare Data & DataFrame Before we start, let's create a PySpark DataFrame with 3…

Continue Reading PySpark Select First Row of Each Group?

PySpark Check Column Exists in DataFrame

Problem: I have a PySpark DataFrame and I would like to check if a column exists in the DataFrame schema. Could you please explain how to do it? Also, I need to check if the DataFrame columns are present in a list of strings. 1. Solution: PySpark Check if Column…

Continue Reading PySpark Check Column Exists in DataFrame