PySpark Select First Row of Each Group?

In PySpark, you can select the first row of each group in a DataFrame by partitioning the data with the window partitionBy() function and running the row_number() function over each window partition. Let's see this with an example. 1. Prepare Data & DataFrame: Before we start, let's create the PySpark DataFrame with 3…

PySpark repartition() vs partitionBy()

Let's learn the difference between PySpark repartition() and partitionBy() with examples. PySpark repartition() is a DataFrame method that increases or reduces the number of partitions in memory; when the DataFrame is written to disk, it creates all part files in a single directory. PySpark partitionBy() is a method of…

PySpark partitionBy() – Write to Disk Example

PySpark partitionBy() is a method of the pyspark.sql.DataFrameWriter class that partitions a large dataset (DataFrame) into smaller files based on one or more columns while writing to disk. Let's see how to use it with Python examples. Partitioning the data on the file system is a way to…

Spark Read ORC file into DataFrame

Spark natively supports the ORC data source, so you can read ORC into a DataFrame and write it back to the ORC file format using the orc() method of DataFrameReader and DataFrameWriter. In this article, I will explain how to read an ORC file into a Spark DataFrame, perform some filtering, create a table by reading…

Spark – Read & Write Avro files from Amazon S3

In this Spark tutorial, you will learn what the Avro format is, its advantages, and how to read an Avro file from an Amazon S3 bucket into a DataFrame and write a DataFrame as an Avro file to an Amazon S3 bucket, with a Scala example. Spark provides built-in support to read from and write DataFrame to…

Spark DataFrame Select First Row of Each Group?

In this Spark article, I explain how to select the first row and the minimum (min) and maximum (max) of each group in a DataFrame using Spark SQL window functions, with a Scala example. Although the examples here are in Scala, the same method can be used with PySpark and Python. 1. Preparing Data…

Read & Write Avro files using Spark DataFrame

Spark provides built-in support to read from and write a DataFrame to an Avro file using the "spark-avro" library. In this tutorial, you will learn how to read and write Avro files along with the schema, and how to partition data for performance, with a Scala example. If you are using Spark 2.3 or older, please use this URL.…
