PySpark Count Distinct from DataFrame

In PySpark, you can get a distinct count either by chaining distinct().count() on a DataFrame or by using the countDistinct() SQL function. distinct() removes duplicate records (rows that match on all columns) from the DataFrame, and count() returns the number of records in the DataFrame. Chaining the two gives the number of distinct rows in a PySpark DataFrame. countDistinct() is…


Spark SQL – Count Distinct from DataFrame

In this Spark SQL tutorial, you will learn different ways to count the distinct values in every column, or in selected columns, of a DataFrame, using DataFrame methods and SQL functions, with Scala examples. Before we start, first let's create a DataFrame with some duplicate rows and…
