  • Post category: Polars
  • Post last modified: January 3, 2025

In Polars, the unique() function is used to return a DataFrame with unique rows, based on specific columns or the entire DataFrame. It allows you to control which duplicates to keep and whether to maintain the original row order.


In this article, I will explain the Polars DataFrame.unique() function, covering its syntax, parameters, and usage, to show how to generate a new DataFrame with duplicates removed based on the specified subset and keep policy.

Key Points –

  • The unique() function is used to return a DataFrame with unique rows, based on specific column(s) or the entire DataFrame.
  • You can specify a subset of columns to consider for uniqueness using the subset parameter, or leave it as None to consider all columns.
  • The keep parameter allows you to control which duplicates to keep: 'first', 'last', 'any', or 'none'.
  • The maintain_order parameter, when set to True, ensures that the original order of rows is preserved after filtering for uniqueness.
  • You can pass a list of column names to the subset parameter to check for uniqueness based on a combination of multiple columns.
  • Polars is optimized for high-performance computing, making unique() efficient for large datasets.
  • The function returns a new DataFrame containing only the unique rows, without modifying the original DataFrame.
  • By default, the order of rows may change after filtering for uniqueness unless maintain_order=True is set.

Syntax of Polars DataFrame.unique()

Let’s look at the syntax of the Polars DataFrame unique() function.


# Syntax of unique()
DataFrame.unique(
    subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None,
    *,
    keep: UniqueKeepStrategy = 'any',
    maintain_order: bool = False,
) → DataFrame

Parameters of the Polars DataFrame.unique()

The following are the parameters of the Polars DataFrame.unique() function.

  • subset – Specifies which columns to consider when determining uniqueness. Accepts a single column name (string), a collection of column names, or a column selector. If None (default), all columns are used.
  • keep – Determines which duplicate to keep:
    • 'first' – Keeps the first occurrence of each unique row.
    • 'last' – Keeps the last occurrence of each unique row.
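    • 'any' – Keeps one occurrence of each unique row, with no guarantee about which one (default behavior).
    • 'none' – Drops every row that has a duplicate, so only rows that occur exactly once are kept. Note that this is the string 'none', not Python's None.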
  • maintain_order – If True, preserves the original order of rows. If False (default), the order of rows may change after the operation.

Return Value

This function returns a new DataFrame with unique rows, based on the specified parameters.

Usage of Polars DataFrame.unique() Function

The unique() function returns a new DataFrame containing unique rows based on specified columns or the entire DataFrame. Duplicate rows are removed based on the defined criteria.

To run some examples of the Polars DataFrame.unique() function, let’s create a Polars DataFrame.


import polars as pl

technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas", "Python", "Spark", "pandas"],
    'Fee': [20000, 25000, 22000, 30000, 22000, 20000, 30000],
    'Duration': ['30days', '40days', '35days', '50days', '40days', '30days', '50days'],
    'Discount': [1000, 2300, 1200, 2000, 2300, 1000, 2000],
}
df = pl.DataFrame(technologies)
print("Original DataFrame:\n", df)

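This prints the original DataFrame: seven rows and four columns (Courses, Fee, Duration, Discount). The second Spark row and the second pandas row are exact duplicates of earlier rows, while the two Python rows differ in Duration and Discount.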

The unique() function in Polars can be used to get unique rows from the entire DataFrame. By default, it considers all columns when determining uniqueness, and, because keep defaults to 'any', it keeps one occurrence of each unique row without guaranteeing which one.


# Get unique rows based on all columns
df2 = df.unique()
print("Get unique rows based on all columns:\n",df2)

In the above example, the resulting DataFrame contains unique rows based on all the columns, removing any duplicate rows that have the exact same values across all columns. For instance, if multiple rows share the same values in “Courses,” “Fee,” “Duration,” and “Discount,” only one occurrence of such a row is retained in the output. For the sample data, this leaves five rows: the repeated Spark and pandas rows collapse to one each, while both Python rows stay because their Duration and Discount values differ.

Unique Rows Based on a Single Column

To find the unique rows in a Polars DataFrame based on a single column, you can use the unique() method, specifying the column name. This ensures that only one occurrence of each unique value in the specified column is retained.


# Unique rows based on the 'Courses' column
result = df.unique(subset="Courses")
print("Unique Rows Based on 'Courses':\n", result)

# Output:
# Unique Rows Based on 'Courses':
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘

In the above example, the resulting DataFrame will retain only one row for each unique value in the “Courses” column. Duplicate rows with the same “Courses” value will be removed, but other columns will retain their values from the retained row.

Maintain Original Order of Rows

To maintain the original order of rows when retrieving unique rows from a DataFrame, set the maintain_order parameter to True. The surviving rows then appear in the same relative order as in the original DataFrame; without it, the output order is not guaranteed. The Polars documentation notes that this option is more expensive to compute.


# Get unique rows based on 'Courses' while maintaining the original order
result = df.unique(subset=["Courses"], maintain_order=True)
print("Unique Rows with Original Order Maintained:\n", result)

# Output:
# Unique Rows with Original Order Maintained:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Keep the First Occurrence of Duplicates

To retain only the first occurrence of duplicates in a Polars DataFrame based on a specific column, use the unique(keep="first") method. This ensures that only the first row for each duplicate value in the specified column is kept, while others are discarded. Because maintain_order is not set in this example, the order of the rows in the output is arbitrary.


# Retain only the first occurrence of duplicates based on the 'Courses' column
result = df.unique(subset="Courses", keep="first")
print("DataFrame with First Occurrence of Duplicates:\n", result)

# Output:
# DataFrame with First Occurrence of Duplicates:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • subset="Courses": Specifies the column (Courses) to check for duplicates.
  • keep="first": Retains the first occurrence of each duplicate value and removes subsequent occurrences.

Keep the Last Occurrence of Duplicates

To retain only the last occurrence of duplicates in a Polars DataFrame based on a specific column, use the unique(keep="last") method. This ensures that only the last row for each duplicate value in the specified column is kept.


# Retain only the last occurrence of duplicates based on the 'Courses' column
result = df.unique(subset="Courses", keep="last")
print("DataFrame with Last Occurrence of Duplicates:\n", result)

# Output:
# DataFrame with Last Occurrence of Duplicates:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ Python  ┆ 22000 ┆ 40days   ┆ 2300     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • subset="Courses": Specifies the column (Courses) to check for duplicates.
  • keep="last": Retains the last occurrence of each duplicate value and removes earlier occurrences.

Unique Rows Based on Multiple Columns

To get unique rows based on multiple columns in a Polars DataFrame, you can use the unique() function with the subset parameter, where you specify a list of columns you want to consider for determining uniqueness.


# Get unique rows based on 'Courses' and 'Fee' columns
result = df.unique(subset=["Courses", "Fee"])
print("Unique rows based on multiple columns:\n", result)

# Output:
# Unique rows based on multiple columns:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
└─────────┴───────┴──────────┴──────────┘

Similarly, you can get unique rows based on the ‘Courses’ and ‘Fee’ columns while maintaining the original order by also setting maintain_order=True.


# Get unique rows based on 'Courses' and 'Fee' columns 
# While maintaining the original order
result = df.unique(subset=["Courses", "Fee"], maintain_order=True)
print("Unique rows based on multiple columns with original order maintained:\n", result)

# Output:
# Unique rows based on multiple columns with original order maintained:
# shape: (4, 4)
┌─────────┬───────┬──────────┬──────────┐
│ Courses ┆ Fee   ┆ Duration ┆ Discount │
│ ---     ┆ ---   ┆ ---      ┆ ---      │
│ str     ┆ i64   ┆ str      ┆ i64      │
╞═════════╪═══════╪══════════╪══════════╡
│ Spark   ┆ 20000 ┆ 30days   ┆ 1000     │
│ PySpark ┆ 25000 ┆ 40days   ┆ 2300     │
│ Python  ┆ 22000 ┆ 35days   ┆ 1200     │
│ pandas  ┆ 30000 ┆ 50days   ┆ 2000     │
└─────────┴───────┴──────────┴──────────┘

Here,

  • The subset=["Courses", "Fee"] ensures that uniqueness is determined based on the combination of values in the “Courses” and “Fee” columns.
  • The maintain_order=True ensures that the original row order is preserved in the resulting DataFrame.

Conclusion

In conclusion, the Polars DataFrame.unique() function is a powerful tool for handling duplicate rows in a DataFrame. By understanding its syntax, parameters, and usage, you can efficiently filter unique rows based on specific columns or the entire DataFrame. The flexibility of the subset and keep parameters allows you to tailor the function to your needs, whether you’re preserving the first or last occurrence of duplicates.

Happy Learning!!
