How to Remove Duplicate Columns in Polars?

To remove duplicate columns in Polars, you need to identify the columns with identical values across all rows and retain only the unique ones. Since Polars doesn’t offer a built-in function like drop_duplicates() for columns, you’ll need to apply different techniques to filter out the duplicates. This process involves comparing the columns’ data and selecting the distinct ones for the final DataFrame.

In this article, I will explain several ways to remove duplicate columns in Polars with examples.

Key Points –

  • Duplicate columns are detected by comparing their content, not just their column names.
  • Comparing each column’s full list of values (e.g., via to_list()) reliably identifies duplicates.
  • hash_rows() offers an efficient comparison by generating a hash fingerprint of each column’s content.
  • transpose() treats columns as rows, so unique() can then drop the duplicates.
  • zip() and map() can pair column names with their values for direct comparison.
  • Removing duplicate columns reduces memory usage, improves performance, and simplifies data analysis tasks.

Removing Duplicate Columns in Polars

To eliminate duplicate columns from a DataFrame, you first identify which columns are exact duplicates of one another by comparing their contents, and then use the select() method to keep only one column from each group of duplicates. The sections below walk through several ways to do this.

Now, let’s create a Polars DataFrame.


import polars as pl

df = pl.DataFrame({
    "A": [2, 4, 6],
    "B": [3, 5, 7],
    "A_dup": [2, 4, 6],   
    "B_dup": [3, 5, 7],   
    "C": [5, 10, 15]
})
print("Original DataFrame:\n", df)

Yields below output.


# Output:
# Original DataFrame:
# shape: (3, 5)
# ┌─────┬─────┬───────┬───────┬─────┐
# │ A   ┆ B   ┆ A_dup ┆ B_dup ┆ C   │
# │ --- ┆ --- ┆ ---   ┆ ---   ┆ --- │
# │ i64 ┆ i64 ┆ i64   ┆ i64   ┆ i64 │
# ╞═════╪═════╪═══════╪═══════╪═════╡
# │ 2   ┆ 3   ┆ 2     ┆ 3     ┆ 5   │
# │ 4   ┆ 5   ┆ 4     ┆ 5     ┆ 10  │
# │ 6   ┆ 7   ┆ 6     ┆ 7     ┆ 15  │
# └─────┴─────┴───────┴───────┴─────┘

To find duplicates based on the first row of the DataFrame, you can compare the value each column holds in row 0 (the values df.row(0) returns across all columns). If two columns share the same first value, they are treated as duplicates. Note that this is only a quick heuristic: two columns that happen to start with the same value but differ in later rows would be wrongly dropped, so use it only when the first row is known to be distinctive.


# Find duplicates by comparing values in the first row
cols = df.columns
unique_cols = []
seen = set()

for col in cols:
    first_row_value = df[col][0]  # Get the value from the first row of each column
    if first_row_value not in seen:
        seen.add(first_row_value)
        unique_cols.append(col)

# Select unique columns based on the first row's values
df_unique = df.select(unique_cols)
print("DataFrame with unique columns based on the first row:\n", df_unique)

Here,

  • df[col][0] retrieves the first-row value of each column (the same values df.row(0) returns across all columns).
  • Each first value is checked against the set of values already seen (seen); if it is new, the column name is added to unique_cols.
  • Finally, df.select(unique_cols) keeps only those columns.
# Output:
# DataFrame with unique columns based on the first row:
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘
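
Because the first-row check can wrongly drop a column that merely shares its first value with another, you can instead use it as a cheap pre-filter and verify candidates against the full column contents. The following is a minimal sketch using only the calls shown above:


# Use the first row as a cheap pre-filter, then verify full contents
unique_cols = []

for col in df.columns:
    first = df[col][0]
    # Only previously kept columns sharing the same first value are candidates
    candidates = [c for c in unique_cols if df[c][0] == first]
    # Discard the column only if a candidate matches it in full
    if not any(df[col].to_list() == df[c].to_list() for c in candidates):
        unique_cols.append(col)

df_unique = df.select(unique_cols)
print(df_unique)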

Using to_list() to Compare Entire Columns

To compare entire columns, extract each column’s values with to_list() and keep a column only if its full list of values has not been seen before. Unlike the first-row heuristic above, this compares every value, so two different columns can never be mistaken for duplicates.


# Compare full column values using to_list()
seen_values = []
unique_cols = []

for col in df.columns:
    col_values = df[col].to_list()
    if col_values not in seen_values:
        seen_values.append(col_values)
        unique_cols.append(col)

# Select only unique columns
df_unique = df.select(unique_cols)
print("\nDataFrame with duplicate columns removed:\n", df_unique)

# Output:
# DataFrame with duplicate columns removed:
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘

Here,

  • to_list() is used to get all values from each column.
  • We compare the full list of values for each column to those we’ve already seen.
  • If it’s new, we keep it; otherwise, we skip it.
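
As a variant, you can compare the Series objects directly instead of converting them to Python lists. This is a minimal sketch, assuming a recent Polars version where Series.equals() is available:


# Compare Series directly with Series.equals() (recent Polars versions)
unique_cols = []

for col in df.columns:
    # Keep the column only if no previously kept column has equal values
    if not any(df[col].equals(df[kept]) for kept in unique_cols):
        unique_cols.append(col)

df_unique = df.select(unique_cols)
print(df_unique)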

Using Hashing via hash_rows() on Each Column

If you want to remove duplicate columns by comparing their contents with hashing, an efficient approach is to call hash_rows() on each column individually and compare the resulting hash lists. This avoids comparing raw values element by element, which helps when columns are long. (In principle, two different columns could produce identical hashes, but such collisions are extremely unlikely in practice.)


# Use hashing to detect duplicate columns
hashes = []
unique_cols = []

for col in df.columns:
    col_hash = df.select(col).hash_rows().to_list()  # hash each row of this column
    if col_hash not in hashes:
        hashes.append(col_hash)
        unique_cols.append(col)

df_unique = df.select(unique_cols)
print("DataFrame with duplicate columns removed using hash_rows():\n", df_unique)

# Output:
# DataFrame with duplicate columns removed using hash_rows():
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘

Here,

  • hash_rows() produces a hash value for each row of the single-column frame.
  • The full list of row hashes serves as a content fingerprint for the column; columns with identical fingerprints are treated as duplicates.
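
When a DataFrame has many columns, scanning a list of fingerprints for every column becomes slow. A minimal sketch of the same idea stores each fingerprint as a tuple in a set, making every membership check O(1):


# Store hash fingerprints in a set for O(1) membership checks
seen_hashes = set()
unique_cols = []

for col in df.columns:
    # tuple() makes the fingerprint hashable so it can be stored in a set
    fingerprint = tuple(df.select(col).hash_rows().to_list())
    if fingerprint not in seen_hashes:
        seen_hashes.add(fingerprint)
        unique_cols.append(col)

df_unique = df.select(unique_cols)
print(df_unique)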

Using transpose() + unique() to Find Unique Columns

By using transpose() together with unique() in Polars, you can identify and retain unique columns based on their content. This method treats columns as rows, removes the duplicate rows, and transposes the result back to the original structure. Be aware of two side effects: transpose() replaces the original column names with generic ones (column_0, column_1, ...), and unique() does not preserve the original order unless you ask it to; a sketch for preserving the names follows at the end of this section.


# Transpose the DataFrame (columns become rows)
df_transposed = df.transpose()

# Use unique() to find unique rows (columns in original DataFrame)
df_unique_transposed = df_transposed.unique()

# Transpose back to original form to get the DataFrame with unique columns
df_unique = df_unique_transposed.transpose()
print("DataFrame with unique columns (via transpose + unique):\n", df_unique)


# Or as a one-liner: transpose, use unique() to remove duplicate rows
# (which were originally columns), and transpose back
df_unique = df.transpose().unique().transpose()
print("DataFrame with unique columns (via transpose + unique):\n", df_unique)

# Output:
# DataFrame with unique columns (via transpose + unique):
# shape: (3, 3)
# ┌──────────┬──────────┬──────────┐
# │ column_0 ┆ column_1 ┆ column_2 │
# │ ---      ┆ ---      ┆ ---      │
# │ i64      ┆ i64      ┆ i64      │
# ╞══════════╪══════════╪══════════╡
# │ 5        ┆ 3        ┆ 2        │
# │ 10       ┆ 5        ┆ 4        │
# │ 15       ┆ 7        ┆ 6        │
# └──────────┴──────────┴──────────┘

Here,

  • Transpose (df.transpose()): This converts the columns into rows.
  • Unique (unique()): This removes rows that are exact duplicates (which were originally columns).
  • Transpose again: After filtering for unique rows, we transpose the DataFrame back to its original column format.
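
To keep the original column names, you can carry them through the transpose as a header column. This is a sketch assuming a Polars version that supports transpose(include_header=True), unique(..., keep="first", maintain_order=True), and transpose(column_names=...):


# Preserve column names by carrying them through the transpose
t = df.transpose(include_header=True)  # adds a "column" field holding the names
value_cols = [c for c in t.columns if c != "column"]

# Deduplicate on the value columns only, keeping the first occurrence in order
t_unique = t.unique(subset=value_cols, keep="first", maintain_order=True)

# Transpose back, restoring the surviving column names
df_unique = t_unique.drop("column").transpose(column_names=t_unique["column"].to_list())
print(df_unique)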

Using Map and Zip to Compare Columns Directly

You can also combine Python’s built-in zip() and map() functions to compare columns directly. map() converts each column into a list of its values, and zip() pairs every column name with that list so names and contents can be walked together. (Note that pairing consecutive names, as in zip(columns, columns[1:]), would never examine the last column, so pair names with values instead.)


# Using zip() and map() to compare columns and keep unique ones
columns = df.columns
# map() converts every column into a plain Python list of its values
col_values = list(map(lambda c: df[c].to_list(), columns))

unique_columns = []
seen_values = []

# zip() pairs each column name with its list of values
for name, values in zip(columns, col_values):
    if values not in seen_values:
        seen_values.append(values)
        unique_columns.append(name)

# Select only unique columns
df_unique = df.select(unique_columns)
print("DataFrame with duplicate columns removed using zip + map:\n", df_unique)

# Output:
# DataFrame with duplicate columns removed using zip + map:
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘

Here,

  • map() converts each column into a plain Python list of its values.
  • zip() pairs every column name with its list of values so names and contents can be walked together; pairing consecutive names (zip(columns, columns[1:])) would skip the last column entirely.
  • A column is kept only if its values have not been seen before; the unique columns are then selected with df.select().

Conclusion

In conclusion, techniques such as comparing full column values with to_list(), hashing with hash_rows(), combining transpose() with unique(), or pairing columns with zip() and map() let you efficiently identify and remove duplicate columns in Polars. All of these compare columns by their content, ensuring that only the unique ones are retained.

Happy Learning!!
