To remove duplicate columns in Polars, you need to identify the columns with identical values across all rows and retain only the unique ones. Since Polars doesn't offer a built-in function like drop_duplicates() for columns, you have to compare the columns' data yourself and select only the distinct ones for the final DataFrame. In this article, I will explain several ways to remove duplicate columns in Polars with examples.
Key Points –
- Duplicate columns are detected by comparing their content, not just their column names.
- You can compare the full values of columns using rows() or just the first row using row(0) to identify duplicates.
- Removing duplicate columns reduces memory usage, improves performance, and simplifies data analysis tasks.
- Using hash_rows() is an efficient way to compare columns by generating hashes of their content.
- You can use transpose() to treat columns as rows, apply unique() to remove the duplicates, and then transpose back.
- The zip() function can pair up columns for direct comparison.
Removing Duplicate Columns in Polars
To eliminate duplicate columns from a DataFrame, you first identify the columns whose contents are exact duplicates and then pass only the unique column names to select(), so each distinct column appears only once. The sections below walk through several ways to do this in Polars.
Now, let’s create a Polars DataFrame.
import polars as pl
df = pl.DataFrame({
"A": [2, 4, 6],
"B": [3, 5, 7],
"A_dup": [2, 4, 6],
"B_dup": [3, 5, 7],
"C": [5, 10, 15]
})
print("Original DataFrame:\n", df)
This creates a DataFrame in which A_dup and B_dup are exact duplicates of columns A and B.
To find duplicates based on the first row of the DataFrame, you can compare the value each column holds in row(0). If two columns have the same first-row value, this quick check treats them as duplicates. Keep in mind this is only a heuristic: columns that merely share a first value but differ in later rows will be flagged as duplicates too.
# Find duplicates by comparing values in the first row
cols = df.columns
unique_cols = []
seen = set()
for col in cols:
    first_row_value = df[col][0]  # Get the value from the first row of each column
    if first_row_value not in seen:
        seen.add(first_row_value)
        unique_cols.append(col)

# Select unique columns based on the first row's values
df_unique = df.select(unique_cols)
print("DataFrame with unique columns based on the first row:\n", df_unique)
Here,

- df[col][0] reads each column's entry in the first row (the same values row(0) returns).
- For each column, we compare that first value with a set of already seen values (seen). If it's a new value, the column name is added to unique_cols.
- Finally, we select only the unique columns using df.select(unique_cols).
Using rows() to Compare Entire Columns
To compare the entire contents of columns, you can extract all rows with rows() and transpose them to get column-wise data, or more simply pull each column's values out as a Python list with to_list(), as below, and keep a column only if its values haven't been seen before.
# Compare the full list of values in each column
seen_values = []
unique_cols = []
for col in df.columns:
    col_values = df[col].to_list()
    if col_values not in seen_values:
        seen_values.append(col_values)
        unique_cols.append(col)

# Select only unique columns
df_unique = df.select(unique_cols)
print("\nDataFrame with duplicate columns removed:\n", df_unique)
# Output:
# DataFrame with duplicate columns removed:
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘
Here,

- to_list() extracts all values from each column as a Python list.
- We compare the full list of values for each column to the lists we've already seen.
- If it's new, we keep the column; otherwise, we skip it.
Using Hashing via hash_rows() on Each Column
Another way to detect duplicate columns is hashing. Calling hash_rows() on a single-column DataFrame produces one hash per row, and the resulting list of hashes acts as a compact fingerprint of the column's content, so columns can be compared by fingerprint instead of by their raw values.
# Use hashing to detect duplicate columns
hashes = []
unique_cols = []
for col in df.columns:
    col_hash = pl.DataFrame({col: df[col]}).hash_rows().to_list()
    if col_hash not in hashes:
        hashes.append(col_hash)
        unique_cols.append(col)

df_unique = df.select(unique_cols)
print("DataFrame with duplicate columns removed using hash_rows():\n", df_unique)
# Output:
# DataFrame with duplicate columns removed using hash_rows():
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘
Here,

- hash_rows() hashes each row of the single-column DataFrame.
- The resulting list of hashes serves as a content fingerprint for the column; two columns with identical fingerprints are treated as duplicates.
Using transpose() + unique() to Find Unique Columns
By using transpose() together with unique() in Polars, you can identify and retain unique columns based on their content. This method treats columns as rows, removes the duplicate rows, and then transposes the result back to the original structure.
# Transpose the DataFrame (columns become rows)
df_transposed = df.transpose()

# Use unique() to find unique rows (columns in the original DataFrame)
df_unique_transposed = df_transposed.unique()

# Transpose back to original form to get the DataFrame with unique columns
df_unique = df_unique_transposed.transpose()
print("DataFrame with unique columns (via transpose + unique):\n", df_unique)

# Equivalently as a one-liner: transpose, remove duplicate rows
# (which were originally columns), and transpose back
df_unique = df.transpose().unique().transpose()
print("DataFrame with unique columns (via transpose + unique):\n", df_unique)
# Output:
# DataFrame with unique columns (via transpose + unique):
# shape: (3, 3)
# ┌──────────┬──────────┬──────────┐
# │ column_0 ┆ column_1 ┆ column_2 │
# │ ---      ┆ ---      ┆ ---      │
# │ i64      ┆ i64      ┆ i64      │
# ╞══════════╪══════════╪══════════╡
# │ 5        ┆ 3        ┆ 2        │
# │ 10       ┆ 5        ┆ 4        │
# │ 15       ┆ 7        ┆ 6        │
# └──────────┴──────────┴──────────┘
Here,

- Transpose (df.transpose()): converts the columns into rows.
- Unique (unique()): removes rows that are exact duplicates (which were originally columns).
- Transpose again: after filtering for unique rows, the DataFrame is transposed back to its original column format.

Note that this round trip discards the original column names (they come back as column_0, column_1, ...) and does not preserve the original column order, as the output above shows.
Using Map and Zip to Compare Columns Directly
You can also combine Python's built-in zip() and map() functions to compare columns directly in Polars. The idea is to pair each column name with the values that column contains, then keep only the first column for each distinct set of values.
# Using zip and map to compare columns and keep unique ones
# map() converts each column into a list of its values
col_values = list(map(lambda c: df[c].to_list(), df.columns))

unique_columns = []
seen_values = []
# zip() pairs each column name with its list of values
for name, values in zip(df.columns, col_values):
    if values not in seen_values:
        seen_values.append(values)
        unique_columns.append(name)

# Select only unique columns
df_unique = df.select(unique_columns)
print("DataFrame with duplicate columns removed using zip + map:\n", df_unique)

# Output:
# DataFrame with duplicate columns removed using zip + map:
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 3   ┆ 5   │
# │ 4   ┆ 5   ┆ 10  │
# │ 6   ┆ 7   ┆ 15  │
# └─────┴─────┴─────┘

Here,

- map() converts each column into a Python list of its values.
- zip(df.columns, col_values) pairs each column name with its list of values.
- A column is kept only the first time its values appear; the surviving columns are then selected with df.select(unique_columns).
Conclusion
In conclusion, by using methods like transpose(), hash_rows(), or zip() combined with map(), you can efficiently identify and remove duplicate columns in Polars. These techniques compare columns based on their content, ensuring that you retain only the unique ones.
Happy Learning!!
Related Articles
- Add a New Column into an Existing Polars DataFrame
- Conditional Assignment in Polars DataFrame
- How to Update the Polars DataFrame
- Make a Constant Column in Polars
- Extract Value of Polars Literal
- Check if any Value in a Polars DataFrame is True
- Polars Counting Elements in List Column
- Convert Polars Casting a Column to Decimal
- Polars Looping Through the Rows in a Dataset
- Removing Null Values on Selected Columns only in Polars DataFrame