  • Post category:Pandas
  • Post last modified:November 26, 2024
How to Drop Duplicate Columns in Pandas DataFrame

By using pandas.DataFrame.T.drop_duplicates().T you can drop/remove/delete duplicate columns, whether they share a name or have different names. This method compares column data: for each set of columns holding identical values it keeps the first occurrence and removes the rest, so it handles both a repeated column name with the same data and differently named columns with the same data. In this article, I will explain several ways to drop duplicate columns from a Pandas DataFrame with examples.


Key Points –

  • Use .T.duplicated() on the transposed DataFrame to identify columns with duplicate values, as this checks each column’s data.
  • Filter columns using DataFrame.loc[:, ~DataFrame.T.duplicated()] to remove duplicate columns and keep only unique ones.
  • The keep='first' parameter in .duplicated() retains the first occurrence of each duplicate column, dropping subsequent duplicates.
  • Set keep='last' in .duplicated() to keep the last occurrence of each duplicate column while dropping earlier ones.
  • Use .duplicated(subset=...) on the transposed DataFrame to compare columns on selected rows only. Here subset takes the original row labels, for example df.T.duplicated(subset=[0, 1]), which suits partial duplication checks.
  • If you only need to drop columns with duplicate names (not content), use DataFrame.loc[:, ~DataFrame.columns.duplicated()].
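
The name-based and content-based checks in the points above can be compared side by side. The following is a minimal sketch on a small hypothetical DataFrame; the column names a, b, c and the variable names are illustrative assumptions, not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample: the name "a" appears twice, and "c" repeats the data of the first "a"
df = pd.DataFrame([[1, 2, 1, 9], [3, 4, 3, 4]], columns=["a", "b", "c", "a"])

# Name-based check: True for a repeated column name after its first occurrence
name_mask = df.columns.duplicated()

# Content-based check: True for a column whose values repeat an earlier column
content_mask = df.T.duplicated().to_numpy()

# On the transposed frame, subset refers to the original row labels,
# so this compares columns using the row labelled 1 only
row1_mask = df.T.duplicated(subset=[1]).to_numpy()

print(name_mask.tolist())     # [False, False, False, True]
print(content_mask.tolist())  # [False, False, True, False]
print(row1_mask.tolist())     # [False, False, True, True]
```

Each mask has one entry per column, in column order, which is why it can be negated with ~ and passed to DataFrame.loc[] as a column selector.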


Quick Examples of Removing Duplicate Columns in Pandas DataFrame

If you are in a hurry, below are some quick examples of dropping duplicate columns from DataFrame.


# Quick examples of removing duplicate columns

# Example 1: Drop duplicate columns
df2 = df.T.drop_duplicates().T

# Example 2: Use groupby() 
# To drop duplicate columns
df2 = df.T.groupby(level=0).first().T

# Example 3: Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]

# Example 4: Remove repeated columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]

# Example 5: Keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]

# Example 6: Drop every column whose name is repeated
# (this also drops the first occurrence)
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

Now, let’s create a DataFrame with a few duplicate rows and columns, execute these examples, and validate the results. Our DataFrame contains the column names Courses, Fee, Duration, Subject, Fee and Discount, so the name Fee appears twice.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print("DataFrame:\n", df)

Running this prints the DataFrame with all six columns. Notice that the column Fee is duplicated exactly (same name and same data), and that the columns Courses and Subject have the same data under different column names.

Use DataFrame.drop_duplicates() to Drop Duplicate Columns

To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T. This removes all columns that have the same data, regardless of column names.


# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print("After dropping duplicate columns:\n", df2)

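This leaves the columns Courses, Fee, Duration and Discount. The second Fee column and the Subject column (which repeats the data in Courses) are removed.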
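
One side effect of the double transpose is worth checking: transposing a DataFrame with mixed column types stores every value as object, and transposing back does not restore the original dtypes. The sketch below uses a hypothetical three-column DataFrame (the names name, fee and fee_copy are assumptions for illustration) and shows one possible way to recover numeric dtypes with infer_objects().

```python
import pandas as pd

# Hypothetical sample with one text column and two identical numeric columns
df = pd.DataFrame({"name": ["x", "y"], "fee": [10, 20], "fee_copy": [10, 20]})

# Transposing a mixed-type frame stores everything as object
deduped = df.T.drop_duplicates().T
print(deduped.dtypes)

# infer_objects() asks pandas to re-detect a better dtype for each object column
restored = deduped.infer_objects()
print(restored.dtypes)
```

If dtypes matter for later calculations, this extra step (or one of the loc-based approaches, which never transpose the result) may be preferable.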

If the duplicate columns share the same name, you can also use groupby() on the transposed DataFrame to keep one column per name. Note that this doesn’t remove columns with different names and the same data.


# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

Yields below output. Note that groupby() sorts the group keys by default, so the columns are returned in sorted order.


# Output:
   Courses Discount Duration    Fee  Subject
0    Spark     1000   30days  20000    Spark
1  Pyspark     1500   35days  23000  Pyspark
2   Pandas     2000   40days  25000   Pandas
3    Spark     1000   30days  20000    Spark
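
If the sorted column order is not wanted, groupby() accepts sort=False, which keeps groups in order of first appearance. A minimal sketch on a hypothetical DataFrame (the column names fee and course are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample with a repeated column name
df = pd.DataFrame([[1, "x", 1], [2, "y", 2]], columns=["fee", "course", "fee"])

# Default behaviour: group keys (the column names) come back sorted
sorted_cols = df.T.groupby(level=0).first().T

# sort=False keeps each name at the position where it first appeared
original_order = df.T.groupby(level=0, sort=False).first().T

print(list(sorted_cols.columns))     # ['course', 'fee']
print(list(original_order.columns))  # ['fee', 'course']
```

Keep in mind that first() takes the first non-null value within each group, so same-named columns holding different data are merged by name rather than compared by content.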

Drop Duplicated Columns Using DataFrame.loc[] Method

You can also use DataFrame.loc[] with DataFrame.columns.duplicated(). This removes duplicate columns by matching column names only; the column data is not compared.


# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

Yields below output. Note that Courses and Subject are both kept even though they have the same data, because only the column names are compared.


# Output:
   Courses    Fee Duration  Subject  Discount
0    Spark  20000   30days    Spark      1000
1  Pyspark  23000   35days  Pyspark      1500
2   Pandas  25000   40days   Pandas      2000
3    Spark  20000   30days    Spark      1000

Drop Duplicate Columns in Pandas With keep='first'

You can call DataFrame.duplicated() on the transposed DataFrame without any arguments to flag columns that have the same values in every row. It uses the default values subset=None and keep='first'. Negating the result and passing it to DataFrame.loc[] drops the flagged columns. The below example returns four columns after removing duplicate columns from our DataFrame.


# Remove repeated columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

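Yields below output. The result has the same columns as df.T.drop_duplicates().T, and duplicate columns are removed regardless of column names. Because only the mask is computed on the transposed DataFrame, the original column dtypes are kept.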


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  Pyspark  23000   35days      1500
2   Pandas  25000   40days      2000
3    Spark  20000   30days      1000
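
The pattern above can be wrapped in a small reusable function. The helper name drop_duplicate_columns and the sample column names below are hypothetical, and this is a sketch under the assumption that every column should be compared on all rows.

```python
import pandas as pd

def drop_duplicate_columns(df: pd.DataFrame, keep="first") -> pd.DataFrame:
    """Hypothetical helper: drop columns whose values repeat another column."""
    # One boolean flag per column, in column order; True marks a column to drop
    is_dup = df.T.duplicated(keep=keep).to_numpy()
    # A plain boolean array selects by position, so repeated names need no label alignment
    return df.loc[:, ~is_dup]

# Hypothetical sample: "a_copy" repeats the data of "a"
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "a_copy": [1, 2]})

result = drop_duplicate_columns(df)
print(list(result.columns))  # ['a', 'b']
print(result.dtypes)
```

Converting the mask with to_numpy() is a design choice in this sketch: it makes the selection purely positional, which avoids surprises when column names are themselves duplicated.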

If you want to keep the last occurrence of each set of duplicate columns instead of the first, pass the keep argument as "last". For instance, df.loc[:,~df.T.duplicated(keep='last')].


# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

Yields below output.


# Output:
  Duration  Subject    Fee  Discount
0   30days    Spark  20000      1000
1   35days  Pyspark  23000      1500
2   40days   Pandas  25000      2000
3   30days    Spark  20000      1000
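
Besides 'first' and 'last', duplicated() also accepts keep=False, which flags every member of a duplicate set. The sketch below, on a hypothetical DataFrame with assumed column names, keeps only the columns whose data appears exactly once.

```python
import pandas as pd

# Hypothetical sample: "a" and "a_copy" hold identical data
df = pd.DataFrame({"a": [1, 2], "b": [5, 6], "a_copy": [1, 2]})

# keep=False marks all copies as duplicates, so none of them survive the filter
unique_only = df.loc[:, ~df.T.duplicated(keep=False)]

print(list(unique_only.columns))  # ['b']
```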

Use DataFrame.columns.duplicated() to Drop Duplicate Columns

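Lastly, you can collect the repeated column names with DataFrame.columns.duplicated() and pass them to DataFrame.drop(). Note that drop() removes every column carrying a listed name, so both Fee columns are dropped here, including the first occurrence.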


# Use DataFrame.columns.duplicated() 
# To drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

Yields below output.


# Output:
   Courses Duration  Subject  Discount
0    Spark   30days    Spark      1000
1  Pyspark   35days  Pyspark      1500
2   Pandas   40days   Pandas      2000
3    Spark   30days    Spark      1000
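
The difference between dropping by name and filtering with a mask is easy to miss. The following sketch, using a hypothetical DataFrame with assumed column names, contrasts the two on the same data.

```python
import pandas as pd

# Hypothetical sample with a repeated column name
df = pd.DataFrame([[1, "x", 1], [2, "y", 2]], columns=["fee", "course", "fee"])

# Names that occur more than once
dup_names = df.columns[df.columns.duplicated()]

# drop() removes every column with a listed name, first occurrence included
dropped_all = df.drop(columns=dup_names)

# The boolean mask keeps the first occurrence of each name
kept_first = df.loc[:, ~df.columns.duplicated()]

print(list(dropped_all.columns))  # ['course']
print(list(kept_first.columns))   # ['fee', 'course']
```

Use the drop() form only when the repeated columns should disappear entirely; use the mask form when one copy should remain.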

Complete Example of Remove Duplicate Columns


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print(df)

# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)

# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

# Remove repeated columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

# Use DataFrame.columns.duplicated() 
# To drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

FAQ on Drop Duplicate Columns in Pandas DataFrame

What does it mean to have duplicate columns in a Pandas DataFrame?

Duplicate columns are columns in a DataFrame that have the same column names or identical data across multiple columns. Dropping duplicate columns helps in cleaning the data and ensuring there is no redundancy.

How can I drop duplicate columns based on column names?

To remove columns with duplicate names, you can use the loc indexer combined with DataFrame.columns.duplicated(), for example df.loc[:, ~df.columns.duplicated()].

How can I drop columns that have identical data?

To remove columns that have identical data (even if their names are different), you can use DataFrame.T.drop_duplicates().T. This method transposes the DataFrame, drops duplicates, and then transposes it back.

How do I keep the first occurrence and remove the rest?

By default, both methods above keep the first occurrence and remove subsequent duplicates. If you want a different behavior (e.g., keeping the last occurrence), you can adjust it using the keep parameter.

Is there a built-in Pandas function to drop duplicate columns?

Pandas does not have a direct built-in function to drop duplicate columns, but using the DataFrame.T.drop_duplicates().T pattern is a standard and effective workaround.

Conclusion

In this article, you have learned how to drop/remove/delete duplicate columns from a Pandas DataFrame with examples like 1) dropping columns with the same names and data, and 2) dropping columns with different names and the same data in all cells.

Happy Learning !!
