• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:10 mins read
You are currently viewing How to Drop Duplicate Columns in pandas DataFrame

By using pandas.DataFrame.T.drop_duplicates().T you can drop/remove/delete duplicate columns with the same name or a different name. This method removes all columns of the same name beside the first occurrence of the column and also removes columns that have the same data with a different column name. In this article, I will explain several ways to drop duplicate columns from Pandas DataFrame with examples.

Advertisements

Related:

1. Quick Examples of Removing Duplicate Columns in Pandas DataFrame

If you are in a hurry, below are some quick examples of dropping duplicate columns from DataFrame.


# Below are the quick examples

# Example 1: Drop duplicate columns
df2 = df.T.drop_duplicates().T

# Example 2: Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T

# Example 3: Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]

# Example 4: Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]

# Example 5: Keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]

# Example 6: Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

Now, let’s create a DataFrame with a few duplicate rows and columns, execute these examples, and validate the results. Our DataFrame contains duplicate column names Courses, Fee, Duration, Courses, Fee and Discount.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print("DataFrame:\n", df)

Yields below output. Notice from the DataFrame, that the column Fee is exactly duplicate, and columns Courses and Subject has the same data with different column names.

pandas remove duplicate columns

2. Use DataFrame.drop_duplicates() to Drop Duplicate Columns

To drop duplicate columns from pandas DataFrame use df.T.drop_duplicates().T, this removes all columns that have the same data regardless of column names.


# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print("After dropping duplicate columns:\n", df2)

Yields below output.

pandas remove duplicate columns

It’s probably easiest to use a groupby (assuming they have duplicate names). Note that this doesn’t remove columns with different names and the same data.


# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

Yields below output. This returns columns in sorted order.


# Output:
   Courses Discount Duration    Fee  Subject
0    Spark     1000   30days  20000    Spark
1  Pyspark     1500   35days  23000  Pyspark
2   Pandas     2000   40days  25000   Pandas
3    Spark     1000   30days  20000    Spark

3. Drop Duplicated Columns Using DataFrame.loc[] Method

You can also try DataFrame.loc[] with DataFrame.columns.duplicated() methods. This also removes duplicate columns by matching column names and data.


# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

Yields the same output as above. Note that columns from Courses and Subject are not removed even though the columns have the same data.


# Output:
   Courses    Fee Duration  Subject  Discount
0    Spark  20000   30days    Spark      1000
1  Pyspark  23000   35days  Pyspark      1500
2   Pandas  25000   40days   Pandas      2000
3    Spark  20000   30days    Spark      1000

4. Drop Duplicate Columns of Pandas Keep = First

You can use DataFrame.duplicated() without any arguments to drop columns with the same values on all columns. It takes default values subset=None and keep=‘first’. The below example returns four columns after removing duplicate columns in our DataFrame.


# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

Yields the same output as in Section 2. This removes all duplicate columns regardless of column names.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  Pyspark  23000   35days      1500
2   Pandas  25000   40days      2000
3    Spark  20000   30days      1000

If you want to select all the duplicate columns and their last occurrence, you must pass a keep argument as "last". For instance, df.loc[:,~df.T.duplicated(keep='last')].


# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

Yields below output.


# Output:
  Duration  Courses    Fee  Discount
0   30days    Spark  20000      1000
1   35days  Pyspark  23000      1500
2   40days   Pandas  25000      2000

6. Use DataFrame.columns.duplicated() to Drop Duplicate Columns

lastly, try the below approach to dop/remove duplicate columns from pandas DataFrame.


# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

Yields below output.


# Output:
   Courses Duration  Subject  Discount
0    Spark   30days    Spark      1000
1  Pyspark   35days  Pyspark      1500
2   Pandas   40days   Pandas      2000
3    Spark   30days    Spark      1000

7. Complete Example of Remove Duplicate Columns


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print(df)

# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)

# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

Conclusion

In this article, you have learned how to drop/remove/delete duplicate columns from Panda DataFrame with examples like 1) dropping columns with the same names and data. 2) Dropping columns with different columns and the same data on all cells.

Happy Learning !!

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply