Pandas Remove Duplicate Columns From DataFrame

By using pandas.DataFrame.T.drop_duplicates().T you can drop/remove/delete duplicate columns with the same name or a different name. This method removes all columns of the same name besides the first occurrence of the column also removes columns that have the same data with the different column name. In this article, I will explain several ways of how to drop duplicate columns from Pandas DataFrame with examples.

Related:

1. Quick Examples of Remove Duplicate Columns in Pandas DataFrame

If you are in a hurry, below are some quick examples of how to drop duplicate columns in pandas DataFrame.


# Below are quick example

# Drop duplicate columns
df2 = df.T.drop_duplicates().T

# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T

# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]

# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]

# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]

# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

Now, let’s create a DataFrame with a few duplicate rows and columns, execute these examples and validate results. Our DataFrame contains duplicate column names Courses, Fee, Duration, Courses, Fee and Discount.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print(df)

Yields below output. Notice from the DataFrame, the column Fee is exactly duplicate and columns Courses and Subject has the same data with different column names.


   Courses    Fee Duration  Subject    Fee  Discount
0    Spark  20000   30days    Spark  20000      1000
1  Pyspark  23000   35days  Pyspark  23000      1500
2   Pandas  25000   40days   Pandas  25000      2000
3    Spark  20000   30days    Spark  20000      1000

2. Use DataFrame.drop_duplicates() to Remove Duplicate Columns

To drop duplicate columns from pandas DataFrame use df.T.drop_duplicates().T, this removes all columns that have the same data regardless of column names.


# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)

Yields below output.


   Courses    Fee Duration Discount
0    Spark  20000   30days     1000
1  Pyspark  23000   35days     1500
2   Pandas  25000   40days     2000
3    Spark  20000   30days     1000

It’s probably easiest to use a groupby (assuming they have duplicate names). Note that this doesn’t remove columns with different names and the same data.


# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

Yields below output. This returns columns in sorted order.


   Courses Discount Duration    Fee  Subject
0    Spark     1000   30days  20000    Spark
1  Pyspark     1500   35days  23000  Pyspark
2   Pandas     2000   40days  25000   Pandas
3    Spark     1000   30days  20000    Spark

3. Drop Duplicated Columns Using DataFrame.loc[] Method

You can also try DataFrame.loc[] with DataFrame.columns.duplicated() methods. This also removes duplicate columns by matching column names and data.


# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

Yields same output as above. Note that columns from Courses and Subject are not removed even though the columns have the same data.


   Courses    Fee Duration  Subject  Discount
0    Spark  20000   30days    Spark      1000
1  Pyspark  23000   35days  Pyspark      1500
2   Pandas  25000   40days   Pandas      2000
3    Spark  20000   30days    Spark      1000

4. Use DataFrame.loc[] to Drop Duplicate and Keep First Columns

You can use DataFrame.duplicated() without any arguments to drop columns with the same values on all columns. It takes defaults values subset=None and keep=‘first’. The below example returns four columns after removing duplicate columns in our DataFrame.


# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

Yields the same output as in Section 2. This removes all duplicate columns regardless of column names.


   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  Pyspark  23000   35days      1500
2   Pandas  25000   40days      2000
3    Spark  20000   30days      1000

If you want to select all the duplicate columns and their last occurrence, you must pass a keep argument as "last". For instance, df.loc[:,~df.T.duplicated(keep='last')].


# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

Yields below output.


  Duration  Courses    Fee  Discount
0   30days    Spark  20000      1000
1   35days  Pyspark  23000      1500
2   40days   Pandas  25000      2000

6. Use DataFrame.columns.duplicated() to Drop Duplicate Columns

lastly, try the below approach to dop/remove duplicate columns from pandas DataFrame.


# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

Yields below output.


   Courses Duration  Subject  Discount
0    Spark   30days    Spark      1000
1  Pyspark   35days  Pyspark      1500
2   Pandas   40days   Pandas      2000
3    Spark   30days    Spark      1000

7. Complete Example to Remove Duplicate Columns in pandas DataFrame


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000], 
                 ["Pyspark",23000,"35days","Pyspark",23000,1500], 
                 ["Pandas",25000, "40days","Pandas",25000,2000],
                 ["Spark",20000, "30days","Spark",20000,1000]
               ]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print(df)

# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)

# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)

# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)

# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)

# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)

# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)

Conclusion

In this article, you have learned how to drop/remove/delete duplicate columns from Panda DataFrame with examples like 1) dropping columns with the same names and data. 2) Dropping columns with different columns and same data on all cells.

Happy Learning !!

You May Also Like

References

Leave a Reply

Pandas Remove Duplicate Columns From DataFrame