By using pandas.DataFrame.T.drop_duplicates().T
you can drop/remove/delete duplicate columns with the same name or a different name. This method removes all columns of the same name beside the first occurrence of the column and also removes columns that have the same data with a different column name. In this article, I will explain several ways to drop duplicate columns from Pandas DataFrame with examples.
Related:
1. Quick Examples of Removing Duplicate Columns in Pandas DataFrame
If you are in a hurry, below are some quick examples of dropping duplicate columns from DataFrame.
# Below are the quick examples
# Example 1: Drop duplicate columns
df2 = df.T.drop_duplicates().T
# Example 2: Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
# Example 3: Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
# Example 4: Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
# Example 5: Keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
# Example 6: Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
Now, let’s create a DataFrame with a few duplicate rows and columns, execute these examples, and validate the results. Our DataFrame contains duplicate column names Courses
, Fee
, Duration
, Courses
, Fee
and Discount
.
# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000],
["Pyspark",23000,"35days","Pyspark",23000,1500],
["Pandas",25000, "40days","Pandas",25000,2000],
["Spark",20000, "30days","Spark",20000,1000]
]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print("DataFrame:\n", df)
Yields below output. Notice from the DataFrame, that the column Fee is exactly duplicate, and columns Courses
and Subject
has the same data with different column names.
2. Use DataFrame.drop_duplicates() to Drop Duplicate Columns
To drop duplicate columns from pandas DataFrame use df.T.drop_duplicates().T
, this removes all columns that have the same data regardless of column names.
# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print("After dropping duplicate columns:\n", df2)
Yields below output.
It’s probably easiest to use a groupby (assuming they have duplicate names). Note that this doesn’t remove columns with different names and the same data.
# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)
Yields below output. This returns columns in sorted order.
# Output:
Courses Discount Duration Fee Subject
0 Spark 1000 30days 20000 Spark
1 Pyspark 1500 35days 23000 Pyspark
2 Pandas 2000 40days 25000 Pandas
3 Spark 1000 30days 20000 Spark
3. Drop Duplicated Columns Using DataFrame.loc[] Method
You can also try DataFrame.loc[] with DataFrame.columns.duplicated() methods. This also removes duplicate columns by matching column names and data.
# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)
Yields the same output as above. Note that columns from Courses
and Subject
are not removed even though the columns have the same data.
# Output:
Courses Fee Duration Subject Discount
0 Spark 20000 30days Spark 1000
1 Pyspark 23000 35days Pyspark 1500
2 Pandas 25000 40days Pandas 2000
3 Spark 20000 30days Spark 1000
4. Drop Duplicate Columns of Pandas Keep = First
You can use DataFrame.duplicated
() without any arguments to drop columns with the same values on all columns. It takes default values subset=None
and keep=‘first’
. The below example returns four columns after removing duplicate columns in our DataFrame.
# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)
Yields the same output as in Section 2. This removes all duplicate columns regardless of column names.
# Output:
Courses Fee Duration Discount
0 Spark 20000 30days 1000
1 Pyspark 23000 35days 1500
2 Pandas 25000 40days 2000
3 Spark 20000 30days 1000
If you want to select all the duplicate columns and their last occurrence, you must pass a keep argument as "last"
. For instance, df.loc[:,~df.T.duplicated(keep='last')]
.
# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)
Yields below output.
# Output:
Duration Courses Fee Discount
0 30days Spark 20000 1000
1 35days Pyspark 23000 1500
2 40days Pandas 25000 2000
6. Use DataFrame.columns.duplicated() to Drop Duplicate Columns
lastly, try the below approach to dop/remove duplicate columns from pandas DataFrame.
# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)
Yields below output.
# Output:
Courses Duration Subject Discount
0 Spark 30days Spark 1000
1 Pyspark 35days Pyspark 1500
2 Pandas 40days Pandas 2000
3 Spark 30days Spark 1000
7. Complete Example of Remove Duplicate Columns
# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days","Spark",20000,1000],
["Pyspark",23000,"35days","Pyspark",23000,1500],
["Pandas",25000, "40days","Pandas",25000,2000],
["Spark",20000, "30days","Spark",20000,1000]
]
columns = ["Courses","Fee", "Duration", "Subject","Fee", "Discount" ]
df=pd.DataFrame(technologies, columns= columns)
print(df)
# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)
# Use groupby() to drop duplicate columns
df2 = df.T.groupby(level=0).first().T
print(df2)
# Remove duplicate columns pandas DataFrame
df2 = df.loc[:,~df.columns.duplicated()]
print(df2)
# Remove repeted columns in a DataFrame
df2 = df.loc[:,~df.T.duplicated(keep='first')]
print(df2)
# keep last duplicate columns
df2 = df.loc[:,~df.T.duplicated(keep='last')]
print(df2)
# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)
Conclusion
In this article, you have learned how to drop/remove/delete duplicate columns from Panda DataFrame with examples like 1) dropping columns with the same names and data. 2) Dropping columns with different columns and the same data on all cells.
Happy Learning !!
Related Articles
- Pandas Drop Multiple Columns From DataFrame
- Pandas Drop First/Last N Columns From DataFrame
- How to drop the Pandas column by index?
- Drop Pandas first column from DataFrame.
- Pandas Drop Columns with NaN or None Values
- Drop the last column from the DataFrame
- Pandas Drop Index Column Explained
- Pandas Drop the First Row of DataFrame
- Pandas Drop Rows by Index
- Drop multiple columns by index