How do you remove duplicates or duplicate rows from an R DataFrame (data.frame)? There are multiple ways to remove duplicate rows in R: based on a single column, on selected columns, or on all columns. In this article, I will explain all these cases by using functions from R base, dplyr, and data.table.
- Remove duplicates using R base functions
- Remove duplicate rows using dplyr
- Remove duplicate rows using data.table
1. Quick Examples of Remove Duplicate Rows
The following are quick examples of how to remove duplicates or duplicate rows from an R DataFrame (data.frame).
# Quick Examples
# Remove duplicate rows
df2 <- df[!duplicated(df), ]
# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]
# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )
# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
# Remove duplicates on selected columns
df2 <- df %>% distinct(id,pages, .keep_all = TRUE)
# using data.table
library(data.table)
dt <- data.table(df)
#Remove duplicates on specific column
dt2 <- unique(dt, by = "id")
Let’s create an R DataFrame, run these examples and explore the output. If you already have data in CSV you can easily import CSV files to R DataFrame. Also, refer to Import Excel File into R.
# Create dataframe
df=data.frame(id=c(11,11,33,44,44),
pages=c(32,32,33,22,22),
name=c("spark","spark","R","java","jsp"),
chapters=c(76,76,11,15,15),
price=c(144,144,321,567,567))
df
This yields the output below. Note that the first two rows are duplicates on all column values, and the last two rows are duplicates on the columns id, pages, chapters, and price.
# Output
id pages name chapters price
1 11 32 spark 76 144
2 11 32 spark 76 144
3 33 33 R 11 321
4 44 22 java 15 567
5 44 22 jsp 15 567
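Before removing anything, it can help to confirm how many duplicates the data frame holds. The following is a minimal sketch using the base R functions duplicated() and anyDuplicated() on the same sample data; the variable names n_dup and first_dup are only illustrative.

```r
# Recreate the sample data frame from above
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Count rows that repeat an earlier row on all columns
n_dup <- sum(duplicated(df))
n_dup        # 1

# Index of the first duplicated row (0 if there are none)
first_dup <- anyDuplicated(df)
first_dup    # 2
```

Rows 4 and 5 are not counted here because they differ in the name column.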
2. Remove Duplicates using R Base Functions
R base provides the duplicated() and unique() functions to remove duplicates in an R DataFrame (data.frame). By using these two functions, we can delete duplicate rows by considering all columns, a single column, or selected columns.
2.1 Remove Duplicate Rows
duplicated() is an R base function that takes a vector or data.frame as input and returns a logical vector marking each element or row that repeats an earlier one. Negating the result and using it as a row index removes the duplicate rows from the data.frame and keeps the first occurrence of each. For example, the first 2 rows of my data frame above are duplicates; running the example below eliminates the duplicate record and returns 1 record from the first 2.
# Remove duplicate rows
df2 <- df[!duplicated(df), ]
df2
# Output
# id pages name chapters price
#1 11 32 spark 76 144
#3 33 33 R 11 321
#4 44 22 java 15 567
#5 44 22 jsp 15 567
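To see what the negation is acting on, the sketch below prints the logical vector that duplicated() returns for the sample data. It also shows the fromLast argument of duplicated(), which marks earlier occurrences instead, so the last copy of each row is kept. The names flags and df_last are only illustrative.

```r
# Recreate the sample data frame from above
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# TRUE marks a row that repeats an earlier row
flags <- duplicated(df)
flags   # FALSE TRUE FALSE FALSE FALSE

# Keep the last occurrence of each duplicated row instead of the first
df_last <- df[!duplicated(df, fromLast = TRUE), ]
rownames(df_last)   # "2" "3" "4" "5"
```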
To remove duplicates based on a single column, pass that column (for example, df$id) to duplicated().
# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]
df2
# Output
# id pages name chapters price
#1 11 32 spark 76 144
#3 33 33 R 11 321
#4 44 22 java 15 567
2.2 Remove Duplicates on Selected Columns
Use the unique() function to remove duplicates from the selected columns of the R data frame. The following example selects the columns id, pages, chapters, and price and removes duplicates. Note that the result contains only the selected columns.
# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )
df2
# Output
# id pages chapters price
#1 11 32 76 144
#3 33 33 11 321
#4 44 22 15 567
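The unique() call above drops the name column because it only sees the selected columns. If you want to deduplicate on selected columns but keep every column, one option in base R is to run duplicated() on the column subset and use it to index the full data frame. This is a sketch of that idea; key_cols is an illustrative name.

```r
# Recreate the sample data frame from above
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Columns that decide whether two rows count as duplicates
key_cols <- c("id", "pages", "chapters", "price")

# Flag duplicates on the key columns only, but keep all columns in the result
df2 <- df[!duplicated(df[, key_cols]), ]
df2$name   # "spark" "R" "java"
```

The first row of each group is kept, so for id 44 the row with name "java" survives and the "jsp" row is dropped.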
3. Remove Duplicate Rows using dplyr
The dplyr package provides the distinct() function to remove duplicates. To use it, load the package with library("dplyr"). In case you don’t have this package, install it using install.packages("dplyr").
For bigger data sets it is often best to use the functions from the dplyr package, as they can perform roughly 30% faster, although the actual gain depends on the data. The dplyr package uses C++ code for parts of its evaluation.
3.1 Use distinct() to Remove Duplicates
The distinct() function selects unique rows from a data frame by removing all duplicates. It is similar to the R base unique() function but tends to perform faster on large datasets, so use it when you want better performance.
# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
df2
# Output
# id pages name chapters price
#1 11 32 spark 76 144
#2 33 33 R 11 321
#3 44 22 java 15 567
#4 44 22 jsp 15 567
Here, we use the infix operator %>% from magrittr. It passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is converted into f(x, y), so the result from the left-hand side is “piped” into the right-hand side.
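The equivalence described above can be checked directly. This short sketch compares a piped call with the plain function call, first with round() and then with distinct() on the sample data; the variable names are only illustrative.

```r
library(dplyr)

# Recreate the sample data frame from above
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# x %>% f(y) is the same as f(x, y)
x <- c(1.234, 5.678)
same_round <- identical(x %>% round(1), round(x, 1))
same_round      # TRUE

# The piped distinct() call is the same as calling distinct(df) directly
same_distinct <- identical(df %>% distinct(), distinct(df))
same_distinct   # TRUE
```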
3.2 Remove Duplicates on Specific column
Similarly, you can use distinct() to remove duplicate rows based on a single column. Here, I am using the optional argument .keep_all = TRUE, which keeps all variables in .data. If a combination of the columns passed in ... is not distinct, this keeps the first row of values.
#Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2
# Output
# id pages name chapters price
#1 11 32 spark 76 144
#2 33 33 R 11 321
#3 44 22 java 15 567
3.3 Get Unique Rows on Selected Columns
If you want to get unique rows based on selected columns of the R data.frame, just pass those columns as arguments to the distinct() function.
#Remove duplicates on selected columns
df2 <- df %>% distinct(id,pages, .keep_all = TRUE)
df2
# Output
# id pages name chapters price
#1 11 32 spark 76 144
#2 33 33 R 11 321
#3 44 22 java 15 567
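For comparison, here is a small sketch of what happens when .keep_all is left at its default of FALSE: distinct() then returns only the columns you passed in, much like the base unique() example on a column subset. The name df_keys is only illustrative.

```r
library(dplyr)

# Recreate the sample data frame from above
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Without .keep_all = TRUE, only the selected columns are returned
df_keys <- df %>% distinct(id, pages)
names(df_keys)   # "id" "pages"
nrow(df_keys)    # 3
```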
4. Remove Duplicate Rows using data.table
Use the unique() function from the data.table package to eliminate duplicates. data.table is a package for working with tabular data in the R programming language. It provides the efficient data.table object, an enhanced and better-performing version of the default data.frame.
# using data.table
library(data.table)
dt <- data.table(df)
#Remove duplicates on specific column
dt2 <- unique(dt, by = "id")
dt2
# Output
# id pages name chapters price
#1: 11 32 spark 76 144
#2: 33 33 R 11 321
#3: 44 22 java 15 567
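The by argument also accepts several column names, and leaving it out compares rows on all columns in current data.table versions. The following sketch shows both variants on the sample data; dt_all and dt_keys are illustrative names.

```r
library(data.table)

# Recreate the sample data and convert it to a data.table
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))
dt <- as.data.table(df)

# Remove rows that are duplicated on all columns
dt_all <- unique(dt)
nrow(dt_all)    # 4

# Remove duplicates on selected columns, keeping the first row of each group
dt_keys <- unique(dt, by = c("id", "pages", "chapters", "price"))
dt_keys$name    # "spark" "R" "java"
```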
Conclusion
In this article, you have learned how to remove duplicates or duplicate rows in R by using the R base functions duplicated() and unique(), the dplyr function distinct(), and finally the unique() function from data.table. If performance matters, use the functions from either dplyr or data.table.