
How do you remove duplicates or duplicate rows from an R DataFrame (data.frame)? There are multiple ways to remove duplicate rows in R based on a single column, selected columns, or all columns. In this article, I will explain all these examples using functions from R base, dplyr, and data.table.


1. Quick Examples of Remove Duplicate Rows

The following are quick examples of how to remove duplicates or duplicate rows from an R DataFrame (data.frame).


# Quick Examples

# Remove duplicate rows
df2 <- df[!duplicated(df), ]

# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]

# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )

# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()

# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)

# Remove duplicates on selected columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)

# using data.table
library(data.table)
dt <- data.table(df)
# Remove duplicates on specific column
dt2 <- unique(dt, by = "id")

Let’s create an R DataFrame, run these examples, and explore the output. If you already have your data in a CSV file, you can easily import CSV files into an R DataFrame. Also, refer to Import Excel File into R.


# Create dataframe
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))
df

This yields the output below. Note that the first two rows are duplicates across all columns, and the last two rows are duplicates on the columns id, pages, chapters, and price.


# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 11    32 spark       76   144
3 33    33     R       11   321
4 44    22  java       15   567
5 44    22   jsp       15   567

2. Remove Duplicates using R Base Functions

R base provides the duplicated() and unique() functions to remove duplicates from an R DataFrame (data.frame). Using these two functions, we can delete duplicate rows considering all columns, a single column, or selected columns.

2.1 Remove Duplicate Rows

duplicated() is an R base function that takes a vector or data.frame as input and flags the rows that are duplicates; by negating the result, you keep only the unique rows of the data.frame. In the data frame above, the first two rows are duplicates, so running the example below eliminates the duplicate row and returns a data frame with unique rows.


# Remove duplicate rows
df2 <- df[!duplicated(df), ]
df2

# Output
#  id pages  name chapters price
# 1 11    32 spark       76   144
# 3 33    33     R       11   321
# 4 44    22  java       15   567
# 5 44    22   jsp       15   567

In case you want to remove duplicates based on a single column, pass that column (for example, df$id) to duplicated().


# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]
df2

# Output
#  id pages  name chapters price
# 1 11    32 spark       76   144
# 3 33    33     R       11   321
# 4 44    22  java       15   567

2.2 Remove Duplicates on Selected Columns

Use the unique() function to remove duplicates based on selected columns of the R data frame. The following example removes duplicates considering the columns id, pages, chapters, and price. Note that the result contains only the selected columns.


# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )
df2

# Output
#  id pages chapters price
# 1 11    32       76   144
# 3 33    33       11   321
# 4 44    22       15   567

3. Remove Duplicate Rows using dplyr

The dplyr package provides the distinct() function to remove duplicates. To use it, load the package with library("dplyr"). In case you don’t have this package, install it using install.packages("dplyr").

For bigger data sets, it is generally better to use the dplyr functions, as they tend to perform faster; much of dplyr is implemented in C++.
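
If you want to check the difference yourself, a simple sketch is to time both approaches with system.time() on a larger data frame. The big data frame below is made up purely for illustration, and your timings will vary by machine and data.

# A rough way to compare the two approaches on your own machine.
# The data frame below is a made-up example for illustration only.
big <- data.frame(id  = sample(1:1000, 1e6, replace = TRUE),
                  val = sample(letters, 1e6, replace = TRUE))

system.time(big[!duplicated(big), ])   # R base
system.time(dplyr::distinct(big))      # dplyr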

3.1 Use distinct() to Remove Duplicates

The distinct() function selects unique rows from a data frame by removing all duplicates. It is similar to the R base unique() function, but it typically performs faster on large datasets, so use it when you want better performance.


# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
df2

# Output
#  id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567
# 4 44    22   jsp       15   567

Here, we use the infix operator %>% from magrittr, which passes the left-hand side of the operator as the first argument to the function on the right-hand side. For example, x %>% f(y) is converted into f(x, y), so the result from the left-hand side is “piped” into the right-hand side.
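
As a quick illustration, the two calls below are equivalent; the pipe simply rewrites the first form into the second.

# These two expressions produce the same result
df %>% distinct()
distinct(df)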

3.2 Remove Duplicates on Specific Column

Similarly, you can also use distinct() to remove duplicate rows based on a single column. Here, I use the optional argument .keep_all = TRUE, which keeps all variables in .data. If a combination of the columns given in ... is not distinct, the first row of values is kept.


# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2

# Output
#  id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567

3.3 Get Unique Rows on Selected Columns

If you want to get unique rows based on selected columns of the R data.frame, just pass the column names as arguments to the distinct() function.


# Remove duplicates on selected columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2

# Output
#  id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567

4. Remove Duplicate Rows using data.table

Use the unique() function from the data.table package to eliminate duplicates. data.table is a package for working with tabular data in R; it provides the efficient data.table object, a much improved, better-performing version of the default data.frame.


# using data.table
library(data.table)
dt <- data.table(df)
# Remove duplicates on specific column
dt2 <- unique(dt, by = "id")
dt2

# Output
#   id pages  name chapters price
#1: 11    32 spark       76   144
#2: 33    33     R       11   321
#3: 44    22  java       15   567

Frequently Asked Questions on Remove Duplicate Rows

How do I identify and count duplicate rows in a data frame in R?

You can use the duplicated() function to identify duplicates and sum(duplicated(df)) to count them in a data frame df.
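
For example, on the data frame used in this article:

# Logical vector marking duplicate rows, and the count of duplicates
duplicated(df)       # FALSE  TRUE FALSE FALSE FALSE
sum(duplicated(df))  # 1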

What function is commonly used to remove duplicate rows in R?

The unique() function is commonly used to remove duplicate rows from a data frame in R.

How can I remove duplicates based on specific columns in R?

You can apply duplicated() to just the columns you care about and use the resulting logical vector to filter the data frame, or use dplyr’s distinct() with those column names and .keep_all = TRUE.
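
For example, to remove duplicates based on the id and pages columns:

# Keep rows that are unique on the id and pages columns
df2 <- df[!duplicated(df[, c("id", "pages")]), ]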

How do I keep the first occurrence of each unique row and remove subsequent duplicates in R?

By default, duplicated() marks the second and subsequent occurrences of each row, so negating it with boolean indexing keeps the first occurrence of each unique row. Setting fromLast = TRUE would instead keep the last occurrence.
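
For example:

# Keep the first occurrence of each unique row
df2 <- df[!duplicated(df), ]

# Keep the last occurrence instead
df2 <- df[!duplicated(df, fromLast = TRUE), ]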

Conclusion

In this article, you have learned how to remove duplicates or duplicate rows in R by using the R base functions duplicated() and unique(), the dplyr function distinct(), and finally the unique() function from data.table. If performance matters, use the functions from dplyr or data.table.

