How to Remove Duplicate Rows in R

How do you remove duplicate rows from an R data frame (data.frame)? There are multiple ways to remove duplicates in R: based on a single column, on selected columns, or on all columns. In this article, I will explain each of these cases using functions from base R, dplyr, and data.table.

1. Quick Examples of Removing Duplicate Rows

The following are quick examples of how to remove duplicate rows from an R data frame (data.frame):


# Quick Examples

# Remove duplicate rows
df2 <- df[!duplicated(df), ]

# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]

# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )

# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()

# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)

# Remove duplicates on selected columns
df2 <- df %>% distinct(id,pages, .keep_all = TRUE)

# using data.table
library(data.table)
dt <- data.table(df)
#Remove duplicates on specific column
dt2 <- unique(dt, by = "id")

Let's create an R data frame, run these examples, and explore the output. If you already have the data in a CSV file, you can import it into an R data frame instead. Also, refer to Import Excel File into R.


# Create dataframe
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))
df

This yields the output below. Note that the first two rows are duplicates across all columns, and the last two rows are duplicates on the columns id, pages, chapters, and price (they differ only in name).


# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 11    32 spark       76   144
3 33    33     R       11   321
4 44    22  java       15   567
5 44    22   jsp       15   567

2. Remove Duplicates using R Base Functions

Base R provides the duplicated() and unique() functions to remove duplicates from an R data frame (data.frame). With these two functions we can delete duplicate rows by considering all columns, a single column, or selected columns.

2.1 Remove Duplicate Rows

duplicated() is a base R function that takes a vector or data.frame as input and returns a logical vector that is TRUE for every element or row that repeats an earlier one. Negating this vector and using it as a row index keeps the first occurrence of each row and drops the repeats. For example, the first two rows of the data frame above are duplicates, so running the example below keeps only one of them.


# Remove duplicate rows
df2 <- df[!duplicated(df), ]
df2

# Output
#  id pages  name chapters price
#1 11    32 spark       76   144
#3 33    33     R       11   321
#4 44    22  java       15   567
#5 44    22   jsp       15   567
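
To see what is happening here, it can help to inspect the logical vector that duplicated() returns before it is negated. The sketch below rebuilds the sample data frame so that it runs on its own. The variable names dup_flags, dup_rows, and keep_last are only illustrative, and fromLast = TRUE is a standard argument of duplicated() that flags the earlier occurrences instead of the later ones.

```r
# Rebuild the sample data frame from this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(df)
print(dup_flags)
# [1] FALSE  TRUE FALSE FALSE FALSE

# The rows that df[!duplicated(df), ] would drop
dup_rows <- df[dup_flags, ]

# Keep the last occurrence of each duplicate instead of the first
keep_last <- df[!duplicated(df, fromLast = TRUE), ]
print(rownames(keep_last))
# [1] "2" "3" "4" "5"
```

Rows 4 and 5 are not flagged because they differ in the name column, which is why both survive when all columns are compared.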

To remove duplicates based on a single column, pass that column (for example df$id) to duplicated() instead of the whole data frame.


# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]
df2

# Output
#  id pages  name chapters price
#1 11    32 spark       76   144
#3 33    33     R       11   321
#4 44    22  java       15   567

2.2 Remove Duplicates on Selected Columns

Use the unique() function on a subset of columns to get the distinct combinations of those columns. The following example selects the columns id, pages, chapters, and price and removes the duplicates. Note that the result contains only the selected columns.


# Remove duplicates on selected columns
df2 <- unique( df[ , c('id','pages','chapters','price') ] )
df2

# Output
#  id pages chapters price
#1 11    32       76   144
#3 33    33       11   321
#4 44    22       15   567
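
If you want to compare only the selected columns but still keep every column in the result, one option is to combine the two base R ideas shown so far: call duplicated() on the column subset and use the result to index the full data frame. This is a minimal sketch; key_cols and df3 are illustrative names that do not appear elsewhere in the article.

```r
# Rebuild the sample data frame from this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Columns that decide whether two rows count as duplicates
key_cols <- c("id", "pages", "chapters", "price")

# Flag duplicates on the subset, then index the full data frame
df3 <- df[!duplicated(df[, key_cols]), ]
print(df3)
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 3 33    33     R       11   321
# 4 44    22  java       15   567
```

Unlike unique() on the subset, this keeps the name column, taking its value from the first row of each group.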

3. Remove Duplicate Rows using dplyr

The dplyr package provides the distinct() function to remove duplicates. To use it, load the package with library("dplyr"). In case you don’t have this package, install it using install.packages("dplyr").

For bigger data sets, the dplyr functions can be faster than their base R equivalents, in part because dplyr implements some of its work in C++. The actual gain depends on the data and the package version, so it is worth measuring on your own data set.
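
One rough way to check the speed difference on your own machine is to time both approaches with system.time(). The sketch below uses a made-up data frame (big, with hypothetical columns id and grp) purely for illustration. Timings vary between machines and package versions, so no expected numbers are shown.

```r
library(dplyr)

# Hypothetical test data with many repeated rows
set.seed(42)
n <- 100000
big <- data.frame(id = sample(1:1000, n, replace = TRUE),
                  grp = sample(letters, n, replace = TRUE))

# Time the base R approach
t_base <- system.time(res_base <- big[!duplicated(big), ])

# Time the dplyr approach
t_dplyr <- system.time(res_dplyr <- distinct(big))

# Elapsed seconds for each approach
print(t_base["elapsed"])
print(t_dplyr["elapsed"])

# Both approaches should keep the same number of rows
print(nrow(res_base) == nrow(res_dplyr))
```

A single system.time() call is noisy, so repeat the measurement a few times before drawing conclusions.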

3.1 Use distinct() to Remove Duplicates

distinct() selects the unique rows of a data frame by removing all duplicates. It is similar to the base R unique() function but can perform better on large data sets.


# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
df2

# Output
#  id pages  name chapters price
#1 11    32 spark       76   144
#2 33    33     R       11   321
#3 44    22  java       15   567
#4 44    22   jsp       15   567

Here, we use the pipe operator %>% from magrittr (dplyr re-exports it), which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is evaluated as f(x, y), so the result from the left-hand side is “piped” into the right-hand side.
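
As a small illustration of the pipe, the following sketch calls distinct() both with and without %>% and checks that the two results match. The names piped and direct are illustrative only.

```r
library(dplyr)

# Rebuild the sample data frame from this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Piped form: df becomes the first argument of distinct()
piped <- df %>% distinct()

# Equivalent direct call
direct <- distinct(df)

# Both forms produce the same data frame
print(identical(piped, direct))
```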

3.2 Remove Duplicates on a Specific Column

Similarly, you can use distinct() to remove duplicates based on a single column. Here, I am using the optional argument .keep_all = TRUE, which keeps all columns of the data frame in the result. If several rows share the same value in the specified column, only the first of those rows is kept.


#Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2

# Output
#  id pages  name chapters price
#1 11    32 spark       76   144
#2 33    33     R       11   321
#3 44    22  java       15   567
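
To make the effect of .keep_all concrete, the sketch below runs distinct(id) with and without the argument and compares the column names of the results. The names ids_only and full_rows are illustrative.

```r
library(dplyr)

# Rebuild the sample data frame from this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Without .keep_all, only the id column is returned
ids_only <- df %>% distinct(id)
print(names(ids_only))
# [1] "id"

# With .keep_all = TRUE, every column of the first matching row is kept
full_rows <- df %>% distinct(id, .keep_all = TRUE)
print(names(full_rows))
# [1] "id"       "pages"    "name"     "chapters" "price"
```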

3.3 Get Unique Rows on Selected Columns

If you want to get unique rows based on selected columns of the R data.frame, pass those columns as arguments to distinct().


# Remove duplicates on selected columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2

# Output
#  id pages  name chapters price
#1 11    32 spark       76   144
#2 33    33     R       11   321
#3 44    22  java       15   567

4. Remove Duplicate Rows using data.table

Use the unique() function from the data.table package to eliminate duplicates. data.table is a package for working with tabular data in R. It provides the data.table object, an enhanced version of the default data.frame that is designed for better performance.


# using data.table
library(data.table)
dt <- data.table(df)
#Remove duplicates on specific column
dt2 <- unique(dt, by = "id")
dt2

# Output
#   id pages  name chapters price
#1: 11    32 spark       76   144
#2: 33    33     R       11   321
#3: 44    22  java       15   567
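
The by argument also accepts several column names, and leaving it out compares all columns. The sketch below shows both variants on the same sample data; dt_all and dt_sel are illustrative names.

```r
library(data.table)

# Build the sample data directly as a data.table
dt <- data.table(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Remove duplicate rows considering all columns
dt_all <- unique(dt)
print(nrow(dt_all))
# [1] 4

# Remove duplicates on selected columns while keeping every column
dt_sel <- unique(dt, by = c("id", "pages", "chapters", "price"))
print(nrow(dt_sel))
# [1] 3
```

As with distinct(..., .keep_all = TRUE), the by variant returns the first row of each group with all of its columns.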

Conclusion

In this article, you have learned how to remove duplicate rows in R by using the base R functions duplicated() and unique(), the distinct() function from the dplyr package, and the unique() function from data.table. If performance matters, use the functions from either dplyr or data.table.


NNK
