distinct() is a function from the dplyr package that selects distinct (unique) rows from an R data frame. In this article, I will explain its syntax and usage, and walk through examples of selecting distinct rows.
This function also removes duplicates from tibbles and from lazy data frames such as those provided by dbplyr or dtplyr. For more examples of dplyr functions, refer to the dplyr tutorial.
1. Syntax of dplyr distinct()
The following is the syntax of the dplyr distinct() function.
# Syntax of distinct()
distinct(.data, ..., .keep_all = FALSE)
dplyr also provides the scoped variants distinct_all(), distinct_at(), and distinct_if(), which recent dplyr releases mark as superseded. In this article, I will mainly focus on the syntax above.
# Additional verbs
distinct_all(.tbl, .funs = list(), ..., .keep_all = FALSE)
distinct_at(.tbl, .vars, .funs = list(), ..., .keep_all = FALSE)
distinct_if(.tbl, .predicate, .funs = list(), ..., .keep_all = FALSE)
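The dplyr documentation suggests using pick() or across() inside distinct() in place of the scoped verbs. The following is a minimal sketch of that equivalence, assuming dplyr 1.1.0 or later (needed for pick()). The sample data frame repeats the values used in this article, and the names old_way and new_way are only illustrative.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Scoped verb: distinct rows based on the id and pages columns
old_way <- distinct_at(df, vars(id, pages))

# Suggested replacement (assumes dplyr >= 1.1.0 for pick())
new_way <- distinct(df, pick(id, pages))

# Compare the two results
all.equal(old_way, new_way)
```

On older dplyr versions without pick(), passing the column names directly, as in distinct(df, id, pages), covers the same case.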
1.1 Parameters
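- .data – A data.frame or an extension of data.frame, for example a tibble, or a lazy data frame from dbplyr or dtplyr.
- ... – Optional variables used to determine unique rows. If omitted, all columns are used.
- .keep_all – If TRUE, keep all variables/columns of the input data frame. If a combination of ... is not distinct, the first row of values is kept.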
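The ... argument is not limited to existing column names. It also accepts expressions, in which case distinct() works on the computed values. Here is a small sketch of that behavior. The column name long_book and the threshold of 30 pages are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Distinct values of a computed variable (long_book is a made-up name)
computed <- df %>% distinct(long_book = pages > 30)
computed
```

The result has a single column, long_book, with one row per distinct value of the expression.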
1.2 Return Type
- This returns an object of the same type as .data.
- The rows of the result are a subset of the rows of the input data frame, in the same order.
- Data frame attributes are preserved.
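To make the return type concrete, the sketch below checks the class of the result for a plain data frame and for a tibble. The variable names res_df and res_tbl are illustrative, and the sample data repeats the values used in this article.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# A plain data.frame in gives a plain data.frame out
res_df <- distinct(df)
class(res_df)

# A tibble in gives a tibble out
res_tbl <- distinct(as_tibble(df))
class(res_tbl)

# The result never has more rows than the input
nrow(res_df) <= nrow(df)
```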
2. distinct() of All Columns
The distinct() function selects unique rows from the input data frame. When you do not pass any column names as arguments, it returns unique rows by comparing values across all columns. The examples in this article use the data frame df created in the Complete Example section.
# Load dplyr package
library(dplyr)
# distinct() usage on all columns
df2 <- df %>% distinct()
df2
This yields the output below. The example uses the infix pipe operator %>% from magrittr, which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is evaluated as f(x, y), so the result of the left-hand side is “piped” into the right-hand side.
# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
4 44    22   jsp       15   567
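Since x %>% f(y) is just another way of writing f(x, y), the piped call and the direct call should give the same result. The short sketch below checks this. The names piped and direct are illustrative only.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Piped form: df becomes the first argument of distinct()
piped <- df %>% distinct()

# Direct form: df is passed explicitly as .data
direct <- distinct(df)

# Both calls produce the same data frame
identical(piped, direct)
```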
3. Distinct Rows of Selected Columns
You can also select distinct rows based on specific columns by passing the column names to distinct(). By default, the result contains only the columns you passed, not the full data frame.
# Distinct on select columns
df2 <- df %>% distinct(id, pages)
df2
This yields the following output.
# Output
  id pages
1 11    32
2 33    33
3 44    22
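One way to think about distinct() on selected columns is as a select() followed by distinct() on all remaining columns. The sketch below compares the two approaches. It is an illustration of the default behavior, not a description of how dplyr implements it, and the variable names are made up.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Distinct combinations of id and pages in one step
via_distinct <- df %>% distinct(id, pages)

# Same idea in two steps: keep the columns, then drop duplicate rows
via_select <- df %>% select(id, pages) %>% distinct()

# Compare the two results
all.equal(via_distinct, via_select)
```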
4. Using .keep_all
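Passing the .keep_all = TRUE argument returns all columns of the data frame instead of only the columns used for the distinct check. The default value is FALSE. When several rows share the same combination of values, the first of those rows is kept. Let’s run the above example with the .keep_all = TRUE argument and check the output.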
# Distinct with keep_all (keep all columns)
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2
This yields the following output.
# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
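Notice that the jsp row is dropped because it shares id and pages with the java row, and java comes first. Since .keep_all = TRUE keeps the first row of each combination, sorting the data before calling distinct() changes which row survives. The sketch below illustrates this with arrange(). Sorting by name in descending order is an arbitrary choice for demonstration.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Original row order: the java row comes first for id 44
first_seen <- df %>% distinct(id, pages, .keep_all = TRUE)
first_seen$name[first_seen$id == 44]

# Sort by name in descending order first: jsp now precedes java
reordered <- df %>%
  arrange(desc(name)) %>%
  distinct(id, pages, .keep_all = TRUE)
reordered$name[reordered$id == 44]
```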
4.1 Distinct Rows of Single Column
Finally, you can also perform distinct on a single column. To return all columns rather than only that column, add the .keep_all = TRUE argument.
# Distinct of single column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2
This yields the following output.
# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
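Without .keep_all = TRUE, distinct() on a single column returns a one-column data frame rather than a vector. The sketch below shows one way to get the values as a plain vector with pull() and compares them with base R's unique(). The variable names are illustrative.

```r
library(dplyr)

# Sample data (same values as the data frame used in this article)
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# One-column data frame with the distinct ids
ids_df <- df %>% distinct(id)
ids_df

# Extract the column as a plain vector
ids_vec <- ids_df %>% pull(id)

# Same values as base R's unique() on the column
identical(ids_vec, unique(df$id))

# Number of distinct ids
n_distinct(df$id)
```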
5. Complete Example
The following is a complete example of how to use the dplyr distinct() function.
# Create data frame
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))
df
# Load library dplyr
library(dplyr)
# Distinct rows
df2 <- df %>% distinct()
df2
# Distinct on selected columns
df2 <- df %>% distinct(id, pages)
df2
# Keep all columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2
# Distinct on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2
6. Conclusion
In this article, you have learned the syntax, arguments, and return value of the dplyr distinct() function, and how to use it to select distinct rows from a data frame.