dplyr distinct() Function Usage & Examples

distinct() is a function from the dplyr package that selects distinct (unique) rows from an R data frame. In this article, I will explain its syntax and usage, with examples of how to select distinct rows.

This function also removes duplicates from tibbles and from lazy data frames such as those provided by dbplyr or dtplyr. For more examples of dplyr functions, refer to the dplyr tutorial.

1. Syntax of dplyr distinct()

The following is the syntax of the dplyr distinct() function.


# Syntax of distinct()
distinct(.data, ..., .keep_all = FALSE)

In addition, dplyr provides the scoped variants distinct_all(), distinct_at(), and distinct_if(). This article focuses mainly on the syntax above.


# Additional verbs
distinct_all(.tbl, .funs = list(), ..., .keep_all = FALSE)

distinct_at(.tbl, .vars, .funs = list(), ..., .keep_all = FALSE)

distinct_if(.tbl, .predicate, .funs = list(), ..., .keep_all = FALSE)
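Recent dplyr documentation marks these scoped verbs as superseded and suggests passing a column selection to distinct() itself. The following is a minimal sketch of that approach, assuming dplyr 1.1.0 or later (where pick() is available) and the sample data frame used throughout this article; the names by_cols and by_numeric are only for illustration.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Roughly equivalent to distinct_at(df, vars(id, pages))
by_cols <- df %>% distinct(pick(id, pages))
by_cols

# Roughly equivalent to distinct_if(df, is.numeric)
by_numeric <- df %>% distinct(pick(where(is.numeric)))
by_numeric
```

On older dplyr versions that lack pick(), the scoped verbs shown above remain the way to do this.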

1.1 Parameters

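  • .data – A data frame or an extension of one, for example a tibble, or a lazy data frame from dbplyr or dtplyr.
  • ... – Optional variables used to determine uniqueness. If omitted, all columns are used.
  • .keep_all – If TRUE, keep all columns of the input data frame; when several rows share the same values in ..., only the first of them is kept. Defaults to FALSE.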

1.2 Return Type

  • Returns an object of the same type as .data.
  • The rows of the result are a subset of the input rows and appear in the same order.
  • Data frame attributes are preserved.
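The return type can be checked with a short sketch. It assumes the sample data frame used throughout this article; res_df and res_tb are illustrative names, and as_tibble() is the tibble constructor that dplyr re-exports.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# A plain data.frame goes in, a plain data.frame comes out
res_df <- distinct(df)
class(res_df)

# A tibble goes in, a tibble comes out
res_tb <- distinct(as_tibble(df))
class(res_tb)

# The result never has more rows than the input
nrow(res_df) <= nrow(df)
```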

2. distinct() of All Columns

The distinct() function selects unique rows from the input data frame. When no column names are passed as arguments, it determines uniqueness by comparing the values in all columns.


# Load dplyr package
library(dplyr)

# df is the sample data frame created in the Complete Example section
# distinct() usage on all columns
df2 <- df %>% distinct()
df2

Yields the output below. Here, we use the pipe operator %>% from magrittr, which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is evaluated as f(x, y), so the result from the left-hand side is “piped” into the right-hand side.


# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
4 44    22   jsp       15   567
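The pipe behaviour described above can be checked directly. This small sketch assumes the sample data frame used throughout this article; piped and direct are illustrative names.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Piped form: df becomes the first argument of distinct()
piped <- df %>% distinct(id, pages)

# Direct form: df is passed explicitly as the first argument
direct <- distinct(df, id, pages)

# Both calls produce the same result
identical(piped, direct)
```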

3. Distinct Rows of Selected Columns

You can also select distinct combinations of specific columns. Just pass the names of the columns you want to perform distinct on. By default, this returns only the columns you passed, not the full data frame.


# Distinct on selected columns
df2 <- df %>% distinct(id, pages)
df2

Yields the output below.


# Output
  id pages
1 11    32
2 33    33
3 44    22
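Besides bare column names, distinct() also accepts expressions, which are computed first and then used to decide uniqueness. The following is a small sketch, assuming the sample data frame used throughout this article; the column name total and the variable totals are made up for illustration.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Distinct values of a computed column (pages + chapters);
# only the computed column is returned because .keep_all is FALSE
totals <- df %>% distinct(total = pages + chapters)
totals
```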

4. Using .keep_all

With .keep_all = TRUE, distinct() returns all columns of the data frame; the default is FALSE. When several rows share the same combination of values, the first of them is kept. Let’s run the above example with .keep_all = TRUE and check the output.


# Distinct with keep_all (keep all columns)
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2

Yields the output below.


# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
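Because .keep_all = TRUE keeps the first row found for each combination, the row that survives depends on row order (here "java" is kept for id 44, not "jsp"). One way to influence this, sketched below, is to sort with arrange() before calling distinct(). The sketch assumes the sample data frame used throughout this article; first_kept and reordered are illustrative names.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Default row order: the "java" row is the first one with id 44
first_kept <- df %>% distinct(id, pages, .keep_all = TRUE)
first_kept

# Sort by name in descending order first, so "jsp" precedes "java"
# and becomes the row that is kept for id 44
reordered <- df %>%
  arrange(desc(name)) %>%
  distinct(id, pages, .keep_all = TRUE)
reordered
```

Note that sorting changes the row order of the result as well, so the rows of reordered may not appear in the same order as in first_kept.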

4.1 Distinct Rows of Single Column

Finally, you can also perform distinct on a single column. To return all columns, add the .keep_all = TRUE argument.


# Distinct of single column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2

Yields the output below.


# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 33    33     R       11   321
3 44    22  java       15   567
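If only the distinct values of one column are needed, rather than a data frame, the result can be extracted as a vector. The following is a minimal sketch, assuming the sample data frame used throughout this article; ids and n_ids are illustrative names, while pull() and n_distinct() are existing dplyr functions.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))

# Distinct ids as a plain vector instead of a one-column data frame
ids <- df %>% distinct(id) %>% pull(id)
ids    # 11 33 44

# Number of distinct ids
n_ids <- n_distinct(df$id)
n_ids  # 3
```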

5. Complete Example

The following is a complete example of how to use the dplyr distinct() function.


# Create data frame
df <- data.frame(id = c(11, 11, 33, 44, 44),
                 pages = c(32, 32, 33, 22, 22),
                 name = c("spark", "spark", "R", "java", "jsp"),
                 chapters = c(76, 76, 11, 15, 15),
                 price = c(144, 144, 321, 567, 567))
df

# Load library dplyr
library(dplyr)

# Distinct rows
df2 <- df %>% distinct()
df2

# Distinct on selected columns
df2 <- df %>% distinct(id, pages)
df2

# Keep all columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2

# Distinct on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2

6. Conclusion

In this article, you have learned the syntax of the dplyr distinct() function, its arguments and return value, and how to use it through examples.

Naveen Nelamali