In this R dplyr tutorial, I will introduce the dplyr package, explain its verbs, and show how to use them with examples. All examples in this tutorial are basic, simple, and easy to practice for beginners who are eager to learn R and advance their careers. If you are new to R, I recommend reading the R Programming beginners tutorial with examples first.
dplyr is a package that provides a grammar of data manipulation: a set of commonly used verbs that help data analysts solve the most common data manipulation tasks. All dplyr verbs take a data.frame as input and return a data.frame object.
When working with an R data.frame, most base R syntax uses $ to refer to a column along with the data frame object (df$id), or uses [] notation. This syntax is not easy to read, and the code can become confusing. In contrast, dplyr uses plain English verbs that are easily understood by any programmer or analyst.
1. What is dplyr Package?
What does dplyr stand for?
d stands for data.frame,
plyr can be read as pliers, a tool for manipulating data frames.
1.1 R dplyr Introduction
dplyr provides a grammar of data manipulation through a small set of commonly used verbs. For many operations, using functions from this package instead of base R functions can also result in better performance.
In order to use dplyr verbs, you first have to install the package using install.packages('dplyr') and load it using library(dplyr). It provides the following verbs, and I will explain the most common ones with examples.
|R dplyr verbs||dplyr verb description|
|mutate()||Adds new variables|
|arrange()||Orders the rows|
|rename()||Renames variables|
|slice()||Chooses observations by position (location)|
|distinct()||Returns distinct observations|
|rows_insert()||Inserts rows into a data frame|
|inner_join(), left_join()||Join two data frames|
|group_by() & summarise()||group_by() groups data; summarise() gives a summary of each group|
Alternatively, installing the tidyverse package also installs dplyr.
1.2 Pipe Infix Operator %>%
All verbs in the dplyr package take a data.frame as the first argument. When using dplyr, we mostly use the pipe (infix) operator %>% from magrittr, which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is converted into f(x, y), so the result from the left-hand side is "piped" into the right-hand side. The pipe lets you write multiple operations that read left to right. Most of the examples in this R dplyr tutorial use this operator.
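As a minimal sketch of the rewriting rule described above, the following compares a piped call with the equivalent direct call. The vector x and the choice of round(), sqrt(), and sum() are assumptions made only for illustration.

```r
library(dplyr)  # re-exports the %>% operator from magrittr

# A small numeric vector used only for illustration
x <- c(4, 9, 16)

# x %>% f(y) is evaluated as f(x, y)
piped  <- x %>% round(1)
direct <- round(x, 1)
identical(piped, direct)

# Pipes chain left to right: sqrt() runs first, then sum()
total <- x %>% sqrt() %>% sum()
total
```

Reading the chain left to right mirrors the order in which the operations run, which is the main readability benefit of the pipe.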
2. Install dplyr Package
To install the dplyr package, use the install.packages() function. It takes the name of the package you would like to install as an argument.
# Install just dplyr
install.packages("dplyr")

# Alternatively, install the entire tidyverse,
# which includes dplyr
install.packages("tidyverse")
3. Load dplyr Package
In order to use the verbs from the dplyr package, you first need to load the library using the R library() function. Pass the name of the package you want to load, either as a string or unquoted.
# Load dplyr library
library('dplyr')
4. dplyr Examples
In this section of the R dplyr tutorial, let's create an R DataFrame, run some of the dplyr verbs, and explore the output. If you already have data in a CSV file, you can easily import the CSV file into an R DataFrame. Also, refer to Import Excel File into R.
# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M','F','F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
df
This yields the output below.
# Output
   id    name gender        dob state
r1 10     sai      M 1990-10-02    CA
r2 11     ram      M 1981-03-24    NY
r3 12 deepika      F 1987-06-14  <NA>
r4 13 sahithi      F 1985-08-16  <NA>
r5 14   kumar      M 1995-03-02    DC
r6 15   scott      M 1991-06-21    DW
r7 16     Don      M 1986-03-24    AZ
r8 17     Lin      F 1990-08-26    PH
4.1 dplyr::filter() Examples
By using the dplyr filter() function, you can filter the rows of an R data frame by row name, by column value, by multiple conditions, etc. Here, %>% is an infix operator that acts as a pipe: it passes the left-hand side of the operator as the first argument of the function on the right-hand side.
# Load dplyr library
library('dplyr')

# filter() by row name
df %>% filter(rownames(df) == 'r3')

# filter() by column value
df %>% filter(gender == 'M')

# filter() by list of values
df %>% filter(state %in% c("CA", "AZ", "PH"))

# filter() by multiple conditions
df %>% filter(gender == 'M' & id > 15)
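One detail worth keeping in mind with the filters above is how missing values behave: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped. The sketch below illustrates this on a small, hypothetical data frame named people that mirrors a few columns of the tutorial's data.

```r
library(dplyr)

# Hypothetical sample data, mirroring a few columns of the tutorial's df
people <- data.frame(
  id = c(10, 11, 12, 13),
  gender = c('M', 'M', 'F', 'F'),
  state = c('CA', 'NY', NA, NA)
)

# Rows with a missing state are dropped, because NA != 'CA' evaluates to NA
not_ca <- people %>% filter(state != 'CA')

# Keep the rows with a missing state explicitly by testing with is.na()
not_ca_or_missing <- people %>% filter(state != 'CA' | is.na(state))

# Comma-separated conditions are combined with AND
male_high_id <- people %>% filter(gender == 'M', id > 10)
```

Separating conditions with commas is equivalent to joining them with &, and is often easier to read when there are several conditions.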
4.2 dplyr::select() Examples
The dplyr select() function is used to select columns (variables) from a data frame. It takes the data frame as the first argument, followed by a variable name or a vector of variable names. For more examples, refer to select columns by name and select columns by index position.
# select() single column
df %>% select('id')

# select() multiple columns
df %>% select(c('id','name'))

# select() multiple columns by index position
df %>% select(c(1,2))
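Besides quoted names and index positions, select() also accepts unquoted names, ranges, negation, and helper functions such as starts_with(). The following is a small sketch of these forms on a hypothetical data frame named people; the data is made up for illustration.

```r
library(dplyr)

# Hypothetical sample data used only for illustration
people <- data.frame(
  id = c(10, 11),
  name = c('sai', 'ram'),
  gender = c('M', 'M'),
  state = c('CA', 'NY')
)

# Unquoted column names work the same way as quoted ones
by_name <- people %>% select(id, name)

# Select a range of adjacent columns
by_range <- people %>% select(id:gender)

# Drop a column by negating it
without_state <- people %>% select(-state)

# Select columns whose names start with a given prefix
by_prefix <- people %>% select(starts_with('na'))
```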
4.3 dplyr::slice() Examples
The slice() function is used to select data frame rows by index position; it can also drop rows by index when given negative positions. The following are some other slice verbs provided in the dplyr package.
|slice()||Slices the data.frame by row index|
|slice_head()||Select the first rows|
|slice_tail()||Select the last rows|
|slice_min()||Select rows with the smallest values of a column|
|slice_max()||Select rows with the largest values of a column|
|slice_sample()||Select random rows|
Following are several examples of usage of slice().
# Select rows 2 and 3
df %>% slice(2,3)

# Select rows from list
df %>% slice(c(2,3,5,6))

# Select rows by range
df %>% slice(2:6)

# Drop rows using slice()
df %>% slice(-2,-3,-4,-5,-6)

# Drop by range
df %>% slice(-2:-6)
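The other slice verbs can be sketched in the same way. The example below assumes dplyr 1.0.0 or later and uses a hypothetical data frame named scores; random rows are drawn with slice_sample().

```r
library(dplyr)

# Hypothetical sample data used only for illustration
scores <- data.frame(
  id = c(1, 2, 3, 4, 5),
  score = c(50, 80, 65, 90, 70)
)

# First and last two rows
first_two <- scores %>% slice_head(n = 2)
last_two  <- scores %>% slice_tail(n = 2)

# Row with the smallest score, and the two rows with the largest scores
lowest  <- scores %>% slice_min(score, n = 1)
highest <- scores %>% slice_max(score, n = 2)

# Two random rows; set.seed() makes the draw reproducible
set.seed(1)
random_two <- scores %>% slice_sample(n = 2)
```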
4.4 dplyr::mutate() Examples
Use the mutate() function and its related verbs from the dplyr package to replace/update the values of a column (string, integer, or any type) in an R DataFrame (data.frame).
# str_replace() comes from the stringr package
library(stringr)

# Replace on selected column
df %>% mutate(name = str_replace(name, "sai", "SaiRam"))
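mutate() can also add a new column computed from existing ones, in the same call that updates an existing column. The following is a minimal sketch on a hypothetical data frame named people; the column birth_year is an assumption introduced only for illustration.

```r
library(dplyr)

# Hypothetical sample data used only for illustration
people <- data.frame(
  name = c('sai', 'ram'),
  dob = as.Date(c('1990-10-02', '1981-03-24'))
)

people2 <- people %>%
  mutate(
    birth_year = as.integer(format(dob, '%Y')),  # new column derived from dob
    name = toupper(name)                         # overwrite an existing column
  )
people2
```

Using a name that does not exist yet creates a column, while reusing an existing name replaces that column's values.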
4.5 dplyr::rename() Examples
The rename() function of dplyr is used to change the column names of a data frame. The first example below renames the column from the old name id to the new name c1. Similarly, you can use dplyr to rename multiple columns.
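# Sample data frame for the rename() examples (values are illustrative)
my_dataframe <- data.frame(id = c(11, 22), pages = c(32, 45),
                           name = c("spark", "python"))

# Change the column name - id to c1
my_dataframe %>% rename("c1" = "id")

# Rename multiple columns by name
my_dataframe <- my_dataframe %>%
  rename("c1" = "id", "c2" = "pages", "c3" = "name")

# Rename multiple columns by index
my_dataframe <- my_dataframe %>%
  rename(col1 = 1, col2 = 2)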
4.6 dplyr::distinct() Examples
The distinct() function of dplyr is used to select the unique/distinct rows from the input data frame. When called without any column/variable names as arguments, it returns unique rows by comparing values across all columns.
# Create dataframe
df <- data.frame(id=c(11,11,33,44,44),
                 pages=c(32,32,33,22,22),
                 name=c("spark","spark","R","java","jsp"),
                 chapters=c(76,76,11,15,15),
                 price=c(144,144,321,567,567))
df

# Load library dplyr
library(dplyr)

# Distinct rows
df2 <- df %>% distinct()
df2

# Distinct on selected columns
df2 <- df %>% distinct(id,pages)
df2
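When distinct() is given column names, the result contains only those columns. If the remaining columns are also needed, the .keep_all argument can be used. The sketch below shows the difference on a hypothetical data frame named books, a shortened version of the data above.

```r
library(dplyr)

# Hypothetical sample data, a shortened version of the tutorial's data
books <- data.frame(
  id = c(11, 11, 33, 44, 44),
  name = c('spark', 'spark', 'R', 'java', 'jsp'),
  price = c(144, 144, 321, 567, 567)
)

# distinct(id) alone returns only the id column
ids_only <- books %>% distinct(id)

# .keep_all = TRUE keeps every column, using the first row found for each id
first_per_id <- books %>% distinct(id, .keep_all = TRUE)
```

Note that with .keep_all = TRUE, the row for id 44 is the first one encountered (java), so the jsp row is discarded.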
4.7 dplyr::arrange() Examples
The dplyr arrange() function sorts the rows of an R data frame in ascending or descending order based on column values.
# Create Data Frame
df <- data.frame(id=c(11,22,33,44,55),
                 name=c("spark","python","R","jsp","java"),
                 price=c(144,NA,321,567,567),
                 publish_date=as.Date(
                   c("2007-06-22", "2004-02-13", "2006-05-18",
                     "2010-09-02", "2007-07-20"))
)

# Load dplyr library
library(dplyr)

# Using arrange in ascending order
df2 <- df %>% arrange(price)
df2
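The example above sorts in ascending order. For descending order, a column can be wrapped in desc(). The following sketch uses a hypothetical data frame named books with the same id, name, and price values as above.

```r
library(dplyr)

# Hypothetical sample data, reusing the values from the example above
books <- data.frame(
  id = c(11, 22, 33, 44, 55),
  name = c('spark', 'python', 'R', 'jsp', 'java'),
  price = c(144, NA, 321, 567, 567)
)

# Descending order with desc(); rows with NA price are still placed last
by_price_desc <- books %>% arrange(desc(price))

# Sort by several columns: price descending, then name ascending to break ties
by_price_then_name <- books %>% arrange(desc(price), name)
```

Additional columns passed to arrange() are only used to order rows that tie on the earlier columns.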
4.8 dplyr::group_by() Examples
The group_by() function in R is used to group the rows of a DataFrame by one or more columns so that aggregations can be performed on each group.
# Create Data Frame
df = read.csv('/Users/admin/apps/github/r-examples/resources/emp.csv')
df

# Load dplyr
library(dplyr)

# group_by() on department
grp_tbl <- df %>% group_by(department)
grp_tbl

# summarise on grouped data
agg_tbl <- grp_tbl %>% summarise(sum(salary))
agg_tbl
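Because the example above depends on a local CSV file, here is a self-contained sketch of the same idea. The data frame emp and its values are hypothetical stand-ins for the contents of emp.csv; only the department and salary column names are taken from the code above.

```r
library(dplyr)

# Hypothetical employee data standing in for emp.csv
emp <- data.frame(
  department = c('Finance', 'Finance', 'IT', 'IT', 'IT'),
  salary = c(3000, 4000, 5000, 4500, 5500)
)

# Group by department and compute several named summaries per group
agg_tbl <- emp %>%
  group_by(department) %>%
  summarise(
    total_salary = sum(salary),
    avg_salary = mean(salary),
    employees = n()
  )
agg_tbl
```

Naming the summary columns, as in total_salary = sum(salary), gives the result readable column names instead of the default name derived from the expression.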
In this R dplyr tutorial, you have learned what dplyr is, how to install and load the library in order to use it in R programming, and finally explored different verbs with examples.