You are currently viewing Mastering Data Transformation in R with dplyr and tidyr

Data manipulation is a cornerstone of effective data analysis, whether you’re cleaning datasets, transforming data, or preparing it for complex statistical models. In R, the dplyr and tidyr packages (part of the tidyverse) offer an intuitive and powerful suite of tools for wrangling data. In this article, I will explain you through essential functions like arrange(), select(), filter(), mutate(), group_by(), summarise(), spread(), and gather() with practical examples.

Advertisements

Why Use R for Data Transformation?

  • Open-Source Advantage: R is free and supported by the R Foundation for Statistical Computing.
  • User-Friendly Environment: With integrated development tools like RStudio, R becomes a beginner-friendly yet powerful tool for data manipulation.
  • Tidyverse Packages: Tools like dplyr and tidyr make R particularly effective for data science tasks.

Let’s explore how to use these functions to transform data effectively in R.

Installing and Loading R tidyverse

Let’s install and load the tidyverse package,


# Install tidyverse package
install.packages("tidyverse")

# Load tidyverse
library(tidyverse)

Data Transformation with R dplyr Functions

Arrange Rows with arrange()

The arrange() function is used to arrange rows of a data frame by single or multiple columns. This function takes a specified column as an argument and, by default, arranges the row values of that column in ascending order. When a single column is provided, it sorts the rows by that column, with other columns arranged accordingly. For multiple columns, each subsequent column determines the order for rows with identical values in the preceding columns.


# Load dplyr library
library(dplyr)

# Create data frame
df <- data.frame(
  id = c(10, NA, -1, 13), 
  name = c("Sai", "Ram", "Geetha", "Jhon"), 
  dob = c("1990-05-15", "1990-05-16", "1990-05-17", "1990-05-18"), 
  gender = c("m", "m", "f", "m")
)
print("Create data frame:")
df

# Arrange a specified column in ascending order
transformed_df <- df %>% arrange(name)
print("After transforming the data:")
print(transformed_df)

Yields below output.

transformations in r

Alternatively, you can use the arrange() function with the desc() function to sort rows in descending order by one or multiple columns.


# Arrange a specified column in descending order
# Load dplyr library
library(dplyr)
transformed_df <- df %>% arrange(desc(name))
print("After transforming the data:")
print(transformed_df)

Yields below output.

transformations in r

Select Columns with select()

The select() function from the dplyr package is used to select specific columns or variables from a data frame by their name or index. Simply provide the names of the columns you want to extract as arguments to this function to obtain a data frame containing only the specified columns.


# Select specified columns
# Load dplyr library
library(dplyr) 
transformed_df <- df %>% select(name, gender)
print("After transforming the data:")
print(transformed_df)

# Output:
# [1] "After transforming the data:"
#     name gender
# 1    Sai      m
# 2    Ram      m
# 3 Geetha      f
# 4   Jhon      m

You can also use this function to get the range of columns from the data frame. In this example, I will extract the variables based on their indexes.


# Select a specified range of columns 
# Load dplyr library
library(dplyr)
transformed_df <- df %>% select(1:4)
print("After transforming the data:")
print(transformed_df)

# Output:
# [1] "After transforming the data:"
#   id   name        dob
# 1 10    Sai 1990-05-15
# 2 NA    Ram 1990-05-16
# 3 -1 Geetha 1990-05-17
# 4 13   Jhon 1990-05-18

Filter Rows with filter()

You can use the R dplyr filter() function to filter the rows of a data frame based on column values, row names or indexes, or a certain condition. In the following example, we will use the filter() function to subset the data frame by column values and the mutate() function to add a new column with updated values.


# Filter rows where gender is 'm' and 
# create a new column with transformed ID values
# Load dplyr library
library(dplyr)
transformed_data <- df %>%
  filter(gender == 'f') %>%
  mutate(id = ifelse(id < 0, NA, id))
print("After transforming the data:")
print(transformed_data)

# Output:
# [1] "After transforming the data:"
#   id   name        dob gender
# 1 NA Geetha 1990-05-17      f

 Add or Modify Columns with mutate()

The mutate() function from the dplyr package is used to add new columns or modify existing columns in a data frame and is a key tool for data transformation. Let’s use the mutate() function to add a new column to an existing data frame.


# Add a column indicating age
transformed_df <- df %>%
  mutate(age = 2024 - as.numeric(substr(dob, 1, 4)))
print("After transforming the data:")
print(transformed_df)

# Output:
   id   name        dob gender age
1  10    Sai 1990-05-15      m  34
2  NA    Ram 1990-05-16      m  34
3  -1 Geetha 1990-05-17      f  34
4  13   Jhon 1990-05-18      m  34

Group and Summarise Data

The group_by() function from the dplyr package in R is used to group rows in a data frame based on column values, while summarise() computes summary statistics for the grouped data. Let’s use these functions together to perform transformations on grouped data.


# Group by gender and calculate the average ID
# Load dplyr library
library(dplyr)
summary_data <- df %>%
  group_by(gender) %>%
  summarise(avg_id = mean(id, na.rm = TRUE))
print("Grouped summary data:")
print(summary_data)


# Output:
# # A tibble: 2 × 2
#   gender avg_id
#   <chr>   <dbl>
# 1 f        -1  
# 2 m        11.5

Data Transformations with R tidyr Functions

Reshape to Wide Format with spread()

The R spread() function from the tidyr package is used to reshape the data in order to make the analysis easier. This process is called tidying your data.


# Reshape the data using spread()
# Load tidyr package
library(tidyr)

# Transform the data to a long format 
long_df <- data.frame(
  id = c(10, 10, -1, -1),
  key = c("dob", "gender", "dob", "gender"),
  value = c("1990-05-15", "m", "1990-05-17", "f")
)

print("Long Format Data:")
print(long_df)

# transform to wide format
wide_df <- long_df %>% spread(key = key, value = value)

print("Wide Format Data:")
print(wide_df)

# Output:
# [1] "Long Format Data:"
#   id    key      value
# 1 10    dob 1990-05-15
# 2 10 gender          m
# 3 -1    dob 1990-05-17
# 4 -1 gender          f

# [1] "Wide Format Data:"
#   id        dob gender
# 1 -1 1990-05-17      f
# 2 10 1990-05-15      m

Reshape to Long Format with gather()

The gather() function from the tidyr package is used to reshape data from wide format to long format. Here’s how you can use this function to transform the given data frame.


# Reshape data using gather()
# Load tidyr 
library(tidyr)
df_long <- gather(df, key = "attribute", value = "value", id:gender)

print("Reshaped data frame:")
print(df_long)

# Output:
#    attribute      value
# 1         id         10
# 2         id         NA
# 3         id         -1
# 4         id         13
# 5       name        Sai
# 6       name        Ram
# 7       name      Geetha
# 8       name       Jhon
# 9        dob 1990-05-15
# 10       dob 1990-05-16
# 11       dob 1990-05-17
# 12       dob 1990-05-18
# 13    gender          m
# 14    gender          m
# 15    gender          f
# 16    gender          m

Conclusion

In this article, I have explained using the dplyr and tidyr packages how to clean and transform datasets effectively in R. Mastering functions like arrange(), select(), filter(), mutate(), and group_by() makes it easier to perform complex data manipulations. By combining these tools, you can unlock deeper insights and streamline your data analysis workflow.

Happy Learning!!