Explain tidyr package in R with Examples

The tidyr package is part of the tidyverse and provides functions for reshaping and cleaning data so that it becomes “tidy.” In tidy data, each variable forms a column, each observation forms a row, and each type of observational unit forms a table. This structure makes data easier to analyze and visualize. The key features and functions of tidyr are described below.

In this article, I will walk through R’s tidyr package with worked examples.

Key Points

Reshaping Data:

  • Use pivot_longer() to gather multiple columns into key-value pairs for easier analysis.
  • Use pivot_wider() to spread key-value pairs across multiple columns for better readability.

Handling Missing Values:

  • Use fill() to propagate non-missing values forward or backward within groups.
  • Use drop_na() to remove rows with missing values in specified columns.
  • Use replace_na() to replace NA values with a specified value, such as 0 or “Unknown”.

Column Splitting/Combining:

  • Use separate() to split a single column into multiple columns based on a delimiter.
  • Use unite() to combine multiple columns into one, optionally separating with a custom delimiter.

Tidy Data Principles

  • Each variable is a column, each observation is a row, and each type of observational unit forms a table.
  • Ensures compatibility with dplyr for transformations and ggplot2 for visualization, making the analysis pipeline smooth.
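
To make the tidy data principles concrete, here is a minimal sketch. The data frame untidy and its values are hypothetical and only serve as an illustration: the year is stored in the column names, so it is not yet a variable with its own column.

```r
library(tidyr)

# Hypothetical wide table: the "year" variable lives in the column names
untidy <- data.frame(
  country = c("A", "B"),
  `2020` = c(10, 20),
  `2021` = c(12, 25),
  check.names = FALSE  # keep the numeric-looking column names as they are
)

# After reshaping, each variable (country, year, cases) is a column
# and each country-year observation is a row
tidy <- pivot_longer(untidy, cols = c(`2020`, `2021`),
                     names_to = "year", values_to = "cases")
print(tidy)
```

The reshaped table has one row per country and year, which is the layout that dplyr and ggplot2 functions generally expect.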

Why Use tidyr?

Efficiency

  • tidyr’s concise functions like pivot_longer() and drop_na() replace complex manual workflows.
  • Saves time when working with messy datasets with repetitive patterns.

Data Preparation

  • tidyr transforms data into a format that’s more conducive to analysis, visualization, or modeling.
  • Cleaning becomes intuitive with functions designed for real-world scenarios like missing data and inconsistent columns.

Integration with Tidyverse

  • Integrates seamlessly with dplyr for transformations (mutate(), filter()), ggplot2 for creating visualizations, and purrr for applying functions across data.
  • Ensures workflows remain consistent and modular, reducing debugging time.
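
As a rough illustration of this integration, the sketch below reshapes a small data frame with tidyr and then summarizes it with dplyr in one pipeline. It assumes dplyr is installed; the scores and avg_scores names are just for this example.

```r
library(tidyr)
library(dplyr)

scores <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, 88, 85)
)

# Reshape to long format with tidyr, then summarize per student with dplyr
avg_scores <- scores %>%
  pivot_longer(cols = c(History, Math),
               names_to = "Subject", values_to = "Score") %>%
  group_by(Student) %>%
  summarise(Average = mean(Score))
print(avg_scores)
```

Because the long format has a single Score column, the grouped summary needs no per-subject code.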

Introduction to tidyr Package

The tidyr package, developed by Hadley Wickham, provides functions to organize data into tidy format. When combined with dplyr and magrittr, it facilitates building robust data analysis pipelines.

Installing and Loading

To begin using tidyr, install and load it with the following commands:


# Install and load the tidyr package
install.packages("tidyr")
library(tidyr)

Key Functions in tidyr with Examples

Below are the key functions of the tidyr package.

gather()

The gather() function transforms data from a wide to a long format. It collects multiple columns and their values into key-value pairs, creating one new column for the keys and another for the values. To exclude a column from gathering, prefix its name with the negative operator (-) inside the function call. Note that gather() is superseded in current tidyr releases; it still works, but pivot_longer() (covered later in this article) is recommended for new code.


# Reshape the data to a long format using gather()
library(tidyr)
# Original data frame

df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, 88, 85),
  Science = c(85, 92, 90)
)

print("Original Data:")
print(df)

# Reshape the data frame from wide to long
long_df <- gather(df, key = "Subject", value = "Score", -Student)
print("Transformed Data:")
print(long_df)

spread()

The spread() function transforms data from a long to a wide format. It turns the values of the key column into column names and fills those columns with the corresponding values. Like gather(), spread() is superseded; pivot_wider() is its recommended replacement.


# Reshape the data frame from long to wide
wide_df <- spread(long_df, key = Subject, value = Score)
print("Transformed Data:")
print(wide_df)

The result is a data frame in wide format with one row per student and one column per subject, matching the original df.

separate()

The separate() function splits a single column into multiple columns based on a specified delimiter. Let’s add a new column to the data frame above and use separate() to split it into two columns at the delimiter.


# Reshape the data using separate()
library(tidyr)
# Original data frame

df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, 88, 85),
  Science = c(85, 92, 90)
)
# Update the data frame
df$Total_Percentage = c("249_83%", "261_87%", "253_84%")
print("Original Data:")
print(df)
# Separate the specified column into two columns
sep_data <- separate(df, col=Total_Percentage, into=c('Total', 'Percentage'), sep='_')
print("Transformed Data:")
print(sep_data)

Yields below output.


# Output:
[1] "Original Data:"

> print(df)
  Student History Math Science Total_Percentage
1  Geetha      89   75      85          249_83%
2     Ram      81   88      92          261_87%
3     Sai      78   85      90          253_84%

[1] "Transformed Data:"

  Student History Math Science Total Percentage
1  Geetha      89   75      85   249        83%
2     Ram      81   88      92   261        87%
3     Sai      78   85      90   253        84%
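
In the output above, both new columns are character columns. As a small sketch of one option, the convert argument of separate() asks tidyr to guess a better type for the new columns. The totals data frame below is a trimmed, illustrative copy of the example data.

```r
library(tidyr)

totals <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  Total_Percentage = c("249_83%", "261_87%", "253_84%")
)

# convert = TRUE runs type conversion on the new columns:
# "249" becomes an integer, while "83%" stays a character string
typed <- separate(totals, col = Total_Percentage,
                  into = c("Total", "Percentage"), sep = "_",
                  convert = TRUE)
str(typed)
```

The Percentage column keeps its text type because of the trailing % sign, so it would need separate cleaning before any arithmetic.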

unite()

The unite() function combines multiple columns into a single column. Let’s apply unite() to the data frame from the previous example to merge the Total and Percentage columns back into one.


# Reshape the data using unite() function
library(tidyr)
uni_data <- unite(sep_data, col=Total_Percentage, c('Total', 'Percentage'), sep='_')
print("Transformed Data:")
print(uni_data)

Yields below output.


# Output:
[1] "Transformed Data:"

> print(uni_data)
  Student History Math Science Total_Percentage
1  Geetha      89   75      85          249_83%
2     Ram      81   88      92          261_87%
3     Sai      78   85      90          253_84%
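
By default, unite() drops the columns it combines. If you want to keep them, the remove argument can be set to FALSE, as in this short sketch. The parts data frame is a hypothetical, reduced version of the example data.

```r
library(tidyr)

parts <- data.frame(
  Student = c("Geetha", "Ram"),
  Total = c("249", "261"),
  Percentage = c("83%", "87%")
)

# remove = FALSE keeps Total and Percentage next to the combined column
kept <- unite(parts, col = "Total_Percentage", Total, Percentage,
              sep = "_", remove = FALSE)
print(kept)
```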

fill()

The fill() function fills missing values in a column with the previous non-missing value (the default direction is downward). Let’s set the last rows of the Score column to NA and then apply fill() to replace them with the previous non-missing value.


# Fill NA values using fill()
library(tidyr)

# Update the data frame column with NA values 
long_df$Score = c(89, 81, 78, 75, 88, 85, NA, NA, NA)
print("Original Data:")
print(long_df)

# Fill NA value with previous value
fill_na <- fill(long_df, Score)
print("Transformed Data:")
print(fill_na)

Yields below output.


# Output:
[1] "Original Data:"
  Student Subject Score
1  Geetha History    89
2     Ram History    81
3     Sai History    78
4  Geetha    Math    75
5     Ram    Math    88
6     Sai    Math    85
7  Geetha Science    NA
8     Ram Science    NA
9     Sai Science    NA

[1] "Transformed Data:"
  Student Subject Score
1  Geetha History    89
2     Ram History    81
3     Sai History    78
4  Geetha    Math    75
5     Ram    Math    88
6     Sai    Math    85
7  Geetha Science    85
8     Ram Science    85
9     Sai Science    85
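
fill() can also propagate values backward. The following sketch uses the .direction argument with a small, hypothetical readings data frame so the effect is easy to follow.

```r
library(tidyr)

readings <- data.frame(
  Student = c("Geetha", "Ram", "Sai", "Geetha"),
  Score = c(NA, 81, NA, 75)
)

# .direction = "up" fills each NA with the next non-missing value below it
filled_up <- fill(readings, Score, .direction = "up")
print(filled_up)
```

With the default direction ("down"), the first NA here would stay missing because there is no earlier value to carry forward.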

drop_na()

The drop_na() function takes a data frame and returns it without the rows that contain missing values. By default it checks every column; if you name columns, only those columns are checked.


# Remove rows using drop_na() function 
library(tidyr)
long_df$Score = c(89, 81, 78, 75, 88, 85, NA, NA, NA)
remove_data = drop_na(long_df)
print("Transformed Data:")
print(remove_data)

Yields below output.


# Output:
[1] "Transformed Data:"

> print(remove_data)
  Student Subject Score
1  Geetha History    89
2     Ram History    81
3     Sai History    78
4  Geetha    Math    75
5     Ram    Math    88
6     Sai    Math    85
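
To restrict the check to specific columns, pass their names after the data frame. This is a minimal sketch with a hypothetical records data frame that has missing values in two different columns.

```r
library(tidyr)

records <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  Score = c(89, NA, 78),
  Remark = c(NA, "late", "ok")
)

# Only rows with NA in Score are dropped; the NA in Remark is ignored
kept_rows <- drop_na(records, Score)
print(kept_rows)
```

Calling drop_na(records) without column names would also remove the first row, because its Remark is missing.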

pivot_longer()

The pivot_longer() function transforms data from wide to long format. It is a more powerful replacement for gather() that can also handle multiple key-value pairs.


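# Transform/reshape the data using pivot_longer()
library(tidyr)
# Keep only the student and subject columns of df before reshaping
scores_df <- df[, c("Student", "History", "Math", "Science")]
long_df <- pivot_longer(scores_df, cols = c(History, Math, Science),
                     names_to = 'Subject', values_to = 'Score')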

print("Transformed Data:")
print(long_df)

Yields below output.


# Output:
[1] "Transformed Data:"

> print(long_df)
# A tibble: 9 × 3
  Student Subject Score
  <chr>   <chr>   <dbl>
1 Geetha  History    89
2 Geetha  Math       75
3 Geetha  Science    85
4 Ram     History    81
5 Ram     Math       88
6 Ram     Science    92
7 Sai     History    78
8 Sai     Math       85
9 Sai     Science    90
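
One thing pivot_longer() can do that a plain gather() call cannot is split column names into several key columns in a single step. The sketch below assumes a hypothetical terms data frame whose column names combine a subject and a term, separated by an underscore.

```r
library(tidyr)

terms <- data.frame(
  Student = c("Geetha", "Ram"),
  Math_T1 = c(75, 88),
  Math_T2 = c(80, 90),
  Science_T1 = c(85, 92),
  Science_T2 = c(87, 91)
)

# names_sep splits each column name at "_" into Subject and Term
long_terms <- pivot_longer(terms, cols = -Student,
                           names_to = c("Subject", "Term"),
                           names_sep = "_",
                           values_to = "Score")
print(long_terms)
```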

pivot_wider()

You can use the pivot_wider() function to transform data from a long to a wide format. It is more flexible than spread() and can handle more complex transformations.


# Transform the data using pivot_wider()
library(tidyr)
wide_df <- pivot_wider(long_df, names_from = Subject, values_from = Score)
print("Transformed Data:")
print(wide_df)

Yields below output.


# Output:
[1] "Transformed Data:"

> print(wide_df)
# A tibble: 3 × 4
  Student History  Math Science
  <chr>     <dbl> <dbl>   <dbl>
1 Geetha       89    75      85
2 Ram          81    88      92
3 Sai          78    85      90
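
When some combinations are absent from the long data, pivot_wider() leaves NA in the corresponding cells. As a sketch of one way to handle this, the values_fill argument supplies a default instead. The partial data frame is hypothetical and deliberately lacks a Math score for Ram.

```r
library(tidyr)

partial <- data.frame(
  Student = c("Geetha", "Geetha", "Ram"),
  Subject = c("History", "Math", "History"),
  Score = c(89, 75, 81)
)

# Cells with no matching row are filled with 0 instead of NA
wide_partial <- pivot_wider(partial, names_from = Subject,
                            values_from = Score,
                            values_fill = list(Score = 0))
print(wide_partial)
```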

separate_rows()

The separate_rows() function splits a cell that holds several delimited values into one row per value. First, we add a row whose Subject cell contains two values, as shown below.

Then we use separate_rows() to split that row into two rows.


# Separate the rows using separate_rows()
library(tidyr)
# Update the given data frame
remove_data[nrow(remove_data) + 1,] <- c("Geetha", "History, Math", "78")
print("Updated data:")
print(remove_data)

# Separate the row that holds multiple values into one row per value
sep_data1 <- remove_data %>%
  separate_rows(Subject, Score)
print("Transformed Data:")
print(sep_data1)

Yields below output.


# Output:
[1] "Updated data:"
  Student       Subject Score
1  Geetha       History    89
2     Ram       History    81
3     Sai       History    78
4  Geetha          Math    75
5     Ram          Math    88
6     Sai          Math    85
7  Geetha History, Math    78

[1] "Transformed Data:"
# A tibble: 8 × 3
  Student Subject Score
  <chr>   <chr>   <chr>
1 Geetha  History 89   
2 Ram     History 81   
3 Sai     History 78   
4 Geetha  Math    75   
5 Ram     Math    88   
6 Sai     Math    85   
7 Geetha  History 78   
8 Geetha  Math    78 
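
The example above relies on the default separator pattern. A smaller sketch with an explicit sep argument may make the behavior easier to see; the enrolled data frame is hypothetical.

```r
library(tidyr)

enrolled <- data.frame(
  Student = c("Geetha", "Ram"),
  Subject = c("History, Math", "Science")
)

# Each comma-separated subject gets its own row; Student is repeated
one_per_row <- separate_rows(enrolled, Subject, sep = ", ")
print(one_per_row)
```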

complete()

The complete() function converts implicit missing values into explicit ones by expanding the data frame to include every combination of the specified columns. Rows for combinations that were absent get NA in the remaining columns, or default values supplied through the fill argument.


# Transforming data using complete() function
library(tidyr)
complete_data <- remove_data %>% complete(Subject, Score)
print("Transformed Data:")
print(complete_data)

Yields below output.


# Output:
[1] "Transformed Data:"

> print(complete_data)
# A tibble: 18 × 3
   Subject       Score Student
   <chr>         <chr> <chr>  
 1 History       75    NA     
 2 History       78    Sai    
 3 History       81    Ram    
 4 History       85    NA     
 5 History       88    NA     
 6 History       89    Geetha 
 7 History, Math 75    NA     
 8 History, Math 78    Geetha 
 9 History, Math 81    NA     
10 History, Math 85    NA     
11 History, Math 88    NA     
12 History, Math 89    NA     
13 Math          75    Geetha 
14 Math          78    NA     
15 Math          81    NA     
16 Math          85    Sai    
17 Math          88    Ram    
18 Math          89    NA     
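
The default values mentioned above are passed through the fill argument. Here is a minimal sketch, assuming a hypothetical attendance data frame in which Ram has no Math row.

```r
library(tidyr)

attendance <- data.frame(
  Student = c("Geetha", "Geetha", "Ram"),
  Subject = c("History", "Math", "History"),
  Score = c(89, 75, 81)
)

# Adds the missing Ram/Math combination and sets its Score to 0
full <- complete(attendance, Student, Subject, fill = list(Score = 0))
print(full)
```

Completing on Student and Subject, rather than on the value column, is usually closer to what you want: the identifier columns define the combinations, and the value column receives the default.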

replace_na()

Finally, you can use the replace_na() function to replace NA (missing) values in a data frame with specified replacement values. The example below uses a copy of the long data in which three scores are missing.


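# Replace NA values with a specified value
# Long data with three scores set to NA (matches the output below)
long_df <- data.frame(
  Student = rep(c("Geetha", "Ram", "Sai"), times = 3),
  Subject = rep(c("History", "Math", "Science"), each = 3),
  Score = c(89, 81, NA, 75, NA, 85, NA, 92, 90))
rep_na <- long_df %>%
  replace_na(list(Score = -1))
print("Transformed Data:")
print(rep_na)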

Yields below output.


# Output:
[1] "Transformed Data:"
  Student Subject Score
1  Geetha History    89
2     Ram History    81
3     Sai History    -1
4  Geetha    Math    75
5     Ram    Math    -1
6     Sai    Math    85
7  Geetha Science    -1
8     Ram Science    92
9     Sai Science    90
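
replace_na() also accepts a different replacement per column, and it works on plain vectors. The sketch below uses a hypothetical survey data frame with one numeric and one text column.

```r
library(tidyr)

survey <- data.frame(
  Score = c(89, NA),
  Grade = c(NA, "B"),
  stringsAsFactors = FALSE
)

# Each column gets its own replacement, which must match the column type
cleaned <- replace_na(survey, list(Score = 0, Grade = "Unknown"))
print(cleaned)

# On a vector, pass a single replacement value
vec <- replace_na(c(1, NA, 3), 0)
print(vec)
```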

Conclusion

In this article, I have given an overview of the tidyr package in the R programming language. Data transformation is a critical step in data preparation and analysis, and the tidyr package provides an extensive set of tools for this purpose. Functions like gather(), spread(), pivot_longer(), and pivot_wider() simplify reshaping datasets, while utilities like separate(), unite(), and fill() address common data cleaning challenges.

Happy Learning!!