The tidyr package, part of the tidyverse, provides functions for reshaping and cleaning data so that it becomes “tidy.” Tidy data is structured so that each variable forms a column, each observation forms a row, and each type of observational unit forms a table. This structure makes data easier to analyze and visualize. Below are the key features and functions of tidyr.
In this article, I will walk through R’s tidyr package with worked examples.
Key Points
Reshaping Data:
- Use pivot_longer() to gather multiple columns into key-value pairs for easier analysis.
- Use pivot_wider() to spread key-value pairs across multiple columns for better readability.
Handling Missing Values:
- Use fill() to propagate non-missing values forward or backward within groups.
- Use drop_na() to remove rows with missing values in specified columns.
- Use replace_na() to replace NA values with a specified value, such as 0 or “Unknown”.
Column Splitting/Combining:
- Use separate() to split a single column into multiple columns based on a delimiter.
- Use unite() to combine multiple columns into one, optionally separating with a custom delimiter.
Tidy Data Principles
- Each variable is a column, each observation is a row, and each type of observational unit forms a table.
- Ensures compatibility with dplyr for transformations and ggplot2 for visualization, making the analysis pipeline smooth.
Why Use tidyr?
Efficiency
- Tidyr’s concise functions like pivot_longer() and drop_na() replace complex manual workflows.
- Saves time when working with messy datasets with repetitive patterns.
Data Preparation
- Tidyr transforms data into a format that’s more conducive to analysis, visualization, or modeling.
- Cleaning becomes intuitive with functions designed for real-world scenarios like missing data and inconsistent columns.
Integration with Tidyverse
- Integrates seamlessly with dplyr for transformations (mutate(), filter()), ggplot2 for creating visualizations, and purrr for applying functions across data.
- Ensures workflows remain consistent and modular, reducing debugging time.
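As a rough illustration of this integration, the sketch below pipes a tidyr reshaping step into dplyr verbs. It assumes dplyr is installed alongside tidyr; the data frame name scores and the choice to drop History before averaging are made up for the example.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one column per subject
scores <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, 88, 85),
  Science = c(85, 92, 90)
)

# Reshape with tidyr, then filter and summarise with dplyr
avg_scores <- scores %>%
  pivot_longer(cols = -Student, names_to = "Subject", values_to = "Score") %>%
  filter(Subject != "History") %>%
  group_by(Student) %>%
  summarise(Average = mean(Score))

print(avg_scores)
```

Because the reshaped data has one row per student and subject, the dplyr steps need no special handling for individual subject columns.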
Introduction to tidyr Package
The tidyr package, developed by Hadley Wickham, provides functions to organize data into tidy format. When combined with dplyr and magrittr, it facilitates building robust data analysis pipelines.
Installing and Loading
To begin using tidyr, install it from CRAN and load it with the following commands:
# Install and load the tidyr package
install.packages("tidyr")
library(tidyr)
Key Functions in tidyr with Examples
Below are the key functions of the tidyr package.
gather()
The gather() function transforms data from a wide to a long format. It collects multiple columns and their values into key-value pairs, creating a new column for the keys and another for the values. Columns that should not be gathered can be excluded by prefixing them with the negative operator (-) within the function. In current tidyr releases, gather() is superseded by pivot_longer(), covered later in this article, but it still works.
# Reshape the data to a long format using gather()
library(tidyr)
# Original data frame
df <- data.frame(
Student = c("Geetha", "Ram", "Sai"),
History = c(89, 81, 78),
Math = c(75, 88, 85),
Science = c(85, 92, 90)
)
print("Original Data:")
print(df)
# Reshape the data frame from wide to long
long_df <- gather(df, key = "Subject", value = "Score", -Student)
print("Transformed Data:")
print(long_df)
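Instead of excluding columns with the negative operator, the columns to gather can also be named directly. The following sketch, which rebuilds the same data frame so that it runs on its own, compares the two styles; the names by_exclusion and by_range are only illustrative.

```r
library(tidyr)

df <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  History = c(89, 81, 78),
  Math = c(75, 88, 85),
  Science = c(85, 92, 90)
)

# Exclude the identifier column ...
by_exclusion <- gather(df, key = "Subject", value = "Score", -Student)

# ... or select the range of columns to gather
by_range <- gather(df, key = "Subject", value = "Score", History:Science)

# Both calls select the same columns, so the results match
print(identical(by_exclusion, by_range))
```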
spread()
The spread() function from the tidyr package transforms data from a long to a wide format by turning key-value pairs into column names and their corresponding values.
# Reshape the data frame from long to wide
wide_df <- spread(long_df, key = Subject, value = Score)
print("Transformed Data:")
print(wide_df)
This returns the original wide layout: one row per student and one column per subject.
separate()
The separate() function from the tidyr package splits a single column into multiple columns based on a specified delimiter. Let’s extend the data frame above with a new column and then split that column into two columns at the delimiter.
# Reshape the data using separate()
library(tidyr)
# Original data frame
df <- data.frame(
Student = c("Geetha", "Ram", "Sai"),
History = c(89, 81, 78),
Math = c(75, 88, 85),
Science = c(85, 92, 90)
)
# Add a column that combines the total and the percentage
df$Total_Percentage <- c("249_83%", "261_87%", "253_84%")
print("Original Data:")
print(df)
# Separate the specified column into two columns
sep_data <- separate(df, col = Total_Percentage, into = c('Total', 'Percentage'), sep = '_')
print("Transformed Data:")
print(sep_data)
Yields below output.
# Output:
[1] "Original Data:"
Student History Math Science Total_Percentage
1 Geetha 89 75 85 249_83%
2 Ram 81 88 92 261_87%
3 Sai 78 85 90 253_84%
[1] "Transformed Data:"
Student History Math Science Total Percentage
1 Geetha 89 75 85 249 83%
2 Ram 81 88 92 261 87%
3 Sai 78 85 90 253 84%
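By default the new columns are character vectors. As a small sketch of one way to get numeric types, the convert argument of separate() asks tidyr to run type conversion on the new columns. The data frame totals is a cut-down, hypothetical version of the one above.

```r
library(tidyr)

totals <- data.frame(
  Student = c("Geetha", "Ram", "Sai"),
  Total_Percentage = c("249_83%", "261_87%", "253_84%")
)

# convert = TRUE runs type conversion on the newly created columns
converted <- separate(
  totals,
  col = Total_Percentage,
  into = c("Total", "Percentage"),
  sep = "_",
  convert = TRUE
)

# Total becomes integer; Percentage stays character because of the "%" sign
str(converted)
```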
unite()
The unite() function combines multiple columns into a single column. Let’s apply unite() to the separated data to merge the Total and Percentage columns back into one column.
# Reshape the data using unite() function
library(tidyr)
uni_data <- unite(sep_data, col=Total_Percentage, c('Total', 'Percentage'), sep='_')
print("Transformed Data:")
print(uni_data)
Yields below output.
# Output:
[1] "Transformed Data:"
Student History Math Science Total_Percentage
1 Geetha 89 75 85 249_83%
2 Ram 81 88 92 261_87%
3 Sai 78 85 90 253_84%
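unite() drops the input columns by default. If they are still needed, the remove argument keeps them, as in this minimal sketch with a hypothetical two-row data frame named parts.

```r
library(tidyr)

parts <- data.frame(
  Student = c("Geetha", "Ram"),
  Total = c("249", "261"),
  Percentage = c("83%", "87%")
)

# remove = FALSE keeps Total and Percentage next to the combined column
kept <- unite(parts, col = "Total_Percentage", Total, Percentage,
              sep = "_", remove = FALSE)
print(kept)
```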
fill()
The fill() function of the tidyr package fills missing values in a data frame with the previous non-missing value. Let’s set the last three scores in our long data frame to missing values (NA) and then apply fill() to replace them with the previous non-missing value.
# Fill NA values using fill()
library(tidyr)
# Set the last three scores to NA
long_df$Score <- c(89, 81, 78, 75, 88, 85, NA, NA, NA)
print("Original Data:")
print(long_df)
# Fill NA value with previous value
fill_na <- fill(long_df, Score)
print("Transformed Data:")
print(fill_na)
Yields below output.
# Output:
[1] "Original Data:"
Student Subject Score
1 Geetha History 89
2 Ram History 81
3 Sai History 78
4 Geetha Math 75
5 Ram Math 88
6 Sai Math 85
7 Geetha Science NA
8 Ram Science NA
9 Sai Science NA
[1] "Transformed Data:"
Student Subject Score
1 Geetha History 89
2 Ram History 81
3 Sai History 78
4 Geetha Math 75
5 Ram Math 88
6 Sai Math 85
7 Geetha Science 85
8 Ram Science 85
9 Sai Science 85
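fill() can also fill upward, and it respects groups created with dplyr, so values are not carried from one student to the next. The sketch below assumes dplyr is installed; the data frame scores and its Term column are invented for the example.

```r
library(tidyr)
library(dplyr)

scores <- data.frame(
  Student = c("Geetha", "Geetha", "Ram", "Ram"),
  Term = c(1, 2, 1, 2),
  Score = c(NA, 75, 88, NA)
)

# Fill upward: each NA takes the next non-missing value below it
filled_up <- fill(scores, Score, .direction = "up")
print(filled_up$Score)       # 75 75 88 NA

# Fill within each student only, first downward and then upward
filled_grouped <- scores %>%
  group_by(Student) %>%
  fill(Score, .direction = "downup") %>%
  ungroup()
print(filled_grouped$Score)  # 75 75 88 88
```

In the ungrouped call the last value stays NA because nothing follows it. In the grouped call Ram's missing score is filled from his own earlier score.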
drop_na()
The drop_na() function takes a data frame, removes every row that contains a missing value, and returns the remaining rows.
# Remove rows using drop_na() function
library(tidyr)
long_df$Score <- c(89, 81, 78, 75, 88, 85, NA, NA, NA)
remove_data <- drop_na(long_df)
print("Transformed Data:")
print(remove_data)
Yields below output.
# Output:
[1] "Transformed Data:"
Student Subject Score
1 Geetha History 89
2 Ram History 81
3 Sai History 78
4 Geetha Math 75
5 Ram Math 88
6 Sai Math 85
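When only some columns matter, drop_na() accepts column names and then checks just those columns. A minimal sketch with a hypothetical data frame named records:

```r
library(tidyr)

records <- data.frame(
  Student = c("Geetha", "Ram", NA),
  Score = c(89, NA, 78)
)

# No columns given: any NA in a row removes that row
all_cols <- drop_na(records)

# Only Score is checked, so the row with a missing Student is kept
score_only <- drop_na(records, Score)

print(nrow(all_cols))    # 1
print(nrow(score_only))  # 2
```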
pivot_longer()
The pivot_longer() function transforms data from wide to long format. It is a more powerful replacement for the gather() function and can handle multiple key-value pairs.
# Transform/reshape the data using pivot_longer()
library(tidyr)
# Keep only the student and subject columns (df gained Total_Percentage in the separate() example)
scores_df <- df[, c("Student", "History", "Math", "Science")]
long_df <- pivot_longer(scores_df, cols = c(History, Math, Science),
                        names_to = 'Subject', values_to = 'Score')
print("Transformed Data:")
print(long_df)
Yields below output.
# Output:
[1] "Transformed Data:"
# A tibble: 9 × 3
Student Subject Score
<chr> <chr> <dbl>
1 Geetha History 89
2 Geetha Math 75
3 Geetha Science 85
4 Ram History 81
5 Ram Math 88
6 Ram Science 92
7 Sai History 78
8 Sai Math 85
9 Sai Science 90
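One case where pivot_longer() goes beyond gather() is column names that encode more than one variable. The sketch below assumes a hypothetical data frame terms whose column names follow a Subject_Term pattern and uses names_sep to split them into two key columns.

```r
library(tidyr)

# Hypothetical data: each column name combines a subject and a term
terms <- data.frame(
  Student = c("Geetha", "Ram"),
  Math_T1 = c(75, 88),
  Math_T2 = c(80, 90),
  Science_T1 = c(85, 92),
  Science_T2 = c(87, 91)
)

# Split each column name at "_" into Subject and Term
long_terms <- pivot_longer(
  terms,
  cols = -Student,
  names_to = c("Subject", "Term"),
  names_sep = "_",
  values_to = "Score"
)
print(long_terms)
```

The result has one row per student, subject, and term, which is eight rows for this input.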
pivot_wider()
You can use the pivot_wider() function to transform data from a long to a wide format. It is more flexible than the spread() function and can handle more complex transformations.
# Transform the data using pivot_wider()
library(tidyr)
wide_df <- pivot_wider(long_df, names_from = Subject, values_from = Score)
print("Transformed Data:")
print(wide_df)
Yields below output.
# Output:
[1] "Transformed Data:"
# A tibble: 3 × 4
Student History Math Science
<chr> <dbl> <dbl> <dbl>
1 Geetha 89 75 85
2 Ram 81 88 92
3 Sai 78 85 90
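When some student and subject combinations are absent from the long data, pivot_wider() leaves NA in the corresponding cells. The values_fill argument supplies a replacement instead. The data frame long_scores below is a small hypothetical input in which Ram has no Math score.

```r
library(tidyr)

long_scores <- data.frame(
  Student = c("Geetha", "Geetha", "Ram"),
  Subject = c("History", "Math", "History"),
  Score = c(89, 75, 81)
)

# Default: the missing combination becomes NA
wide_default <- pivot_wider(long_scores, names_from = Subject, values_from = Score)

# values_fill replaces the missing combination with 0
wide_filled <- pivot_wider(long_scores, names_from = Subject, values_from = Score,
                           values_fill = 0)

print(wide_default)
print(wide_filled)
```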
separate_rows()
The separate_rows() function from the tidyr package splits a row whose cell holds several delimited values into multiple rows. First, we add a row with such a value to the data frame, as shown below.
Then we use separate_rows() to split that row into two rows.
# Separate the rows using separate_rows()
library(tidyr)
# Update the given data frame
remove_data[nrow(remove_data) + 1,] <- c("Geetha", "History, Math", "78")
print("Updated data:")
print(remove_data)
# Separate the delimited values into their own rows
sep_data1 <- remove_data %>%
  separate_rows(Subject, Score)
print("Transformed Data:")
print(sep_data1)
Yields below output.
# Output:
[1] "Updated data:"
Student Subject Score
1 Geetha History 89
2 Ram History 81
3 Sai History 78
4 Geetha Math 75
5 Ram Math 88
6 Sai Math 85
7 Geetha History, Math 78
[1] "Transformed Data:"
# A tibble: 8 × 3
Student Subject Score
<chr> <chr> <chr>
1 Geetha History 89
2 Ram History 81
3 Sai History 78
4 Geetha Math 75
5 Ram Math 88
6 Sai Math 85
7 Geetha History 78
8 Geetha Math 78
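The delimiter can also be given explicitly through the sep argument, which is treated as a regular expression. A minimal sketch, using a hypothetical data frame enrolled that has a single delimited column:

```r
library(tidyr)

enrolled <- data.frame(
  Student = c("Geetha", "Ram"),
  Subjects = c("History, Math", "Science")
)

# Split on a comma followed by optional whitespace
one_per_row <- separate_rows(enrolled, Subjects, sep = ",\\s*")
print(one_per_row)
```

Rows without the delimiter, such as Ram's, pass through unchanged.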
complete()
The complete() function turns implicit missing values into explicit ones by expanding the data frame to include all possible combinations of the specified columns. Missing combinations are filled with NA or with default values you provide.
# Transforming data using complete() function
library(tidyr)
complete_data <- remove_data %>% complete(Subject, Score)
print("Transformed Data:")
print(complete_data)
Yields below output.
# Output:
[1] "Transformed Data:"
# A tibble: 18 × 3
Subject Score Student
<chr> <chr> <chr>
1 History 75 NA
2 History 78 Sai
3 History 81 Ram
4 History 85 NA
5 History 88 NA
6 History 89 Geetha
7 History, Math 75 NA
8 History, Math 78 Geetha
9 History, Math 81 NA
10 History, Math 85 NA
11 History, Math 88 NA
12 History, Math 89 NA
13 Math 75 Geetha
14 Math 78 NA
15 Math 81 NA
16 Math 85 Sai
17 Math 88 Ram
18 Math 89 NA
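The default values mentioned above are passed through the fill argument as a named list. This sketch uses a hypothetical data frame scores in which Ram has no Math row, and fills the missing score with 0.

```r
library(tidyr)

scores <- data.frame(
  Student = c("Geetha", "Geetha", "Ram"),
  Subject = c("History", "Math", "History"),
  Score = c(89, 75, 81)
)

# Add the missing Student/Subject combination and fill its Score with 0
completed <- complete(scores, Student, Subject, fill = list(Score = 0))
print(completed)
```

Completing over Student and Subject, rather than over the value column, is the more common pattern because it exposes which observations are absent.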
replace_na()
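Finally, you can use the replace_na() function of the tidyr package to replace NA (missing) values in a data frame with specified replacement values.
# Replace NA values with a specified value
# Rebuild the long data frame and set three scores to NA for this example
long_df <- gather(df[, c("Student", "History", "Math", "Science")],
                  key = "Subject", value = "Score", -Student)
long_df$Score[c(3, 5, 7)] <- NA
rep_na <- long_df %>%
  replace_na(list(Score = -1))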
print("Transformed Data:")
print(rep_na)
Yields below output.
# Output:
[1] "Transformed Data:"
Student Subject Score
1 Geetha History 89
2 Ram History 81
3 Sai History -1
4 Geetha Math 75
5 Ram Math -1
6 Sai Math 85
7 Geetha Science -1
8 Ram Science 92
9 Sai Science 90
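replace_na() can handle several columns in one call, each with its own replacement, and it also works on a plain vector. A short sketch with a hypothetical data frame named records:

```r
library(tidyr)

records <- data.frame(
  Student = c("Geetha", NA, "Sai"),
  Score = c(89, 81, NA)
)

# One replacement per column, matching each column's type
cleaned <- replace_na(records, list(Student = "Unknown", Score = 0))
print(cleaned)

# On a vector, pass the replacement value directly
vec <- replace_na(c(1, NA, 3), 0)
print(vec)  # 1 0 3
```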
Conclusion
In this article, I have given an overview of the tidyr package in the R programming language. Data transformation is a critical step in data preparation and analysis, and the tidyr package provides an extensive set of tools for this purpose. Functions like gather(), spread(), pivot_longer(), and pivot_wider() simplify reshaping datasets, while utilities like separate(), unite(), and fill() address common data cleaning challenges.
Happy Learning!!