In this R data frame tutorial, you will learn what a data frame is, its features and advantages, the related packages, and how to use data frames in practice through sample examples.
All examples provided in this R data frame tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn R data frames and advance their careers.
1. What is Data Frame in R?
A data frame in R represents data in rows and columns, similar to a pandas DataFrame or a SQL table. Each column in the data frame is a vector, and all columns must have the same length.
A data frame in R stores data in the form of rows and columns, similar to RDBMS tables. It is therefore a two-dimensional data structure in which one dimension refers to the rows and the other to the columns. I will cover more on the data frame in the following sections.
In an R data frame, columns are referred to as variables and rows are referred to as observations. If you are new to R Programming, I would highly recommend reading the R Programming Tutorial where I have explained R concepts with examples.
R also has the third-party package dplyr, which provides a grammar for data manipulation that works closely with data.frame. To use it, you first need to install the package in R.
2. Create a DataFrame in R using data.frame()
The first step in exploring data frames is creating one. The data.frame() function is the simplest way to do this. A data frame is a list of variables that all have the same number of rows, with unique row names. Besides this function, there are other ways to create a data frame in R.
2.1 Syntax of data.frame()
The following is the syntax of the data.frame() function.
# data.frame() syntax
# (since R 4.0.0 the stringsAsFactors default evaluates to FALSE)
data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
           fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
You need to follow the below guidelines when creating a DataFrame in R using data.frame() function.
- The input objects passed to data.frame() should have the same number of rows.
- The column names should be non-empty.
- Duplicate column names are allowed, but you need to use check.names = FALSE.
- You can assign names to rows using the row.names parameter.
- Character variables passed to data.frame() are converted to factor columns in R versions before 4.0.0. Since R 4.0.0 they stay as character columns unless you pass stringsAsFactors = TRUE.
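The guidelines above can be checked with a small sketch. The vectors and the variable names ok, bad, fixed, and dup are made up for illustration; tryCatch() is used only so the failing call does not stop the script.

```r
# Two vectors of equal length: data.frame() succeeds
ok <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))

# Lengths 3 and 2 cannot be recycled to a common number of rows,
# so data.frame() raises an error; capture its message instead of stopping
bad <- tryCatch(
  data.frame(id = c(1, 2, 3), name = c("a", "b")),
  error = function(e) conditionMessage(e)
)
print(bad)

# Duplicate column names are made unique by default ...
fixed <- data.frame(x = 1:2, x = 3:4)
names(fixed)  # "x" "x.1"

# ... and kept as-is with check.names = FALSE
dup <- data.frame(x = 1:2, x = 3:4, check.names = FALSE)
names(dup)    # "x" "x"
```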
2.2 Create R DataFrame Example
Now, let’s create a DataFrame by using the data.frame() function. This function takes lists or vectors as its arguments. In R, a vector contains elements of the same type, and the data types can be logical, integer, double, character, complex, or raw. You can create a vector using c().
# Create Vectors
id <- c(10,11,12,13)
name <- c('sai','ram','deepika','sahithi')
dob <- as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))

# Create DataFrame
df <- data.frame(id,name,dob)

# Print DataFrame
df
In the above example, I have passed the following vectors, separated by commas, as arguments to the data.frame() function to create a DataFrame.
id – Numeric vector which stores the numeric values.
name – Character vector which stores the character values.
dob – Date vector which stores the date values.
The above example yields the below output. R creates a data frame whose column names (variables) are the same as the names of the vectors. You can also use print(df) to print the DataFrame to the console.
# Output
  id    name        dob
1 10     sai 1990-10-02
2 11     ram 1981-03-24
3 12 deepika 1987-06-14
4 13 sahithi 1985-08-16
Notice that R by default adds an incremental sequence number to each row of the DataFrame.
Alternatively, you can create a data frame by passing the vectors directly to the function. Both approaches create the same DataFrame.
# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))
)

# Print DataFrame
df
3. Check the DataFrame Data types
Let’s check the data types of the created DataFrame by using print(sapply(df, class)). Note that I have not specified the data types of the columns while creating it, so R automatically infers each data type from the data. The output below comes from an R version before 4.0.0; newer versions report "character" for the name column.
# Display datatypes
print(sapply(df, class))

# Output
#        id      name       dob
# "numeric"  "factor"    "Date"
You can also use str(df) to check the data types.
# Display datatypes
str(df)

# Output
'data.frame': 4 obs. of 3 variables:
 $ id  : num 10 11 12 13
 $ name: Factor w/ 4 levels "deepika","ram",..: 4 2 1 3
 $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"
4. Using stringsAsFactors Param for Character Data Types
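If you notice above, the name column holds characters but its data type is factor. In R versions before 4.0.0, data.frame() creates factor columns for character data by default.
You can change this behavior by passing the additional param stringsAsFactors=FALSE while creating a DataFrame. Note that R is case-sensitive, so the value must be written as FALSE. Since R 4.0.0 this is already the default.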
# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  stringsAsFactors=FALSE
)

# Print DataFrame
str(df)

# Output
'data.frame': 4 obs. of 3 variables:
 $ id  : num 10 11 12 13
 $ name: chr "sai" "ram" "deepika" "sahithi"
 $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" ...
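Because the default differs between R versions, it can help to set stringsAsFactors explicitly. The following is a minimal sketch; the data frame names df_factor and df_char are made up for illustration.

```r
name <- c("sai", "ram", "deepika", "sahithi")

# Explicitly request factor or character columns, independent of the R version
df_factor <- data.frame(name = name, stringsAsFactors = TRUE)
df_char   <- data.frame(name = name, stringsAsFactors = FALSE)

class_factor <- class(df_factor$name)  # "factor"
class_char   <- class(df_char$name)    # "character"

# Convert an existing factor column back to character
df_factor$name <- as.character(df_factor$name)
class_converted <- class(df_factor$name)  # "character"
```

Setting the parameter explicitly keeps a script's behavior the same on older and newer R installations.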
5. Assign Row Names to DataFrame
You can assign custom names to the rows of an R DataFrame while creating it. Use the row.names param and assign a vector with the row names. Note that the length of the vector you pass to row.names must exactly match the number of rows in the DataFrame.
# Create DataFrame with Row Names
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  row.names = c('row1','row2','row3','row4')
)
df
Yields below output.
# Output
     id    name        dob
row1 10     sai 1990-10-02
row2 11     ram 1981-03-24
row3 12 deepika 1987-06-14
row4 13 sahithi 1985-08-16
If you already have a DataFrame, you can use the below approach to assign or change the row names.
# Assign row names to existing DataFrame
row.names(df) <- c('row1','row2','row3','row4')
df
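Once row names are assigned, they can be used for lookups and can be reset again. This is a small sketch on a made-up two-row data frame; the variables picked and reset_names exist only for illustration.

```r
df <- data.frame(id = c(10, 11), name = c("sai", "ram"),
                 stringsAsFactors = FALSE)

# Assign row names, then look up a single cell by row name and column name
row.names(df) <- c("row1", "row2")
picked <- df["row2", "name"]   # "ram"

# Setting row names to NULL restores the default sequence numbers
row.names(df) <- NULL
reset_names <- row.names(df)   # "1" "2"
```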
6. Select Rows and Columns
By using R base bracket notation, we can select rows/observations by column value, by index, by name, by condition, etc. You can also use the R base function subset() to get the same results. Besides these, the dplyr package provides the filter() function to get rows from a DataFrame.
# Select row by index
df[3,]

# Select rows by a vector of index values
# (indices must exist; the example DataFrame has 4 rows)
df[c(1,3,4),]

# Select rows by index range
df[2:4,]

# Select row by name
df['row3',]

# Select rows by a vector of names
df[c('row1','row3'),]

# Using subset
subset(df, name %in% c("sai", "ram"))

# Load dplyr
library('dplyr')

# Using dplyr::filter
filter(df, name %in% c("sai", "ram"))
Similarly, you can also select columns or variables in R. Additionally, you can use the dplyr select() function or the dollar ($) operator to select columns.
# R base - Select column by name
df[,"name"]

# R base - Select columns from a vector of names
# (the columns must exist in the DataFrame)
df[,c("name","dob")]

# R base - Select columns by index position
df[,c(2,3)]

# Load dplyr
library('dplyr')

# dplyr - Select columns by list of index or position
df %>% select(c(2,3))

# Select columns by index range
df %>% select(2:3)
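The dollar operator mentioned above is not shown in the examples, so here is a minimal base R sketch on a made-up data frame. It also shows the difference between getting a column back as a vector and as a one-column data frame.

```r
df <- data.frame(id = c(10, 11, 12),
                 name = c("sai", "ram", "deepika"),
                 stringsAsFactors = FALSE)

# $ and [[ ]] both return the column as a plain vector
by_dollar  <- df$name
by_bracket <- df[["name"]]

# Single brackets with one name return a one-column data frame
as_frame <- df["name"]

# df[, "name"] drops to a vector unless drop = FALSE is given
kept_frame <- df[, "name", drop = FALSE]
```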
7. Rename Column Names
To rename a column in R, use either R base functions such as names() and colnames(), or third-party packages like dplyr or data.table.
# Change second column to c2
colnames(df)[2] <- "c2"

# Change the column name by name
colnames(df)[colnames(df) == "id"] <- "c1"
You can also use the dplyr rename() function to rename columns. Its syntax is new_name = old_name.
# Load dplyr
library(dplyr)

# Change the column name - c1 to id
df <- df %>% rename("id" = "c1")

# Rename multiple columns by name
# (an alternative to the statement above; it expects columns c1 and c2 to exist)
df <- df %>% rename("id" = "c1", "name" = "c2")

# Rename multiple columns by index
df <- df %>% rename(col1 = 1, col2 = 2)
8. Update Values
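Cleaning the data is usually the first step of data processing, and as part of cleaning you often need to replace column values with other values.
# Replace a string with another string on a single column
# (assumes name is a character column, not a factor)
df$name[df$name == 'ram'] <- 'ram krishna'
df

# Replace on all columns
# (comparing a Date column such as dob with a string can raise an error,
#  so use this on data frames whose columns can all be compared with strings)
df[df=="ram"] <- "ram krishna"
df

# Replace a substring with another string (first match in each value)
library(stringr)
df$name <- str_replace(df$name, "r", "R")
print(df)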
9. Drop Rows and Columns
10. Handling Missing Values
11. Join Data Frames
The base function merge() is used to join data frames in R; it supports inner, left, right, outer, and cross joins. The dplyr and tidyverse packages both support all these basic joins and, additionally, anti joins and semi joins.
# The examples below assume emp_df and dept_df already exist
# Inner join
df2 <- merge(x=emp_df,y=dept_df, by="dept_id")

# Inner join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, by=c("dept_id","dept_branch_id"))

# Inner join on different columns
df2 <- merge(x=emp_df,y=dept_df,
             by.x=c("dept_id","dept_branch_id"),
             by.y=c("dept_id","dept_branch_id"))

# Load dplyr package
library(dplyr)

# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join(dept_df, by=c('dept_id','dept_branch_id'))

# Using dplyr - inner join on different columns
df2 <- emp_df %>% inner_join(dept_df,
                             by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# Inner Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2
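The examples above only show inner joins. As a sketch of the other join types that merge() supports, the code below builds two small made-up data frames (the values and the emp_id and dept_name columns are assumptions for illustration) and uses the all.x, all.y, all, and by arguments of merge().

```r
# Hypothetical sample data: one employee has a department that is not in dept_df,
# and one department has no employees
emp_df  <- data.frame(emp_id = c(1, 2, 3), dept_id = c(10, 20, 40))
dept_df <- data.frame(dept_id = c(10, 20, 30),
                      dept_name = c("Finance", "Marketing", "Sales"),
                      stringsAsFactors = FALSE)

# Inner join: only dept_id values present in both (10 and 20)
inner <- merge(emp_df, dept_df, by = "dept_id")

# Left join: keep every row of emp_df
left <- merge(emp_df, dept_df, by = "dept_id", all.x = TRUE)

# Right join: keep every row of dept_df
right <- merge(emp_df, dept_df, by = "dept_id", all.y = TRUE)

# Full outer join: keep rows from both sides
outer <- merge(emp_df, dept_df, by = "dept_id", all = TRUE)

# Cross join: by = NULL gives the Cartesian product (3 x 3 rows)
cross <- merge(emp_df, dept_df, by = NULL)
```

Unmatched cells are filled with NA, for example dept_name for the employee in department 40 in the left join.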
12. Sorting & Ordering DataFrame
By using the order() function, you can sort data.frame rows by column value in either ascending or descending order. By default, this function puts all NA values last, and it provides an option to put them first.
# Create Data Frame
df=data.frame(id=c(11,22,33,44,55),
              name=c("spark","python","R","jsp","java"),
              price=c(144,NA,321,567,567),
              publish_date= as.Date(
                c("2007-06-22", "2004-02-13", "2006-05-18", "2010-09-02","2007-07-20"))
)

# Sort Data Frame
df2 <- df[order(df$price),]

# Sort by multiple columns
df2 <- df[order(df$price,df$name ),]

# Sort descending order
df2 <- df[order(df$price,decreasing=TRUE),]

# Sort descending and put NA values first
df2 <- df[order(df$price,decreasing=TRUE, na.last=FALSE),]
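To see what order() actually returns and how na.last changes it, here is a short sketch on a made-up four-row data frame without tied prices. order() returns row positions, which is why it is used inside the brackets.

```r
df <- data.frame(name = c("spark", "python", "R", "jsp"),
                 price = c(144, NA, 321, 567),
                 stringsAsFactors = FALSE)

# Ascending: NA goes last by default
asc <- order(df$price)                        # 1 3 4 2

# Descending: NA still goes last
desc <- order(df$price, decreasing = TRUE)    # 4 3 1 2

# na.last = FALSE puts NA first
na_first <- order(df$price, na.last = FALSE)  # 2 1 3 4

# na.last = NA drops the NA rows from the result
na_drop <- order(df$price, na.last = NA)      # 1 3 4

# Use the positions to reorder the rows
sorted_df <- df[asc, ]
```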
13. Import CSV File into Data Frame
If you have a CSV file with columns separated by a delimiter like a comma or pipe, you can easily import it into an R DataFrame by using the read.csv() function. This function reads the CSV file and converts it into a DataFrame.
Let’s read the CSV file and create a DataFrame. Note that read.csv() by default assumes a comma-delimited file; use the sep argument for other delimiters.
# Create DataFrame from CSV file
df = read.csv('/Users/admin/file.csv')
df

# Check the Datatypes
str(df)
This yields a DataFrame similar to the one above, but certain columns are read as characters. For example, the dob column is assigned the character type. I will cover how to change the data type in a separate article.
# Output
'data.frame': 4 obs. of 3 variables:
 $ id  : int 10 11 12 13
 $ name: chr "sai" "ram" "deepika" "sahithi"
 $ dob : chr "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"
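As a brief sketch of how the character dob column could be turned into dates, the code below reads inline CSV text through the text argument of read.csv() instead of a file, so it runs without any file on disk. The sample rows and variable names are made up for illustration.

```r
# Inline CSV content standing in for a file on disk
csv_text <- paste("id,name,dob",
                  "10,sai,1990-10-02",
                  "11,ram,1981-03-24",
                  sep = "\n")

# Option 1: read first, then convert the column with as.Date()
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
df$dob <- as.Date(df$dob)

# Option 2: declare the column types while reading
df2 <- read.csv(text = csv_text,
                colClasses = c("integer", "character", "Date"))

# Pipe-delimited input needs the sep argument
pipe_text <- paste("id|name", "10|sai", "11|ram", sep = "\n")
df3 <- read.csv(text = pipe_text, sep = "|", stringsAsFactors = FALSE)
```

Both options rely on the dates being in the year-month-day form shown above; other formats would need an explicit format argument to as.Date().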
14. Other Data Frame Examples
In this R Data Frame tutorial, you have learned what a data frame is, its usage and advantages, and how to create it, select rows and columns, rename columns, drop rows and columns, and more.
Happy Learning !!