R Data Frame Tutorial | Learn with Examples

In this R data frame tutorial, you will learn what a data frame is, its features and advantages, the packages commonly used with it, and how to work with data frames through practical examples.

1. What is a Data Frame in R?

A data frame in R is a crucial data structure for storing and manipulating structured data in a row-and-column format, similar to a table in a relational database or a spreadsheet. It is two-dimensional, with one dimension representing rows and the other representing columns. Each column in a data frame is a vector of the same length, meaning all columns must have the same number of elements.

In a data frame, columns are known as variables, and rows are known as observations. If you are new to R programming, I highly recommend checking out the R Programming Tutorial, where R concepts are explained with examples.

Here are some key characteristics of data frames in R:

Rectangular Structure: A data frame is a rectangular structure where data is organized into rows and columns. Each column represents a variable, and each row represents an observation or a case.
Homogeneous Columns: Different columns in a data frame can hold different data types, but all elements within a single column must have the same data type. This allows data frames to handle mixed data, such as numbers, characters, and logical values.
Column Names: Data frames have column names, which are usually used to label and reference variables or attributes. You can access individual columns using these column names.
Row Names: Data frames also have row names, which serve as row identifiers. By default, rows are labeled with sequential numbers, but you can assign custom row names if needed.
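
The characteristics above can be checked directly in the R console. The following is a minimal sketch using a small, made-up data frame; the name people and its columns and values are illustrative assumptions, not part of the tutorial's data.

```r
# Hypothetical sample data, for illustration only
people <- data.frame(
  id = c(1, 2, 3),
  name = c("a", "b", "c"),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rectangular structure: number of rows and columns
dim(people)

# Each column holds a single type, but types differ between columns
sapply(people, class)

# Column names and the default sequential row names
colnames(people)
rownames(people)
```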

The third-party package dplyr provides a grammar for data manipulation that works closely with data frames. To use it, you first need to install the package in R.

Advantages of R Data Frames:

Structure and Organization: Data frames provide a structured and organized way to store and work with tabular data. The two-dimensional structure, with rows and columns, makes it easy to understand and manipulate data.
Data Import and Export: Data frames are commonly used for importing data from various sources (e.g., CSV files, Excel spreadsheets, databases) and exporting data to different formats. R provides built-in functions and packages to facilitate these tasks.
Data Exploration and Summary: Data frames are compatible with functions for data exploration, including summary statistics, data visualization, and various plotting libraries. This helps analysts and data scientists gain insights into the data.
Data Manipulation: R provides a rich set of functions and packages (e.g., dplyr, tidyr) specifically designed for data manipulation with data frames. You can filter, transform, reshape, and aggregate data efficiently.

Use Cases of R Data Frames:

Data Analysis: Data frames are the foundation for data analysis in R. You can perform statistical tests, hypothesis testing, and regression analysis with structured data.
Data Visualization: Data frames are compatible with R’s data visualization packages (e.g., ggplot2), allowing you to create a wide range of charts, graphs, and visualizations for data exploration and presentation.
Data Cleaning and Preprocessing: Data frames are used to clean and preprocess data, including handling missing values, dealing with outliers, and standardizing data.
Data Subsetting and Filtering: Analysts use data frames to extract specific subsets of data based on criteria and conditions, facilitating focused analysis.
Merging and Joining Data: Data frames are essential for combining data from multiple sources. You can merge or join data based on common variables to create comprehensive datasets.
Grouped Operations: Packages like dplyr make it easy to perform grouped operations and aggregations on data, making it simple to compute group-wise statistics.
Machine Learning: Many machine learning algorithms in R require data frames as input. You can prepare your data in a data frame format before applying machine learning techniques.
Time Series Analysis: Data frames are used to store and analyze time series data, enabling time-based operations and modeling.
Reporting and Dashboards: Data frames are employed in creating reports and dashboards using RMarkdown, Shiny, and other reporting tools, providing a structured format for data presentation.
Export and Sharing: After analysis and modeling, you can export results as data frames for sharing with colleagues or use in other applications.

2. Initialize a Data Frame in R using data.frame()

To explore a data frame, the first step is to create one. You can easily create a data frame using the data.frame() function. To do this, pass vectors of the same length as arguments to the function. Each vector becomes a column in the data frame, and the length of each vector equals the number of rows in the data frame. Besides this, there are other ways to create a data frame in R.

2.1 Syntax of data.frame()

Below is the syntax of data.frame() function.


# Syntax of data.frame()
data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())

The following are the parameters of the data.frame() function.

row.names: It specifies the row names of the data frame. The default, row.names = NULL, means no custom row names are set and the rows are numbered sequentially. To assign row names, provide a vector of names here.
check.rows: This is a logical parameter. If set to TRUE, the rows are checked for consistency of length and names. This can help identify errors in the data input. By default, it’s set to FALSE.
check.names: Another logical parameter. When set to TRUE, the column names are checked to ensure they are syntactically valid and unique, and they are adjusted if necessary. For example, a space in a column name is replaced with a dot. Default is TRUE.
fix.empty.names: A logical parameter that determines whether to fix empty names in the column names. If set to TRUE, columns supplied without a name get an automatically constructed name. Default is TRUE.
stringsAsFactors: A logical parameter that determines whether character vectors are converted to factors (categorical variables). When set to TRUE, character vectors are converted to factors, and when set to FALSE, they remain character vectors. In R versions before 4.0.0, the default came from the global option options(stringsAsFactors = TRUE/FALSE), which was TRUE unless changed. Since R 4.0.0, the default is FALSE.
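
To make two of these parameters concrete, here is a small sketch of how check.names and stringsAsFactors change the result. The variable names (df_checked, df_raw, df_factor) and the sample values are made up for illustration.

```r
# check.names = TRUE (the default) repairs invalid names via make.names()
df_checked <- data.frame("first name" = c("x", "y"), check.names = TRUE)
names(df_checked)      # "first.name"

# check.names = FALSE keeps the name exactly as given
df_raw <- data.frame("first name" = c("x", "y"), check.names = FALSE)
names(df_raw)          # "first name"

# stringsAsFactors = TRUE converts character columns to factors
df_factor <- data.frame(name = c("x", "y"), stringsAsFactors = TRUE)
class(df_factor$name)  # "factor"
```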

2.2 Create R DataFrame Example

To initialize a data frame in R, you can use the data.frame() function, which takes vectors as its arguments. In R, a vector contains elements of the same data type, such as logical, integer, double, character, complex, or raw. Let’s create vectors of equal length and pass them into this function to get the data frame.


# Create Vectors
id <- c(10,11,12,13)
name <- c('sai','ram','deepika','sahithi')
dob <- as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))

# Create DataFrame using vector
df <- data.frame(id,name,dob)

# Print DataFrame
print("Create data frame:")
df

As you can see from the above example, I have created a data frame using vectors.

This prints a data frame with three columns (id, name, and dob) and four rows. By default, the data frame assigns sequential numbers as row names, starting from 1.

Alternatively, you can define the vectors directly inside the data.frame() call, which also lets you name each column explicitly.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))
)

# Print DataFrame
df

This yields the same output as above.

3. Get the Type of Data Frame

To get the type of each column of the data frame, you can use the sapply() function. Pass the data frame and the class function to it, and it returns the class of each column in the data frame.


# Display datatypes
print(sapply(df, class)) 

# Output (R versions before 4.0.0; from R 4.0.0 on, name is "character"):
#         id        name         dob 
#  "numeric"    "factor"      "Date"

You can also inspect the data frame using the str() function. Pass the data frame to it, and it displays the structure of the data frame, including the data type and the first few values of each column.


# Display datatypes
str(df)

# Output (R versions before 4.0.0)
# 'data.frame':	4 obs. of  3 variables:
#  $ id  : num  10 11 12 13
#  $ name: Factor w/ 4 levels "deepika","ram",..: 4 2 1 3
#  $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"

4. Set stringsAsFactors Param as FALSE

In R versions before 4.0.0, data.frame() converts every character column to a factor by default. This means that the values in the column are treated as categories. However, you often need these columns to be plain character strings instead. To do this, set the stringsAsFactors parameter to FALSE when creating the data frame. Since R 4.0.0, FALSE is already the default, so character columns stay as strings unless you set stringsAsFactors = TRUE.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  stringsAsFactors=FALSE
)

# Print DataFrame
str(df)

Yields below output.


# Output:
# 'data.frame':	4 obs. of  3 variables: 
#  $ id  : num  10 11 12 13
#  $ name: chr  "sai" "ram" "deepika" "sahithi"
#  $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" ...

5. Assign Custom Row Names to DataFrame

By default, a data frame numbers its rows sequentially, starting from 1, so that each row is uniquely identified. You can customize the row names using the row.names argument of data.frame(). Create a character vector of custom row names with the c() function, making sure its length matches the number of rows, and pass it as row.names.


# Create DataFrame with Row Names
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  row.names = c('row1','row2','row3','row4')
)
df

Yields below output.


# Output:
     id    name        dob
row1 10     sai 1990-10-02
row2 11     ram 1981-03-24
row3 12 deepika 1987-06-14
row4 13 sahithi 1985-08-16

You can also assign custom row names after creating the data frame by using the row.names() function. Let’s assign a vector of row names to the existing data frame.


# Assign row names to existing DataFrame
row.names(df) <- c('row1','row2','row3','row4')
df

6. Select Rows and Columns

Using base R bracket notation, you can select rows (observations) by index, by name, by column value, or by condition. You can also use the base R function subset() to get the same results. Besides these, the dplyr package provides the filter() function to select rows from a data frame.


# Select Rows by index
df[3,]

# Select Rows by vector of index values (df has only four rows)
df[c(1,3,4),]

# Select Rows by index range
df[2:4,]

# Select Rows by name
df['row3',]

# Select Rows by list of names
df[c('row1','row3'),]

# Using subset
subset(df, name %in% c("sai", "ram"))

# Load dplyr 
# Using dplyr::filter
library('dplyr')
filter(df, name %in% c("sai", "ram"))
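
Selecting rows by condition with bracket notation works by placing a logical expression in the row position. The sketch below rebuilds a reduced version of the example data frame (without the dob column and custom row names) so that it runs on its own; that reduction is an assumption made for brevity.

```r
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  stringsAsFactors = FALSE
)

# Rows where id is greater than 11 (all columns are kept)
df[df$id > 11, ]

# Combine conditions with & (and) or | (or)
df[df$id > 10 & df$name != "deepika", ]

# The same filter written with subset()
subset(df, id > 10 & name != "deepika")
```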

Similarly, you can select columns (variables) in R. In addition to bracket notation, you can use the dplyr select() function or the $ operator to select columns.


# R base - Select columns by name
df[,"name"]

# R base - Select columns from a vector of names
df[,c("name","dob")]

# R base - Select columns by index position
df[,c(2,3)]

# Load dplyr 
library('dplyr')

# dplyr - Select columns by list of index or position
df %>% select(c(2,3))

# Select columns by index range
df %>% select(2:3)
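
The $ operator mentioned above is not shown in the snippet, so here is a short sketch of it alongside the related [[ ]] operator. The data frame is a reduced copy of the tutorial's example, and the variable names (names_vec, same_vec, name_df, col) are illustrative.

```r
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  stringsAsFactors = FALSE
)

# $ returns the column as a plain vector
names_vec <- df$name

# [[ ]] does the same and also accepts a name stored in a variable
col <- "name"
same_vec <- df[[col]]

# Single brackets with only a name keep the result as a data frame
name_df <- df["name"]
```

Use $ or [[ ]] when you need the values themselves, and single brackets when the next step expects a data frame.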

7. Rename Column Names

To rename a column in R, use either the base R functions colnames() and names() or third-party packages like dplyr or data.table.


# Change second column to c2
colnames(df)[2] ="c2"

# Change the column name by name
colnames(df)[colnames(df) == "id"] ="c1"
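
The names() function can be used in the same way as colnames(). The following sketch uses a small stand-alone data frame, and the new column names (full_name, emp_id, emp_name) are made up for illustration.

```r
df <- data.frame(id = c(10, 11), name = c("sai", "ram"))

# names() is interchangeable with colnames() for a data frame
names(df)[names(df) == "name"] <- "full_name"

# Rename every column at once by assigning a full vector of names
names(df) <- c("emp_id", "emp_name")
names(df)
```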

You can also rename columns with the dplyr rename() function, which uses the form new_name = old_name.


# Load dplyr
library(dplyr)

# Change the column name - c1 to id
df <- df %>% 
    rename("id" = "c1")

# Rename multiple columns by name
# (an alternative to the call above; expects columns c1 and c2 to exist)
df <- df %>% rename("id" = "c1",
                          "name" = "c2")

# Rename multiple columns by index
df <- df %>% 
       rename(col1 = 1, col2 = 2)

8. Update Values

Cleaning is usually the first step in data processing, and it often requires replacing column values with other values.


# Replace String with Another String on a single column
df$name[df$name == 'ram'] <- 'ram krishna'
df

# Replaces on all columns
df[df=="ram"] <- "ram krishna"
df

# Replace the first matching sub string in each value with another String
library(stringr)
df$name <- str_replace(df$name, "r", "R")
print(df)

9. Drop Rows

Sometimes, as part of data cleaning, you need to delete rows from a data frame by index. You can use base R df[] notation with a negative row index to delete a single row or multiple rows. Let’s see some examples of deleting rows from a data frame.


# Remove specified row by index 
df1 <- df[-4,]
df1

# Output:
#   id    name        dob
# 1 10     sai 1990-10-02
# 2 11     ram 1981-03-24
# 3 12 deepika 1987-06-14


# Delete 4th, 3rd and 1st rows
df1 <- df[-c(4,3,1),]
df1

# Output:
#   id name        dob
# 2 11  ram 1981-03-24


# delete rows by range
df1 <- df[-(1:3),]
df1

# Output:
#   id    name        dob
# 4 13 sahithi 1985-08-16

10. Drop Columns

Similarly, you can use the base R bracket notation df[] to remove columns by index. Specify the column index or indexes within df[] and prefix them with the negative (-) operator to drop those columns. Note that when only one column remains, the result is simplified to a vector, as the last two outputs below show. Let’s see some examples.


# Remove Columns by Index
df1 <- df[,-2]
df1

# Output:
#   id        dob
# 1 10 1990-10-02
# 2 11 1981-03-24
# 3 12 1987-06-14
# 4 13 1985-08-16

# Remove specified range of columns (columns 2 to 3; df has three columns)
df1 <- df[,-(2:3)]
df1

# Output:
# [1] 10 11 12 13


# Remove Multiple columns
df1 <- df[,-c(2,3)]
df1

# Output:
# [1] 10 11 12 13
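
The last two results above are plain vectors because a single remaining column is simplified by default. If you would rather keep a data frame, or drop columns by name instead of position, the following sketch shows some common base R options. It rebuilds the example data frame so that it runs on its own; df1, df2, and df3 are illustrative names.

```r
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  dob = as.Date(c("1990-10-02", "1981-03-24", "1987-06-14", "1985-08-16"))
)

# drop = FALSE keeps the result as a data frame even with one column left
df1 <- df[, -c(2, 3), drop = FALSE]

# Remove columns by name rather than by index
df2 <- df[, !(names(df) %in% c("name", "dob")), drop = FALSE]

# Assigning NULL removes a single column from a copy of the data frame
df3 <- df
df3$dob <- NULL
```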

11. Handling Missing Values

You can use base R functions like na.omit(), complete.cases(), and rowSums() to remove rows that contain NA (missing values) from a data frame. Let’s see how to handle missing values.


# Create dataframe with 5 rows and 3 columns
df=data.frame(id=c(2,1,3,4,NA),
       name=c('sravan',NA,'chrisa','shivgami',NA),
       gender=c(NA,'m',NA,'f',NA))

# display dataframe
print(df)

# Remove rows with NA's using na.omit()
print(na.omit(df))

# Remove rows with NA's using complete.cases
print(df[complete.cases(df), ] )


# Remove rows with NA's using rowSums()
print(df[rowSums(is.na(df)) == 0, ]  )

# Output (the same for all three approaches)
#   id     name gender
# 4  4 shivgami      f
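
Removing every row that contains an NA can discard a lot of data, as the single remaining row shows. As a hedged sketch of two gentler options, the code below counts the missing values per column, drops only the rows that are entirely NA, and fills the rest with a placeholder. The placeholder value "unknown" and the names df_some and df_filled are assumptions made for illustration.

```r
df <- data.frame(
  id = c(2, 1, 3, 4, NA),
  name = c("sravan", NA, "chrisa", "shivgami", NA),
  gender = c(NA, "m", NA, "f", NA),
  stringsAsFactors = FALSE
)

# Count missing values in each column
colSums(is.na(df))

# Drop only the rows where every value is NA
df_some <- df[rowSums(is.na(df)) < ncol(df), ]

# Replace NA in one column with a placeholder instead of dropping the row
df_filled <- df_some
df_filled$gender[is.na(df_filled$gender)] <- "unknown"
```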

12. Join Data Frames

The base function merge() is used to join data frames in R. It supports inner, left, right, outer, and cross joins. The dplyr package, which is part of the tidyverse, supports all these basic joins and additionally anti-join and semi-join.


# Inner join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id")

# Inner join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

# Inner join on different columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Load dplyr package
library(dplyr)

# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - inner join on different columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package (reduce() comes from purrr, which tidyverse attaches)
library(tidyverse)

# Inner Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2
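
The snippets above assume that emp_df and dept_df already exist. As a minimal sketch, the code below defines two small hypothetical data frames (every column and value other than the dept_id key is made up) and shows how the all.x, all.y, all, and by arguments of merge() select the join type.

```r
# Hypothetical sample data; only the dept_id key comes from the tutorial
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4),
  emp_name = c("Smith", "Rose", "Williams", "Jones"),
  dept_id = c(10, 20, 10, 50),
  stringsAsFactors = FALSE
)
dept_df <- data.frame(
  dept_id = c(10, 20, 30),
  dept_name = c("Finance", "Marketing", "Sales"),
  stringsAsFactors = FALSE
)

inner <- merge(emp_df, dept_df, by = "dept_id")                # matching rows only
left  <- merge(emp_df, dept_df, by = "dept_id", all.x = TRUE)  # keep all employees
right <- merge(emp_df, dept_df, by = "dept_id", all.y = TRUE)  # keep all departments
outer <- merge(emp_df, dept_df, by = "dept_id", all = TRUE)    # keep everything
cross <- merge(emp_df, dept_df, by = NULL)                     # every pairing of rows
```

Rows without a match on the other side are filled with NA in the left, right, and outer joins.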

13. Sorting & Ordering DataFrame

By using the order() function, you can sort data frame rows by column value in either ascending or descending order. By default, this function puts all NA values last, and it provides the na.last option to put them first.


# Create Data Frame
df=data.frame(id=c(11,22,33,44,55),
          name=c("spark","python","R","jsp","java"),
          price=c(144,NA,321,567,567),
          publish_date= as.Date(
            c("2007-06-22", "2004-02-13", "2006-05-18",
              "2010-09-02","2007-07-20"))
          )

# Sort Data Frame
df2 <- df[order(df$price),]

# Sort by multiple columns
df2 <- df[order(df$price,df$name ),]

# Sort descending order
df2 <- df[order(df$price,decreasing=TRUE),]

# Sort by putting NA top
df2 <- df[order(df$price,decreasing=TRUE, na.last=FALSE),]
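
The decreasing argument used above applies to all sort keys at once. When one column should be descending and another ascending, a common base R idiom is to negate the numeric key. The sketch below uses a reduced copy of the example data, and df_sorted is an illustrative name.

```r
df <- data.frame(
  name = c("spark", "python", "R", "jsp", "java"),
  price = c(144, NA, 321, 567, 567),
  stringsAsFactors = FALSE
)

# Negating price sorts it descending; name breaks ties in ascending order
df_sorted <- df[order(-df$price, df$name), ]
df_sorted$name   # "java" "jsp" "R" "spark" "python"
```

The row with the NA price still ends up last, because na.last defaults to TRUE.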

14. Import the CSV File into Data Frame

Alternatively, you can create a data frame by reading data from external sources like CSV files using the read.csv() function.

Let’s read the CSV file and create a data frame.


# Create DataFrame from CSV file
df = read.csv('/Users/admin/file.csv')
df
# Check the Datatypes
str(df)

This yields the same data as before, but with different column types than the data frame built with data.frame(): id is read as integer and dob as character.


# Output
# 'data.frame':	4 obs. of  3 variables:
#  $ id  : int  10 11 12 13
#  $ name: chr  "sai" "ram" "deepika" "sahithi"
#  $ dob : chr  "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"
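
Because dob is read as character, you usually convert it to a Date after importing. The following sketch writes a small CSV to a temporary file so that it runs without /Users/admin/file.csv. The file contents and the names csv_path and df2 are assumptions made for illustration.

```r
# Write a small CSV to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,dob",
             "10,sai,1990-10-02",
             "11,ram,1981-03-24"), csv_path)

df <- read.csv(csv_path, stringsAsFactors = FALSE)

# dob is read as character; convert it to Date explicitly
df$dob <- as.Date(df$dob)
str(df)

# Alternatively, declare the column types while reading
df2 <- read.csv(csv_path,
                colClasses = c("integer", "character", "Date"))
```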

15. Conclusion

In this R data frame tutorial, you have learned what a data frame is, its usage and advantages, and how to create one, select rows and columns, rename columns, drop rows and columns, and more. Data frames are a fundamental and versatile data structure in R, and they play a crucial role in various aspects of data analysis, from data preparation and exploration to modeling and reporting. Their flexibility and compatibility with R’s vast ecosystem of packages make them a powerful tool for data scientists and analysts.

Happy Learning !!
