R Data Frame Tutorial | Learn with Examples

In this R data frame tutorial with examples, you will learn what a data frame is, its features and advantages, the packages that work with it, and how to use data frames with sample examples.

All examples provided in this R data frame tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn R data frames and advance their careers.

1. What is a Data Frame in R?

A data frame in R is a fundamental data structure used for storing and manipulating structured data in rows and columns, similar to an RDBMS table or spreadsheet. It is a two-dimensional data structure in which one dimension refers to the rows and the other refers to the columns. Each column in the data frame is a Vector, and all columns in the data frame must have the same length.

Data frame columns are referred to as variables and rows are referred to as observations. If you are new to R Programming, I would highly recommend reading the R Programming Tutorial where I have explained R concepts with examples.

Here are some key characteristics of data frames in R:

  1. Rectangular Structure: A data frame is a rectangular structure where data is organized into rows and columns. Each column represents a variable, and each row represents an observation or a case.
  2. Homogeneous Columns: Different columns in a data frame can hold different data types, but all elements within a single column must have the same data type. This allows data frames to handle mixed data, such as numbers, characters, and logical values.
  3. Column Names: Data frames have column names, which are usually used to label and reference variables or attributes. You can access individual columns using these column names.
  4. Row Names: Data frames also have row names, which serve as row identifiers. By default, rows are labeled with sequential numbers, but you can assign custom row names if needed.
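
The characteristics above can be checked on a small data frame. The following is a minimal sketch; the data frame df_demo and its values are made up for illustration and are not part of the tutorial's data.

```r
# Hypothetical sample data, used only for illustration
df_demo <- data.frame(
  id = c(1, 2, 3),
  name = c("a", "b", "c"),
  active = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Rectangular structure: number of rows and columns
dim(df_demo)

# Each column holds a single data type, but types differ across columns
sapply(df_demo, class)

# Column names and the default (sequential) row names
names(df_demo)
row.names(df_demo)
```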

R also has a third-party package, dplyr, which provides a grammar for data manipulation that works closely with data frames. In order to use this package, you need to install it first by using install.packages('dplyr') and then load it by using library(dplyr).

Advantages of R Data Frames:

  1. Structure and Organization: Data frames provide a structured and organized way to store and work with tabular data. The two-dimensional structure, with rows and columns, makes it easy to understand and manipulate data.
  2. Data Import and Export: Data frames are commonly used for importing data from various sources (e.g., CSV files, Excel spreadsheets, databases) and exporting data to different formats. R provides built-in functions and packages to facilitate these tasks.
  3. Data Exploration and Summary: Data frames are compatible with functions for data exploration, including summary statistics, data visualization, and various plotting libraries. This helps analysts and data scientists gain insights into the data.
  4. Data Manipulation: R provides a rich set of functions and packages (e.g., dplyr, tidyr) specifically designed for data manipulation with data frames. You can filter, transform, reshape, and aggregate data efficiently.

Use Cases of R Data Frames:

  1. Data Analysis: Data frames are the foundation for data analysis in R. You can perform statistical tests, hypothesis testing, and regression analysis with structured data.
  2. Data Visualization: Data frames are compatible with R’s data visualization packages (e.g., ggplot2), allowing you to create a wide range of charts, graphs, and visualizations for data exploration and presentation.
  3. Data Cleaning and Preprocessing: Data frames are used to clean and preprocess data, including handling missing values, dealing with outliers, and standardizing data.
  4. Data Subsetting and Filtering: Analysts use data frames to extract specific subsets of data based on criteria and conditions, facilitating focused analysis.
  5. Merging and Joining Data: Data frames are essential for combining data from multiple sources. You can merge or join data based on common variables to create comprehensive datasets.
  6. Grouped Operations: Packages like dplyr make it easy to perform grouped operations and aggregations on data, making it simple to compute group-wise statistics.
  7. Machine Learning: Many machine learning algorithms in R require data frames as input. You can prepare your data in a data frame format before applying machine learning techniques.
  8. Time Series Analysis: Data frames are used to store and analyze time series data, enabling time-based operations and modeling.
  9. Reporting and Dashboards: Data frames are employed in creating reports and dashboards using RMarkdown, Shiny, and other reporting tools, providing a structured format for data presentation.
  10. Export and Sharing: After analysis and modeling, you can export results as data frames for sharing with colleagues or use in other applications.

2. Create a Data Frame in R using data.frame()

The first step in exploring a data frame is creating it. The data.frame() function creates a data frame in an easy way. A data frame is a list of variables that all have the same number of rows, with unique row names. Besides this, there are different ways to create a data frame in R.

2.1 Syntax of data.frame()

The following is the syntax of data.frame() function.


# Syntax of data.frame() (R 4.0.0 and later)
# Before R 4.0.0 the default was stringsAsFactors = default.stringsAsFactors()
data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = FALSE)

You need to follow the below guidelines when creating a DataFrame in R using data.frame() function.

  • The input objects passed to data.frame() should have the same number of rows.
  • The column names should be non-empty.
  • Duplicate column names are allowed, but you need to use check.names = FALSE.
  • You can assign names to rows using row.names param.
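  • Character variables passed to data.frame() are converted to factor columns only in R versions before 4.0.0; from R 4.0.0 onward they stay as character unless you set stringsAsFactors = TRUE.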
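
A few of these guidelines can be tried out directly. The sketch below uses made-up column names and values (a, b, x) purely for illustration; the error message text in the handler is my own wording, not R's.

```r
# Columns whose lengths differ and cannot be recycled raise an error
res <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error: differing number of rows"
)
res

# check.names = TRUE (the default) makes duplicate column names unique
df_fixed <- data.frame(x = 1:2, x = 3:4)
names(df_fixed)

# check.names = FALSE keeps duplicate column names as given
df_dup <- data.frame(x = 1:2, x = 3:4, check.names = FALSE)
names(df_dup)
```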

2.2 Create R DataFrame Example

Now, let’s create a data frame by using the data.frame() function. Its arguments are the columns, given as vectors or lists. In R, a Vector contains elements of the same type, and the data types can be logical, integer, double, character, complex or raw. You can create a Vector using c().


# Create Vectors
id <- c(10,11,12,13)
name <- c('sai','ram','deepika','sahithi')
dob <- as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))

# Create DataFrame
df <- data.frame(id,name,dob)

# Print DataFrame
df 

In the above example, I have used the following Vectors as arguments to the data.frame() function, separated by commas to create a data frame.

  • id – Numeric Vector which stores the numeric values.
  • name – Character Vector which stores the character values.
  • dob – Date Vector which stores the date values.

The above example yields the below output. R will create a data frame with the column names/variables with the same names we used for Vector. You can also use print(df) to print the data frame to the console.


# Output:
  id    name        dob
1 10     sai 1990-10-02
2 11     ram 1981-03-24
3 12 deepika 1987-06-14
4 13 sahithi 1985-08-16

Notice that by default it adds an incremental sequence number as the row name of each row in a data frame.

Alternatively, you can create a data frame by passing the vectors directly to the function. Both approaches create the same data frame.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16'))
)

# Print DataFrame
df

3. Check the Data Frame data types

Let’s check the data types of the created Data Frame by using print(sapply(df, class)). Note that I have not specified the data types of the columns while creating it; hence, R automatically infers the data type based on the data.


# Display datatypes
print(sapply(df, class))

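# Output (R versions before 4.0.0):
#         id        name         dob 
#  "numeric"    "factor"      "Date"
# In R 4.0.0 and later, name is "character" because stringsAsFactors defaults to FALSE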

You can also use str(df) to check the data types.


# Display datatypes
str(df)

# Output (R versions before 4.0.0; in R 4.0.0 and later, name is shown as chr)
'data.frame':	4 obs. of  3 variables:
 $ id  : num  10 11 12 13
 $ name: Factor w/ 4 levels "deepika","ram",..: 4 2 1 3
 $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"

4. Using stringsAsFactors Param for Character Data Types

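If you notice above, the name column holds characters but its data type is factor. In R versions before 4.0.0, data.frame() converts character columns to factors by default; from R 4.0.0 onward, the default is stringsAsFactors = FALSE and character columns stay as characters.

You can control this behavior explicitly by passing the stringsAsFactors param while creating a data frame. Set it to FALSE (the logical constant must be written in uppercase) to keep character columns as characters.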


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  stringsAsFactors=FALSE
)

# Print DataFrame
str(df)

Yields below output.


# Output:
'data.frame':	4 obs. of  3 variables:
 $ id  : num  10 11 12 13
 $ name: chr  "sai" "ram" "deepika" "sahithi"
 $ dob : Date, format: "1990-10-02" "1981-03-24" "1987-06-14" ...
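
If you need a factor for only one column, you can also convert it explicitly after creating the data frame, regardless of the stringsAsFactors setting. This is a small sketch; df_f is a hypothetical copy of the sample data used only here.

```r
# Hypothetical copy of the sample data with character names
df_f <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  stringsAsFactors = FALSE
)

# Convert only the name column to a factor
df_f$name <- factor(df_f$name)
cls_factor <- class(df_f$name)
n_levels <- nlevels(df_f$name)

# Convert the factor back to character
df_f$name <- as.character(df_f$name)
cls_char <- class(df_f$name)
```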

5. Assign Row Names to DataFrame

You can assign custom names to the R DataFrame rows while creating it. Use the row.names param and assign a vector with the row names. Note that the length of the vector you pass to row.names must exactly match the number of rows in the data frame, and the names must be unique.


# Create DataFrame with Row Names
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  row.names = c('row1','row2','row3','row4')
)
df

Yields below output.


# Output:
     id    name        dob
row1 10     sai 1990-10-02
row2 11     ram 1981-03-24
row3 12 deepika 1987-06-14
row4 13 sahithi 1985-08-16

If you already have a data frame, you can use the below approach to assign or change the row names.


# Assign row names to existing DataFrame
row.names(df) <- c('row1','row2','row3','row4')
df

6. Select Rows and Columns

Using R base bracket notation, you can select rows/observations by index, by name, by column value, or by condition. You can also use the R base function subset() to get the same results. Besides these, the dplyr package provides the filter() function to get rows from the DataFrame. The examples that select rows by name assume the row names row1 to row4 assigned in the previous section.


# Select Rows by index
df[3,]

# Select Rows by list of index values
# (df has 4 rows; an index beyond the last row returns a row of NA values)
df[c(1,3,4),]

# Select Rows by index range
df[2:4,]

# Select Rows by name
df['row3',]

# Select Rows by list of names
df[c('row1','row3'),]

# Using subset
subset(df, name %in% c("sai", "ram"))

# Load dplyr 
# Using dplyr::filter
library('dplyr')
filter(df, name %in% c("sai", "ram"))
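
Selecting rows by condition with bracket notation works by passing a logical vector as the row index. The sketch below assumes a hypothetical data frame df_rows with the same id and name values as the sample data.

```r
# Hypothetical data frame mirroring the sample data
df_rows <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  stringsAsFactors = FALSE
)

# Rows where id is greater than 11
r1 <- df_rows[df_rows$id > 11, ]

# Combine conditions with & (and) or | (or)
r2 <- df_rows[df_rows$id > 10 & df_rows$name != "deepika", ]

# The same kind of condition with subset()
r3 <- subset(df_rows, id >= 12)
```

Keep the trailing comma inside the brackets; without it, the logical vector would be used to select columns instead of rows.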

Similarly, you can also select columns or variables in R. Additionally, use the dplyr select() function or the dollar operator ($) to select columns.


# R base - Select columns by name
df[,"name"]

# R base - Select columns from list
# (the sample data frame has no gender column, so name and dob are used)
df[,c("name","dob")]

# R base - Select columns by index position
df[,c(2,3)]

# Load dplyr 
library('dplyr')

# dplyr - Select columns by list of index or position
df %>% select(c(2,3))

# Select columns by index range
df %>% select(2:3)
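
The dollar operator and double brackets return a single column as a vector, while single brackets with drop = FALSE keep the result as a data frame. A minimal sketch, assuming a hypothetical data frame df_cols with the sample id and name values:

```r
# Hypothetical data frame mirroring the sample data
df_cols <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  stringsAsFactors = FALSE
)

# Dollar operator: returns the column as a vector
names_vec <- df_cols$name

# Double brackets: also returns a vector, and accepts a name stored in a variable
col <- "id"
ids_vec <- df_cols[[col]]

# Single brackets with drop = FALSE: returns a one-column data frame
name_df <- df_cols[, "name", drop = FALSE]
```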

7. Rename Column Names

To rename a column in R, use either the R base functions colnames() and names() or third-party packages like dplyr or data.table.


# Change second column to c2
colnames(df)[2] ="c2"

# Change the column name by name
colnames(df)[colnames(df) == "id"] ="c1"
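
For a data frame, names() can be used in the same way as colnames(). The following sketch uses a hypothetical two-column data frame df_names so it can be run on its own.

```r
# Hypothetical data frame used only for this example
df_names <- data.frame(id = 1:2, name = c("a", "b"))

# Rename the second column by position
names(df_names)[2] <- "c2"

# Rename a column by its current name
names(df_names)[names(df_names) == "id"] <- "c1"
after_partial <- names(df_names)

# Rename all columns at once by assigning a full vector of names
names(df_names) <- c("id", "name")
```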

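Alternatively, use the dplyr rename() function to rename columns. The syntax is new_name = old_name.


# Load dplyr
library(dplyr)

# Change the column name - c1 to id
df <- df %>% 
    rename("id" = "c1")

# Rename multiple columns by name
# (alternative to the single rename above; assumes df still has columns c1 and c2)
df <- df %>% rename("id" = "c1",
                          "name" = "c2")

# Rename multiple columns by index
df <- df %>% 
       rename(col1 = 1, col2 = 2)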

8. Update Values

Cleaning the data is usually the first step of data processing, and as part of the cleaning you often need to replace column values with another value.


# Replace String with Another String on a single column
df$name[df$name == 'ram'] <- 'ram krishna'
df

# Replaces on all columns
df[df=="ram"] <- "ram krishna"
df

# Replace sub string with another String (str_replace() replaces only the first match in each value)
library(stringr)
df$name <- str_replace(df$name, "r", "R")
print(df)

9. Drop Rows and Columns

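You can drop rows and columns from a data frame by using negative indexes in bracket notation, by keeping only the rows that do not match a condition, or by assigning NULL to a column.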
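
The following is a minimal base R sketch of dropping rows and columns. The data frame df_drop is a hypothetical copy of the sample data, and the result variable names are made up for illustration.

```r
# Hypothetical copy of the sample data
df_drop <- data.frame(
  id = c(10, 11, 12, 13),
  name = c("sai", "ram", "deepika", "sahithi"),
  dob = as.Date(c("1990-10-02", "1981-03-24", "1987-06-14", "1985-08-16")),
  stringsAsFactors = FALSE
)

# Drop rows by index using negative indexing
no_row2 <- df_drop[-2, ]
no_rows_1_3 <- df_drop[-c(1, 3), ]

# Drop rows by condition: keep only the rows that do not match
no_ram <- df_drop[df_drop$name != "ram", ]

# Drop a column by index
no_col2 <- df_drop[, -2]

# Drop columns by name
no_dob <- df_drop[, !(names(df_drop) %in% c("dob"))]

# Drop a column in place by assigning NULL
df_drop$dob <- NULL
```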



10. Handling Missing Values
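
Missing values in R are represented by NA. The sketch below shows a few common base R ways to detect, remove, and replace them; the data frame df_na and its values are hypothetical, and replacing NA with 0 is only one possible choice.

```r
# Hypothetical data frame with missing values
df_na <- data.frame(
  id = c(1, 2, 3, 4),
  price = c(144, NA, 321, NA),
  name = c("spark", NA, "R", "java"),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(df_na))

# Keep only the rows that have no missing values
complete_rows <- na.omit(df_na)

# Ignore NA values when computing a statistic
avg_price <- mean(df_na$price, na.rm = TRUE)

# Replace NA in a single column with a fixed value
df_na$price[is.na(df_na$price)] <- 0
```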



11. Join Data Frames

The base function merge() is used to join data frames in R; it supports inner, left, right, outer, and cross joins. The dplyr package (part of the tidyverse) supports all these basic joins and additionally anti-join and semi-join. The examples below assume two existing data frames, emp_df and dept_df, that share the columns dept_id and dept_branch_id.


# Inner join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id")

# Inner join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

# Inner join using by.x and by.y (use these when the key columns are named differently in the two data frames)
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Load dplyr package
library(dplyr)

# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - inner join on different columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package (reduce() comes from purrr, which tidyverse attaches)
library(tidyverse)

# Inner Join a list of data.frames
list_df <- list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2
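
The other join types of merge() are controlled by the all, all.x, and all.y arguments. The sketch below is self-contained and uses two small hypothetical data frames; their columns and values are assumptions made for illustration, and they share only the dept_id column.

```r
# Hypothetical employee and department data
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4),
  emp_name = c("Smith", "Rose", "Williams", "Jones"),
  dept_id = c(10, 20, 10, 50),
  stringsAsFactors = FALSE
)
dept_df <- data.frame(
  dept_id = c(10, 20, 30),
  dept_name = c("Finance", "Marketing", "Sales"),
  stringsAsFactors = FALSE
)

# Inner join: only dept_id values present in both data frames
inner <- merge(x = emp_df, y = dept_df, by = "dept_id")

# Left join: all rows from emp_df
left <- merge(x = emp_df, y = dept_df, by = "dept_id", all.x = TRUE)

# Right join: all rows from dept_df
right <- merge(x = emp_df, y = dept_df, by = "dept_id", all.y = TRUE)

# Outer (full) join: all rows from both data frames
outer <- merge(x = emp_df, y = dept_df, by = "dept_id", all = TRUE)

# Cross join: every combination of rows
cross <- merge(x = emp_df, y = dept_df, by = NULL)
```

Rows without a match on the other side get NA in the columns that come from that side.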

12. Sorting & Ordering DataFrame

By using the order() function you can sort data.frame rows by column value, in either ascending (the default) or descending order. By default, this function puts all NA values last and provides the na.last option to put them first.


# Create Data Frame
df=data.frame(id=c(11,22,33,44,55),
          name=c("spark","python","R","jsp","java"),
          price=c(144,NA,321,567,567),
          publish_date= as.Date(
            c("2007-06-22", "2004-02-13", "2006-05-18",
              "2010-09-02","2007-07-20"))
          )

# Sort Data Frame
df2 <- df[order(df$price),]

# Sort by multiple columns
df2 <- df[order(df$price,df$name ),]

# Sort descending order
df2 <- df[order(df$price,decreasing=TRUE),]

# Sort by putting NA top
df2 <- df[order(df$price,decreasing=TRUE, na.last=FALSE),]
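
To sort one column in descending order and another in ascending order, one common approach is to negate the numeric column inside order(). A small sketch, using a hypothetical data frame df_sort with the same name and price values as above:

```r
# Hypothetical data frame mirroring the name and price columns above
df_sort <- data.frame(
  name = c("spark", "python", "R", "jsp", "java"),
  price = c(144, NA, 321, 567, 567),
  stringsAsFactors = FALSE
)

# Sort by price descending, then by name ascending to break ties
sorted <- df_sort[order(-df_sort$price, df_sort$name), ]
sorted$name
# → "java" "jsp" "R" "spark" "python"  (the NA price sorts last)
```

Negation works only for numeric columns; for character columns, the decreasing argument is the simpler option.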

13. Import the CSV File into Data Frame

If you have a CSV file with columns separated by a delimiter like a comma, pipe etc., you can easily import it into an R Data Frame by using the read.csv() function. This function reads the CSV file and converts it into a DataFrame.

Let’s read the CSV file and create a data frame. Note that read.csv() by default considers you have a comma-delimited CSV file.


# Create DataFrame from CSV file
df = read.csv('/Users/admin/file.csv')
df
# Check the Datatypes
str(df)

This yields a DataFrame similar to the above, but certain columns are assigned different data types. For example, the id column is read as an integer and the dob column is read as a character. I will cover in a separate article how to change the data type.


# Output
'data.frame':	4 obs. of  3 variables:
 $ id  : int  10 11 12 13
 $ name: chr  "sai" "ram" "deepika" "sahithi"
 $ dob : chr  "1990-10-02" "1981-03-24" "1987-06-14" "1985-08-16"
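
For a file that uses a different delimiter, read.csv() accepts a sep argument, and a character column such as dob can be converted afterwards with as.Date(). The sketch below reads from an inline string through the text argument so it runs without a file; the pipe-delimited content is made up for illustration.

```r
# Hypothetical pipe-delimited content, passed inline instead of a file path
csv_text <- "id|name|dob
10|sai|1990-10-02
11|ram|1981-03-24"

# Read the pipe-delimited data
df_csv <- read.csv(text = csv_text, sep = "|", stringsAsFactors = FALSE)

# dob is read as character; convert it to Date
df_csv$dob <- as.Date(df_csv$dob)

# Check the data types
str(df_csv)
```

With a real file, pass the file path as the first argument instead of text.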

14. Other Data Frame Examples

15. Conclusion

In this R Data Frame tutorial, you have learned what a data frame is, its usage and advantages, how to create it, select rows and columns, rename columns, drop rows and columns, and many more examples. Data frames are a fundamental and versatile data structure in R, and they play a crucial role in various aspects of data analysis, from data preparation and exploration to modeling and reporting. Their flexibility and compatibility with R’s vast ecosystem of packages make them a powerful tool for data scientists and analysts.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
