Subset Data Frame in R with Examples

To subset a data frame by rows and columns in R, use the subset() function, filter() from the dplyr package, or base R square bracket notation df[]. subset() is a generic R function that subsets a data frame by rows and columns (in R terms, observations and variables) based on one or more conditions. It also works on vectors and matrices.

In this article, I will explain different ways of subsetting a data frame by rows and columns: the subset() function, df[] notation, and filter() from dplyr.

1. Create DataFrame

Let's create a data frame in R, run the examples that subset it by rows and columns, and explore the output.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
df

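Yields the below output.

# Output:
#    id    name gender        dob state
# r1 10     sai      M 1990-10-02    CA
# r2 11     ram      M 1981-03-24    NY
# r3 12 deepika   <NA> 1987-06-14  <NA>
# r4 13 sahithi      F 1985-08-16  <NA>
# r5 14   kumar      M 1995-03-02    DC
# r6 15   scott      M 1991-06-21    DW
# r7 16     Don      M 1986-03-24    AZ
# r8 17     Lin      F 1990-08-26    PH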

2. Subset DataFrame by Rows

In R, the subset() function subsets a data frame by observations (rows) and variables (columns). It also works on vectors and matrices.
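Since the article's examples focus on data frames, here is a minimal sketch of subset() applied to a vector and to a matrix. The objects v, m, v_big, and m_sub are made up for illustration and do not appear elsewhere in this article.

```r
# Hypothetical numeric vector with a missing value
v <- c(5, NA, 12, 8, 20)

# Keep elements greater than 7; the NA element is dropped
v_big <- subset(v, v > 7)
v_big

# Hypothetical 3 x 2 integer matrix with columns "a" and "b"
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# Keep rows where column "a" is at least 2, and only column "b"
m_sub <- subset(m, m[, "a"] >= 2, select = b)
m_sub
```

Because drop defaults to FALSE, m_sub stays a matrix with one column rather than collapsing to a vector.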

2.1 Syntax of the subset()

The following is the syntax of the subset() function:


# Syntax of the subset() function
subset(x, subset, select, drop = FALSE, ...)

This function takes four main arguments: x is the input object; subset is a logical expression that indicates which rows to keep (rows where it evaluates to NA are dropped); select is an expression that indicates which columns to keep; and drop is passed on to the [ indexing operator.

It returns the rows and columns of the data frame that match, based on one or more conditions.
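As a rough sketch of how these arguments fit together, the snippet below recreates the df from section 1 so it runs on its own. The stringsAsFactors = FALSE argument is my addition so that the result is the same on R versions before 4.0, and the names res_df and res_vec are illustrative only.

```r
# Recreate the article's data frame so this snippet is self-contained
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names = c('r1','r2','r3','r4','r5','r6','r7','r8'),
  stringsAsFactors = FALSE  # assumption: added for consistent behavior on older R
)

# subset = rows to keep, select = columns to keep
res_df <- subset(df, subset = id > 13, select = c(name, state))
res_df

# drop = TRUE simplifies a single selected column to a plain vector
res_vec <- subset(df, subset = id > 13, select = name, drop = TRUE)
res_vec
```

With the default drop = FALSE the result stays a data frame even when only one column is selected.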


# subset a data frame by a specified row name
subset(df, subset=rownames(df) == 'r1')

# Output:
#    id name gender        dob state
# r1 10  sai      M 1990-10-02    CA 

# subset a data frame by a vector of row names (multiple rows)
subset(df, rownames(df) %in% c('r1','r2','r3'))

# Output:
#    id    name gender        dob state
# r1 10     sai      M 1990-10-02    CA
# r2 11     ram      M 1981-03-24    NY
# r3 12 deepika   <NA> 1987-06-14  <NA>

# subset a data frame based on a condition
subset(df, gender == 'M')

# Output:
#    id  name gender        dob state
# r1 10   sai      M 1990-10-02    CA
# r2 11   ram      M 1981-03-24    NY
# r5 14 kumar      M 1995-03-02    DC
# r6 15 scott      M 1991-06-21    DW
# r7 16   Don      M 1986-03-24    AZ

# subset a data frame by condition with %in%
subset(df, state %in% c('CA','DC'))

# Output:
#    id  name gender        dob state
# r1 10   sai      M 1990-10-02    CA
# r5 14 kumar      M 1995-03-02    DC

# subset a data frame by multiple conditions using |
subset(df, gender == 'M' | state == 'PH')

# Output:
#    id  name gender        dob state
# r1 10   sai      M 1990-10-02    CA
# r2 11   ram      M 1981-03-24    NY
# r5 14 kumar      M 1995-03-02    DC
# r6 15 scott      M 1991-06-21    DW
# r7 16   Don      M 1986-03-24    AZ
# r8 17   Lin      F 1990-08-26    PH


# subset a data frame by multiple conditions using &
subset(df, gender == 'M' & state %in% c('CA','NY'))

# Output:
#    id name gender        dob state
# r1 10  sai      M 1990-10-02    CA
# r2 11  ram      M 1981-03-24    NY

2.2 Using df[] Notation

With bracket notation, you can subset a data frame by rows using a single row index, a vector or range of row indexes, column values, or one or more conditions.


# Subset a data frame by Row Index
df[3,]

# Output:
#    id    name gender        dob state
# r3 12 deepika   <NA> 1987-06-14  <NA>

# Subset a data frame by a vector of row indexes
df[c(3,4,6),]

# Output:
#    id    name gender        dob state
# r3 12 deepika   <NA> 1987-06-14  <NA>
# r4 13 sahithi      F 1985-08-16  <NA>
# r6 15   scott      M 1991-06-21    DW

# Select Rows by Index Range
df[3:6,]

# Output:
#    id    name gender        dob state
# r3 12 deepika   <NA> 1987-06-14  <NA>
# r4 13 sahithi      F 1985-08-16  <NA>
# r5 14   kumar      M 1995-03-02    DC
# r6 15   scott      M 1991-06-21    DW

# Subset a data frame by column value
# (the row whose gender is NA comes back as an all-NA row)
df[df$gender == 'M',]

# Output:
#    id  name gender        dob state
# r1 10   sai      M 1990-10-02    CA
# r2 11   ram      M 1981-03-24    NY
# NA NA  <NA>   <NA>       <NA>  <NA>
# r5 14 kumar      M 1995-03-02    DC
# r6 15 scott      M 1991-06-21    DW
# r7 16   Don      M 1986-03-24    AZ

# Subset a data frame by a vector of column values
df[df$state %in% c('CA','AZ','PH'),]

# Output:
#    id name gender        dob state
# r1 10  sai      M 1990-10-02    CA
# r7 16  Don      M 1986-03-24    AZ
# r8 17  Lin      F 1990-08-26    PH

# Subset a data frame by rows based on multiple conditions
df[df$gender == 'M' & df$id > 15,]

# Output:
#    id name gender        dob state
# r7 16  Don      M 1986-03-24    AZ
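The df[df$gender == 'M',] example above returns an extra all-NA row because comparing NA with 'M' yields NA, and an NA row index produces a row of NAs. The sketch below shows two common ways to avoid that, which() and an explicit is.na() check. It recreates df so it runs on its own; stringsAsFactors = FALSE and the names with_na, no_na, and no_na2 are my additions for illustration.

```r
# Recreate the article's data frame so this snippet is self-contained
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names = c('r1','r2','r3','r4','r5','r6','r7','r8'),
  stringsAsFactors = FALSE  # assumption: added for consistent behavior on older R
)

# The logical index contains an NA for row r3, which yields an all-NA row
with_na <- df[df$gender == 'M', ]
nrow(with_na)

# which() returns only the positions where the condition is TRUE
no_na <- df[which(df$gender == 'M'), ]
no_na

# Alternative: exclude missing values explicitly
no_na2 <- df[!is.na(df$gender) & df$gender == 'M', ]
```

Both alternatives return the same five rows that subset(df, gender == 'M') returns, since subset() drops rows where the condition is NA.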

3. Subset DataFrame Columns

In this section, I will cover how to subset data frame columns by using the subset() function and df[] notation.

3.1 Using subset() Function

The examples below subset data frame columns by name and by index.


# subset a data frame by column names
subset(df,gender=='M',select=c('id','name','gender'))

# Output:
#    id  name gender
# r1 10   sai      M
# r2 11   ram      M
# r5 14 kumar      M
# r6 15 scott      M
# r7 16   Don      M

# subset a data frame by column indexes
subset(df,gender=='M',select=c(1,2,3))

# Output:
# The output is the same as above
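The select argument of subset() also accepts unquoted column names, a negative selection to drop columns, and a name range with the : operator. The sketch below illustrates these forms on a recreated copy of df; stringsAsFactors = FALSE and the names no_dob and first_three are my additions for illustration.

```r
# Recreate the article's data frame so this snippet is self-contained
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names = c('r1','r2','r3','r4','r5','r6','r7','r8'),
  stringsAsFactors = FALSE  # assumption: added for consistent behavior on older R
)

# Drop one column with a negative selection
no_dob <- subset(df, gender == 'M', select = -dob)
names(no_dob)

# Select a contiguous range of columns by name
first_three <- subset(df, gender == 'M', select = id:gender)
names(first_three)
```

These forms work because subset() evaluates select with the column names standing in for their positions, so id:gender behaves like 1:3 here.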

3.2 Using df[] Notation

You can also subset columns with df[] notation. In the following, the first example gets the columns at indices 2 and 3, and the second gets the same result using the column names.


# Subset a data frame by vector of columns with indices 2 & 3
df[,c(2,3)]
# or
# Subset a data frame by vector of columns with name and gender
df[,c('name','gender')]

# Output:
#       name gender
# r1     sai      M
# r2     ram      M
# r3 deepika   <NA>
# r4 sahithi      F
# r5   kumar      M
# r6   scott      M
# r7     Don      M
# r8     Lin      F
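One detail worth knowing when selecting a single column with df[] notation is that the result is simplified to a vector unless drop = FALSE is given. The sketch below illustrates this on a recreated copy of df; stringsAsFactors = FALSE and the names name_vec, name_df, and name_df2 are my additions for illustration.

```r
# Recreate the article's data frame so this snippet is self-contained
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names = c('r1','r2','r3','r4','r5','r6','r7','r8'),
  stringsAsFactors = FALSE  # assumption: added for consistent behavior on older R
)

# A single column selected with [, ] is simplified to a vector
name_vec <- df[, 'name']
class(name_vec)

# drop = FALSE keeps the result as a one-column data frame
name_df <- df[, 'name', drop = FALSE]
class(name_df)

# Single-bracket indexing without a comma also keeps a data frame
name_df2 <- df['name']
class(name_df2)
```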

4. Using filter() Function

Similarly, you can subset a data frame by using the filter() function from the dplyr package. To use it, first install the package with install.packages('dplyr') and load it with library(dplyr).


# Using dplyr::filter subset a data frame
dplyr::filter(df, state %in% c("CA", "AZ", "PH"))

# Output:
#    id name gender        dob state
# r1 10  sai      M 1990-10-02    CA
# r7 16  Don      M 1986-03-24    AZ
# r8 17  Lin      F 1990-08-26    PH
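filter() also accepts several conditions separated by commas, which are combined with a logical AND, and like subset() it drops rows where the condition is NA. A minimal sketch follows, assuming dplyr is installed. The requireNamespace() guard, stringsAsFactors = FALSE, and the name res are my additions for illustration.

```r
# Recreate the article's data frame so this snippet is self-contained
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names = c('r1','r2','r3','r4','r5','r6','r7','r8'),
  stringsAsFactors = FALSE  # assumption: added for consistent behavior on older R
)

# Run only when dplyr is available (guard added for this sketch)
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Comma-separated conditions are combined with AND
  res <- dplyr::filter(df, gender == "M", id > 13)
  print(res)
}
```

This call selects the same rows as subset(df, gender == 'M' & id > 13).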

5. Complete Example of R Subset Data Frame


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
df

# subset by row name
subset(df, subset=rownames(df) == 'r1') 

# subset row by vector of row names
subset(df, rownames(df) %in% c('r1','r2','r3'))

# subset by condition
subset(df, gender == 'M')

# subset by condition with %in%
subset(df, state %in% c('CA','DC'))

# subset by multiple conditions using |
subset(df, gender == 'M' | state == 'PH')

# subset by multiple conditions using &
subset(df, gender == 'M' & state %in% c('CA','NY'))

# subset Rows by Index
df[3,]

# subset Rows by Vector of Index Values
df[c(3,4,6),]

# subset Rows by Index Range
df[3:6,]

# subset Rows by column value
df[df$gender == 'M',]

# subset Rows by vector of Values
df[df$state %in% c('CA','AZ','PH'),]

# subset Rows by Checking multiple conditions
df[df$gender == 'M' & df$id > 15,]

# Using dplyr::filter
dplyr::filter(df, state %in% c("CA", "AZ", "PH"))

# Subset columns by Name
subset(df,gender=='M',select=c('id','name','gender'))

# subset columns by Index
subset(df,gender=='M',select=c(1,2,3))

# subset columns with indices 2 & 3
df[,c(2,3)]

# subset columns name and gender
df[,c('name','gender')]

6. Conclusion

In this article, you have learned how to subset a data frame by rows and columns in R using the subset() function, filter() from the dplyr package, and df[] notation.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.