R select() Function from dplyr – Usage with Examples

select() is a function from the dplyr R package that selects data frame variables by name or index. It can also rename variables while selecting them and drop variables by name. In this article, I will explain the syntax of select() and show its usage with examples such as selecting specific variables by name or position, selecting variables from a vector of names, and more. Note that in R, columns are referred to as variables and rows as observations.

dplyr is an R package that provides a grammar of data manipulation: a consistent set of verbs that help solve the most common data manipulation tasks. To use it, install the package with install.packages('dplyr') and load it with library(dplyr).

If you only need to change variable names, see rename data frame columns in R.

1. dplyr select() Syntax

Following is the syntax of the select() function of the dplyr package in R. It returns a data frame containing only the selected columns (variables).


# Syntax of select()
# .data : a data frame (or tibble)
# ...   : columns to select, given as names, positions, or helper calls
select(.data, ...)
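To make the call shape concrete, here is a minimal sketch that calls select() directly, without the pipe, and also renames a column while selecting it. The data frame is a cut-down copy of the one used in this article, and full_name is a made-up column name used only for illustration.

```r
library(dplyr)

# Small illustrative data frame (values borrowed from this article's example)
df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F')
)

# Data frame first, then the columns to keep
picked <- select(df, name, gender)

# new_name = old_name renames a column while selecting it
# (full_name is a hypothetical name chosen for this sketch)
renamed <- select(df, full_name = name, gender)

names(picked)
names(renamed)
```

Only the columns listed in the call are kept, so id is absent from both results.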

Let’s create an R data frame, run these examples, and explore the output. If you already have data in a CSV file, you can import the CSV file into an R data frame. Also, refer to Import Excel File into R.


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  gender = c('M','M','F','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  state = c('CA','NY','DE',NA),
  row.names=c('r1','r2','r3','r4')
)
df

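Yields below output.

# Output:
#    id    name gender        dob state
# r1 10     sai      M 1990-10-02    CA
# r2 11     ram      M 1981-03-24    NY
# r3 12 deepika      F 1987-06-14    DE
# r4 13 sahithi      F 1985-08-16  <NA>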

2. Select Variables by Index Position

To select columns of an R data frame, you can use the select() function of the dplyr package together with the %>% operator. %>% is the pipe operator from the magrittr package, which dplyr re-exports; it chains multiple operations so that they run in sequence.

Here, the operator takes the data frame df and passes it as the first argument to select(), which then returns the specified columns.

  • df %>% select(2,3) returns a data frame with the 2nd and 3rd columns. Remember that R indexes start at 1.
  • df %>% select(c(2,3)) passes the column indexes as a vector and returns the corresponding columns.
  • df %>% select(2:3) returns the columns in the specified index range.

Let’s pass column indexes to this function and get the corresponding columns.


# Load dplyr 
library('dplyr')

# Select columns
df %>% select(2,3)

# Select columns by list of index or position
df %>% select(c(2,3))

# Select columns by index range
df %>% select(2:3)

All three calls yield the below output.

# Output:
#       name gender
# r1     sai      M
# r2     ram      M
# r3 deepika      F
# r4 sahithi      F
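As a quick check on how the pipe works, the sketch below compares the three index forms with each other and with a direct call that does not use %>%. It rebuilds a shortened version of the article's data frame so that it can run on its own; the variable names by_args, by_vector, by_range, and no_pipe are my own.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  state = c('CA', 'NY', 'DE', NA)
)

by_args   <- df %>% select(2, 3)     # positions as separate arguments
by_vector <- df %>% select(c(2, 3))  # positions as one vector
by_range  <- df %>% select(2:3)      # positions as a range
no_pipe   <- select(df, 2, 3)        # same call written without the pipe

# Each comparison checks that two forms return the same data frame
identical(by_args, by_vector)
identical(by_args, by_range)
identical(by_args, no_pipe)
```

The last comparison shows that df %>% select(2, 3) is just another way of writing select(df, 2, 3).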

3. Select Variables by Name

You can also select variables (columns) by name. There are two ways to do this: pass the column names to select() separated by commas, or put the names in a vector and pass that vector to select(). Either way, select() returns a data frame with only the named columns.


# Select the name and gender columns by name
df %>% select('name','gender')
df %>% select(c('name','gender'))

Yields the same output as the previous example.
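When the column names live in a separate character vector, which is the usual case for "selecting from a list of names", the tidyselect helpers all_of() and any_of() make the intent explicit. The following is a small sketch under that assumption; the vector name cols and the column name salary are hypothetical.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F')
)

# Column names stored in a character vector (cols is a made-up variable name)
cols <- c('name', 'gender')

# all_of() selects exactly these columns and errors if one is missing
strict <- df %>% select(all_of(cols))

# any_of() silently skips names that are not in the data frame
# ('salary' is a hypothetical column that does not exist in df)
lenient <- df %>% select(any_of(c('name', 'salary')))

names(strict)
names(lenient)
```

all_of() is the safer choice when a missing column should be treated as a bug, while any_of() suits optional columns.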

4. Drop Variables

By using select() you can also drop columns from a data frame by name. To drop variables, put the - (negation) operator in front of the variables, specified here as a vector. select() returns a new data frame without those variables.


# Select columns except name & gender
df %>% select(-c('name','gender'))

# Output:
#    id        dob state
# r1 10 1990-10-02    CA
# r2 11 1981-03-24    NY
# r3 12 1987-06-14    DE
# r4 13 1985-08-16  <NA>
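Dropping also works with unquoted names, and recent dplyr versions (1.0.0 and later) accept ! as a complement operator in place of -. The sketch below shows both on a shortened copy of the article's data frame; treat it as an illustration rather than the only way to drop columns.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  state = c('CA', 'NY', 'DE', NA)
)

# Drop a single column with an unquoted name
no_id <- df %>% select(-id)

# Drop several columns with the complement operator !
rest <- df %>% select(!c(name, gender))

names(no_id)
names(rest)
```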

5. Select All Variables Between 2 Variables

You can also select all variables between two variables by using the range operator (:). The left-hand side of the operator is the starting variable and the right-hand side is the ending variable. The following example selects all variables from name to state, including both endpoints.


# Select columns between name and state
df %>% select('name':'state')

# Output:
#       name gender        dob state
# r1     sai      M 1990-10-02    CA
# r2     ram      M 1981-03-24    NY
# r3 deepika      F 1987-06-14    DE
# r4 sahithi      F 1985-08-16  <NA>
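A range can also be written with unquoted names, and it can be negated to keep everything outside the range. Here is a minimal sketch on a copy of the article's data frame without row names; the result names in_range and outside are my own.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  gender = c('M', 'M', 'F', 'F'),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16')),
  state = c('CA', 'NY', 'DE', NA)
)

# Columns from name through dob, endpoints included
in_range <- df %>% select(name:dob)

# Everything except the columns from name through dob
outside <- df %>% select(!(name:dob))

names(in_range)
names(outside)
```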

6. Select All Variables that start with

Use starts_with() along with select() to get all variables whose names start with a given string. The following example selects all variables that start with the string gen.


# Select columns starts with a string
df %>% select(starts_with('gen'))

# Output:
#    gender
# r1      M
# r2      M
# r3      F
# r4      F
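starts_with() also accepts a vector of prefixes, and it can be negated to exclude matching columns. The sketch below illustrates both on a shortened copy of the article's data frame; the prefixes 'na' and 'st' are chosen only for this example.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11),
  name = c('sai', 'ram'),
  gender = c('M', 'M'),
  state = c('CA', 'NY')
)

# Several prefixes at once: a column is kept if it matches any of them
multi_prefix <- df %>% select(starts_with(c('na', 'st')))

# Everything except the columns that start with 'gen'
not_gen <- df %>% select(-starts_with('gen'))

names(multi_prefix)
names(not_gen)
```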

7. Select All Variables that end with

Use ends_with() along with select() to get all variables whose names end with a given string. The following example selects all variables that end with the string e.


# Select columns that ends with a string
df %>% select(ends_with('e'))

# Output:
#       name state
# r1     sai    CA
# r2     ram    NY
# r3 deepika    DE
# r4 sahithi  <NA>

8. Select Variables containing character

In case you want to select all variables whose names contain a character or string, use contains(). The following example selects all variables that contain the character a.


# Select columns whose names contain a string
df %>% select(contains('a'))

# Output:
#       name state
# r1     sai    CA
# r2     ram    NY
# r3 deepika    DE
# r4 sahithi  <NA>
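Two details are worth a quick sketch: the string helpers ignore case by default, and matches() takes a regular expression when a plain substring is not enough. The example assumes a shortened copy of the article's data frame, and the pattern '^(id|dob)$' is only an illustration.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11),
  name = c('sai', 'ram'),
  dob = as.Date(c('1990-10-02', '1981-03-24')),
  state = c('CA', 'NY')
)

# contains() is case-insensitive by default, so 'A' still matches name and state
any_case <- df %>% select(contains('A'))

# With ignore.case = FALSE, no column name contains an upper-case 'A'
exact_case <- df %>% select(contains('A', ignore.case = FALSE))

# matches() selects by regular expression: here, columns named exactly id or dob
by_regex <- df %>% select(matches('^(id|dob)$'))

names(any_case)
ncol(exact_case)
names(by_regex)
```

Note that a helper that matches nothing does not raise an error; it returns a data frame with zero columns.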

9. Select All Numeric Variables

Selecting all numeric variables is one of the most common operations. If a data frame mixes string and numeric variables, running certain statistical operations on the entire data frame results in an error. To avoid this, first select the numeric columns and then run the operation on the result. The example uses select_if(); in dplyr 1.0.0 and later this function is superseded by select(where(is.numeric)), but it still works.


# Select all numeric columns
df %>% select_if(is.numeric)

# Output:
#    id
# r1 10
# r2 11
# r3 12
# r4 13
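For reference, the same selection can be written with where(), which is the form the dplyr documentation recommends from version 1.0.0 onward. The sketch below assumes dplyr 1.0.0 or later and then runs colMeans() on the result as one example of a numeric-only operation.

```r
library(dplyr)

df <- data.frame(
  id = c(10, 11, 12, 13),
  name = c('sai', 'ram', 'deepika', 'sahithi'),
  dob = as.Date(c('1990-10-02', '1981-03-24', '1987-06-14', '1985-08-16'))
)

# where() keeps the columns for which the predicate function returns TRUE
numeric_only <- df %>% select(where(is.numeric))

# A numeric-only operation now works on the selected columns
means <- colMeans(numeric_only)

names(numeric_only)
means
```

The dob column is left out because is.numeric() returns FALSE for Date values.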

10. Complete Example


# Create DataFrame
df <- data.frame(
  id = c(10,11,12,13),
  name = c('sai','ram','deepika','sahithi'),
  gender = c('M','M','F','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16')),
  state = c('CA','NY','DE',NA),
  row.names=c('r1','r2','r3','r4')
)
df

# Load dplyr 
library('dplyr')

# Select columns by list of index or position
df %>% select(c(2,3))
# Select columns by index range
df %>% select(2:3)


# Select columns by label name & gender
df %>% select(c('name','gender'))
df %>% select('name','gender')

# Select columns except name & gender
df %>% select(-c('name','gender'))

# Select columns between name and state
df %>% select('name':'state')

# Select columns starts with a string
df %>% select(starts_with('gen'))

# Select columns not start with a string
df %>% select(-starts_with('gen'))

# Select columns that ends with a string
df %>% select(ends_with('e'))

# Select columns that contains
df %>% select(contains('a'))

# Select all numeric columns
df %>% select_if(is.numeric)

Frequently Asked Questions on R select() Function from dplyr

What does select() do in dplyr?

select() is used to get specific columns from a data frame. It returns a new data frame that has only the columns you specify.

How do I use select() to choose specific columns?

You can use select() by specifying the data frame and selecting the columns you want to get. For example, df %>% select(col1, col2, col3)

How can I use select() to exclude columns?

Put a minus sign (-) before the columns you want to exclude in the select() call. For example, df %>% select(-c('col1','col2'))

How can I select columns based on a pattern?

Use starts_with(), ends_with(), contains(), etc., to select columns based on a pattern. For example, df %>% select(starts_with("prefix"))

How can I use select() with pipe operator %>%?

Put the data frame on the left of the pipe operator and the select() call on the right; the data frame is passed as the first argument. For example, df %>% select(col1, col2, col3)

11. Conclusion

In this article, you have learned about the select() function from the dplyr package and how to use it to select variables (columns) of an R data frame by index or name. You have also learned several other ways to select variables, such as by range, by name pattern, and by type.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.