Select Columns by Name in R
When analyzing a data set, sometimes you only want to include columns (or variables) that meet specific criteria. In a recent post, I shared how to extract columns based on data type. This post examines several techniques for selecting columns from a data frame in the R environment by name.
SET UP
First, let's load the tidyverse packages and generate some fake data. We are going to create a data frame with four variables and 100 observations:
ID: An identification variable (numbers range from 1 to 100).
Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man.
Low_Income: A binary variable indicating whether a participant was raised in a low-income household (1) or not (0).
HS_Grad: A binary variable indicating whether a participant graduated from high school (1) or not (0).
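The original code for this step is not reproduced here, so the following is only a minimal sketch of how such a data set could be generated. The seed, the use of sample(), and the assumption that ID is a simple sequence with one value per row are all illustrative choices, not necessarily what the post's downloadable syntax does.

```r
library(tidyverse)

set.seed(123)  # arbitrary seed (assumption), only so the fake data are reproducible

n <- 100  # number of observations

fake_dat <- tibble(
  ID = 1:n,  # assumed: one sequential ID per observation
  Gender = sample(c("Woman", "Non-Binary", "Man"), n, replace = TRUE),
  Low_Income = sample(c(0, 1), n, replace = TRUE),  # 1 = low-income household
  HS_Grad = sample(c(0, 1), n, replace = TRUE)      # 1 = graduated from high school
)
```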
Next, print the names of columns in the data frame:
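One way to do this is with the base R names() function. The small data frame below is a hypothetical stand-in for fake_dat (made-up values, six rows) so that the snippet runs on its own.

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Return the column names as a character vector
col_names <- names(fake_dat)
print(col_names)
```

This should list the four names: ID, Gender, Low_Income, and HS_Grad.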
In this first section, I will focus on how to select a single column. The second part of the post will share options for selecting multiple columns by name.
Selecting One Column – Base R
Option #1: (Single) Square Bracket ([ ]) Operator
In Base R, the simplest way to extract a column by name is to place the name enclosed in double quotes within square brackets. For example, you could select the "Gender" column from the fake_dat data set, like so:
(For brevity, only the first five rows will be shown.)
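A minimal sketch of the bracket approach, using a hypothetical stand-in for fake_dat (made-up values) and head() to limit the printout to five rows:

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Single square brackets with a quoted name return a one-column data frame
gender_col <- fake_dat["Gender"]

# Show only the first five rows
gender_head <- head(gender_col, 5)
print(gender_head)
```

Note that single brackets keep the result as a data frame rather than dropping it to a plain vector.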
Option #2: subset
Although primarily used for filtering data, the subset function can also select columns by name. And all it takes is two arguments:
x: an object to be subsetted and
select: an expression where you can indicate which columns to select from a data frame.
So, we can rewrite the syntax from option 1 to select the Gender column like so:
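As a sketch (again with a hypothetical stand-in for fake_dat), the call might look like this. The quoted name mirrors option 1; subset also accepts the unquoted form, select = Gender.

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# x is the data frame; select names the column to keep
gender_col <- subset(x = fake_dat, select = "Gender")
print(head(gender_col, 5))
```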
Selecting One Column – Tidyverse
Option #3: select
In the tidyverse, the best way to select a single column is to use the select function from dplyr. The select function extracts columns from a data frame by their name or their properties (e.g., type). To select a column by name, simply:
reference the data frame (or tibble) containing the column followed by the pipe operator ( %>% );
type the function name select; and
list the column's name within the parentheses.
So, if you wanted to select the Gender column from our created data set, you could write:
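A minimal sketch of those steps, assuming dplyr is installed and using a hypothetical stand-in for fake_dat:

```r
library(dplyr)

# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Pipe the data frame into select() and name the column (no quotes needed)
gender_col <- fake_dat %>%
  select(Gender)

print(head(gender_col, 5))
```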
But what if you want to select more than one column by name? Say we wanted to select both the Gender and the HS_Grad columns. Luckily, you have many options at your disposal.
Selecting Multiple Columns – Base R
Option #4: (Single) Square Bracket ([ ]) Operator
You can still use single square brackets to select multiple columns in Base R. To do so, enclose each name in quotation marks and combine the names with the c() function, separating each element with a comma.
For example, to select both Gender and HS_Grad, you can write:
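A minimal sketch, using a hypothetical stand-in for fake_dat:

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Combine the quoted names with c() inside single square brackets
two_cols <- fake_dat[c("Gender", "HS_Grad")]
print(head(two_cols, 5))
```

The columns come back in the order given inside c().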
Option #5: subset
The subset function can also select multiple columns by name. The form is the same as before: enclose each name in quotation marks, combine the names with the c() function, and separate each element with a comma. With that, we can accomplish the task using subset:
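A sketch of that call, using a hypothetical stand-in for fake_dat. Quoted names are used here to match the description; subset also accepts unquoted names inside c().

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Pass a vector of column names to the select argument
two_cols <- subset(fake_dat, select = c("Gender", "HS_Grad"))
print(head(two_cols, 5))
```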
Option #6: vector of names + (Single) Square Bracket
You can also store the names of the columns you wish to select in a vector before extracting them from a data frame. You can then use this object to select your columns like so:
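For instance, with a hypothetical stand-in for fake_dat and a vector named cols_to_keep (the name is illustrative, not from the original post):

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Store the names first (cols_to_keep is an illustrative name)
cols_to_keep <- c("Gender", "HS_Grad")

# Then pass the vector to single square brackets
two_cols <- fake_dat[cols_to_keep]
print(head(two_cols, 5))
```

Keeping the names in one object is handy when the same set of columns is reused in several places.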
Option #7: which and %in%
If you are looking for a more creative, albeit more complex, solution, you can also turn to the which function and the %in% operator. The which function identifies the positions of the elements within an object that meet a given condition. The %in% operator, in turn, evaluates whether elements of one vector exist in another vector and returns a logical vector (TRUE / FALSE) indicating whether a match was found.
Using both which and %in%, you can select the Gender and HS_Grad columns by:
creating a vector of column names to select;
writing a which call that returns the positions of those columns among the data frame's names;
referencing the name of the data frame followed by square brackets; and
enclosing the which call (or its stored result) within the square brackets:
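The steps above can be sketched as follows, using a hypothetical stand-in for fake_dat. The object names cols_to_keep and col_positions are illustrative.

```r
# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# 1. Vector of column names to select
cols_to_keep <- c("Gender", "HS_Grad")

# 2. %in% flags which names match; which() turns the flags into positions
col_positions <- which(names(fake_dat) %in% cols_to_keep)  # positions 2 and 4 here

# 3-4. Index the data frame with those positions
two_cols <- fake_dat[col_positions]
print(head(two_cols, 5))
```

Because the selection is by position, the columns come back in the order they appear in the data frame, not the order listed in cols_to_keep.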
Selecting Multiple Columns – Tidyverse
Option #8: select
The select function from dplyr is still your best bet for extracting multiple columns from a data frame in the tidyverse. To use this method:
reference the data frame (or tibble) followed by the pipe operator ( %>% );
type the name select and, within parentheses, list the names of the columns separated by a comma.
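A minimal sketch for Gender and HS_Grad, assuming dplyr is installed and using a hypothetical stand-in for fake_dat:

```r
library(dplyr)

# Hypothetical stand-in for fake_dat so the snippet is self-contained
fake_dat <- data.frame(
  ID = 1:6,
  Gender = c("Woman", "Man", "Non-Binary", "Woman", "Man", "Woman"),
  Low_Income = c(1, 0, 0, 1, 1, 0),
  HS_Grad = c(1, 1, 0, 1, 0, 1)
)

# Pipe the data frame into select() and list the columns, separated by a comma
two_cols <- fake_dat %>%
  select(Gender, HS_Grad)

print(head(two_cols, 5))
```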
There are so many ways to select columns by name in R, and this post has only covered a few of the most common. What is your favorite method? Share your code in the comments below.