Select Columns by Name in R


Two images. Image 1 (left) is of a data set with four columns labelled A,B,C, and D, and columns A and C are highlighted in a blue color. Image 2 (right) is a subset of image 1 and only includes those columns that are highlighted in blue.

When analyzing a data set, sometimes you only want to include columns (or variables) that meet specific criteria. In a recent post, I shared how to extract columns based on data type. This post examines several techniques for selecting columns from a data frame in the R environment by name.

 

SET UP

First, let's load the tidyverse packages and generate some fake data. We are going to create a data frame with four variables and 100 observations:

  1. ID: An identification variable (numbers range from 1 to 100).

  2. Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man.

  3. Low_Income: A binary variable indicating whether a participant was raised in a low-income household (1) or not (0).

  4. HS_Grad: A binary variable indicating whether a participant graduated from high school (1) or not (0).

#### SET UP ####
# create notin operator
`%notin%` <- Negate(`%in%`)

# Download packages if not available
pckgs <- c("tidyverse")

if (any(pckgs %notin% rownames(installed.packages()))) {
  install.packages(pckgs, repos = c(CRAN = "https://cloud.r-project.org"))
}

# Load packages
sapply(pckgs, FUN = require, character.only = TRUE)

#### Create Data Set ####
set.seed(0721)
fake_dat <- 
  tibble(ID = 1:100, 
         Gender = sample(x = c("Woman", "Non-Binary", "Man", NA_character_), 
                         size = 100, replace = TRUE, 
                         prob = c(0.45, 0.20, 0.20, 0.15)),
         Low_Income = sample(x = c(0, 1, NA_real_), size = 100, replace = TRUE, 
                             prob = c(0.70, 0.15, 0.15)),
         HS_Grad = sample(x = c(0, 1, NA_real_), size = 100, replace = TRUE, 
                          prob = c(0.60, 0.25, 0.15)))

Next, print the names of columns in the data frame:

names(fake_dat)
[1] "ID"         "Gender"     "Low_Income" "HS_Grad"

In this first section, I will focus on how to select a single column. The second part of the post will share options for selecting multiple columns by name.

 

Selecting One Column – Base R

Option #1: (Single) Square Bracket ([ ]) Operator

In Base R, the simplest way to extract a column by name is to place the name enclosed in double quotes within square brackets. For example, you could select the "Gender" column from the fake_dat data set, like so:

fake_dat["Gender"]
    Gender    
    <chr>     
 1  Woman     
 2  Woman     
 3  Woman     
 4  Non-Binary
 5  Non-Binary

(For brevity, only the first five rows will be shown.)
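Single square brackets keep the result as a one-column data frame. If you want the column's values as a plain vector instead, double square brackets or the $ operator do that. The sketch below uses a small hypothetical data frame, toy_dat, which stands in for fake_dat and is not part of the original post:

```r
# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID      = 1:3,
  Gender  = c("Woman", "Non-Binary", "Man"),
  HS_Grad = c(0, 1, NA),
  stringsAsFactors = FALSE
)

one_col_df  <- toy_dat["Gender"]    # single brackets: still a data frame
one_col_vec <- toy_dat[["Gender"]]  # double brackets: the column as a vector
dollar_vec  <- toy_dat$Gender       # `$` also returns the column as a vector

class(one_col_df)   # "data.frame"
class(one_col_vec)  # "character"
```

Which form you want depends on what comes next: functions that expect a data frame need the single-bracket result, while functions such as table() or mean() usually want the vector.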

 

Option #2: subset

Although primarily used for filtering rows, the subset function can also select columns by name. It needs only two arguments:

  1. x: an object to be subsetted and

  2. select: an expression where you can indicate which columns to select from a data frame.

So, we can rewrite the syntax from option 1 to select the Gender column like so:

subset(fake_dat, select = "Gender")
    Gender    
    <chr>     
 1  Woman     
 2  Woman     
 3  Woman     
 4  Non-Binary
 5  Non-Binary
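Because select is evaluated as an expression, subset also accepts the column name without quotes, and it can be combined with a row filter in the same call. A minimal sketch, using a hypothetical toy_dat data frame rather than the post's fake_dat:

```r
# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID      = 1:4,
  Gender  = c("Woman", "Man", "Woman", NA),
  HS_Grad = c(1, 0, NA, 1),
  stringsAsFactors = FALSE
)

# Unquoted column name in `select`
gender_only <- subset(toy_dat, select = Gender)

# Row filter and column selection together;
# rows where the condition evaluates to NA are dropped
grads_gender <- subset(toy_dat, subset = HS_Grad == 1, select = Gender)
```

Here grads_gender keeps only the rows where HS_Grad equals 1 and returns them as a one-column data frame.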

Selecting One Column – Tidyverse 

Option #3: select

In the tidyverse, the best way to select a single column is to use the select function from dplyr. The select function extracts columns from a data frame by their name or their properties (e.g., type). To select a column by name, simply:

  • reference the data frame (or tibble) containing the column followed by the pipe operator ( %>% );

  • type the name select and, within parentheses, list the column's name.

So, if you wanted to select the Gender column from our created data set, you could write:

fake_dat %>% select(Gender)
    Gender    
    <chr>     
 1  Woman     
 2  Woman     
 3  Woman     
 4  Non-Binary
 5  Non-Binary
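As with single square brackets, select returns a data frame even when only one column is requested. If you need the values as a vector, dplyr's pull function is the usual companion. The sketch below assumes dplyr is installed and uses a hypothetical toy_dat in place of fake_dat:

```r
library(dplyr)

# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID      = 1:3,
  Gender  = c("Woman", "Non-Binary", "Man"),
  stringsAsFactors = FALSE
)

gender_df     <- toy_dat %>% select(Gender)    # bare name: one-column data frame
gender_quoted <- toy_dat %>% select("Gender")  # a quoted name is accepted too
gender_vec    <- toy_dat %>% pull(Gender)      # pull: the column as a vector
```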

But what if you want to select more than one column by name? Say we wanted to select both the Gender and the HS_Grad columns. Luckily, you have many options at your disposal.

 

Selecting Multiple Columns – Base R

Option #4: (Single) Square Bracket ([ ]) Operator

You can still use single square brackets to select multiple columns in Base R. However, you must enclose each name in quotation marks and combine the names with the c() function, separating each element with a comma.

For example, to select both Gender and HS_Grad, you can write:

fake_dat[c("Gender", "HS_Grad")]
     Gender      HS_Grad
     <chr>         <dbl>
 1   Woman             0
 2   Woman             0
 3   Woman             1
 4   Non-Binary        0
 5   Non-Binary       NA

Option #5: subset

The subset function can also select multiple columns by name. The form is the same as before: enclose each name in quotation marks, combine the names using the c() function, and separate each element with a comma. With subset, that looks like this:

subset(fake_dat, select = c("Gender", "HS_Grad"))
     Gender      HS_Grad
     <chr>         <dbl>
 1   Woman             0
 2   Woman             0
 3   Woman             1
 4   Non-Binary        0
 5   Non-Binary       NA

Option #6: vector of names + (Single) Square Bracket

You can also store the names of the columns you wish to select in a vector before extracting them from a data frame. You can then use this object to select your columns like so:

names_to_select <- c("Gender", "HS_Grad")
fake_dat[names_to_select]
     Gender      HS_Grad
     <chr>         <dbl>
 1   Woman             0
 2   Woman             0
 3   Woman             1
 4   Non-Binary        0
 5   Non-Binary       NA
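One thing to watch with a stored vector of names: if any name is not in the data frame, single-bracket selection stops with an "undefined columns selected" error. A minimal sketch of checking the names first, using a hypothetical toy_dat and a deliberately absent column name ("Age") that is not in the original post:

```r
# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID      = 1:3,
  Gender  = c("Woman", "Non-Binary", "Man"),
  HS_Grad = c(0, 1, NA),
  stringsAsFactors = FALSE
)

# "Age" is deliberately not a column of toy_dat
names_to_select <- c("Gender", "HS_Grad", "Age")

# Names requested but not found in the data frame
missing_names <- setdiff(names_to_select, names(toy_dat))

# Names that can safely be used for selection
present_names <- intersect(names_to_select, names(toy_dat))
selected <- toy_dat[present_names]
```

Whether to drop the missing names silently, as here, or to stop and report them is a design choice that depends on the analysis.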

Option #7: which and %in%

If you are looking for a more creative, albeit more complex, solution, you can also turn to the which function and the %in% operator. The which function returns the positions of the elements in a logical vector that are TRUE. The %in% operator evaluates whether each element of one vector exists in another vector and returns a logical vector (TRUE / FALSE) indicating whether a match was found.

Using both which and %in%, you can select the Gender and HS_Grad columns by:

  1. creating a vector of column names to select;

  2. using which together with %in% to return the position of each of those columns in the data frame;

  3. referencing the name of the data frame followed by square brackets; and

  4. enclosing the vector of column positions within the square brackets:

names_to_select <- c("Gender", "HS_Grad")
which_names <- which(names(fake_dat) %in% names_to_select)
fake_dat[which_names]
     Gender      HS_Grad
     <chr>         <dbl>
 1   Woman             0
 2   Woman             0
 3   Woman             1
 4   Non-Binary        0
 5   Non-Binary       NA
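A detail worth knowing about this approach: which combined with %in% returns positions in the data frame's own column order, not in the order the names were requested. If the requested order matters, base R's match function is one alternative. The sketch below uses a hypothetical toy_dat and asks for the columns in reverse order to make the difference visible:

```r
# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID         = 1:3,
  Gender     = c("Woman", "Non-Binary", "Man"),
  Low_Income = c(0, 1, 0),
  HS_Grad    = c(0, 1, NA),
  stringsAsFactors = FALSE
)

# Requested in the opposite order from how they appear in toy_dat
names_to_select <- c("HS_Grad", "Gender")

# which + %in%: positions follow the data frame's column order
by_which <- toy_dat[which(names(toy_dat) %in% names_to_select)]

# match: positions follow the order of names_to_select
by_match <- toy_dat[match(names_to_select, names(toy_dat))]

names(by_which)  # "Gender"  "HS_Grad"
names(by_match)  # "HS_Grad" "Gender"
```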

Selecting Multiple Columns – Tidyverse

Option #8: select

select from dplyr is still your best bet for extracting multiple columns from a data frame in the tidyverse. To use this method:

  • reference the data frame (or tibble) followed by the pipe operator ( %>% );

  • type the name select and, within parentheses, list the names of the columns separated by commas (wrapping them in c(), as in the code that follows, is optional):

fake_dat %>% 
select(c(Gender, HS_Grad))
     Gender      HS_Grad
     <chr>         <dbl>
 1   Woman             0
 2   Woman             0
 3   Woman             1
 4   Non-Binary        0
 5   Non-Binary       NA
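If the column names are stored in a character vector, as in Option #6, select can use them through the all_of and any_of helpers, which dplyr re-exports from tidyselect. This is a sketch with a hypothetical toy_dat, and the "Age" name is deliberately absent to show the difference between the two helpers:

```r
library(dplyr)

# Hypothetical toy data frame standing in for fake_dat
toy_dat <- data.frame(
  ID      = 1:3,
  Gender  = c("Woman", "Non-Binary", "Man"),
  HS_Grad = c(0, 1, NA),
  stringsAsFactors = FALSE
)

names_to_select <- c("Gender", "HS_Grad")

# all_of: strict; errors if any name is missing from the data frame
strict <- toy_dat %>% select(all_of(names_to_select))

# any_of: lenient; silently skips names that are not present ("Age" here)
lenient <- toy_dat %>% select(any_of(c(names_to_select, "Age")))
```

all_of is the safer default when a missing column should be treated as a mistake; any_of is handy when the same code runs over data sets that do not all share the same columns.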

There are so many ways to select columns by name in R, and this post has only covered a few of the most common. What is your favorite method? Share your code in the comments below.
