Analytics Made Accessible


Select Columns by Name in R

**Download the R syntax and data used in this post**

When analyzing a data set, sometimes you only want to include columns (or variables) that meet specific criteria. In a recent post, I shared how to extract columns based on data type. This post examines several techniques for selecting columns from a data frame in the R environment by name.

 

SET UP

First, let's load the tidyverse packages and generate some fake data. We are going to create a data frame with four variables and 100 observations:

  1. ID: An identification variable (numbers range from 1 to 100).

  2. Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man.

  3. Low_Income: A binary variable indicating whether a participant was raised in a low-income household (1) or not (0).

  4. HS_Grad: A binary variable indicating whether a participant graduated from high school (1) or not (0).

See this content in the original post
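As a rough stand-in for the setup step, here is a minimal sketch of how such a data frame could be built. The seed, the sampling calls, and the assumption that ID simply numbers all 100 rows are mine, not the original listing; only the column names and their meanings come from the description above.

```r
# Sketch only: dplyr is a core tidyverse package, so library(tidyverse)
# would load it as well. Nothing below actually requires it yet.
library(dplyr)

set.seed(123)  # assumed seed so the fake data is reproducible
n <- 100       # number of observations described in the post

fake_dat <- data.frame(
  ID         = seq_len(n),                                        # assumed: one ID per row
  Gender     = sample(c("Woman", "Non-Binary", "Man"), n, replace = TRUE),
  Low_Income = sample(0:1, n, replace = TRUE),                    # 1 = low-income household
  HS_Grad    = sample(0:1, n, replace = TRUE)                     # 1 = graduated high school
)

str(fake_dat)  # quick look at the structure
```

The later sketches in this post rebuild the same hypothetical fake_dat so that each one can be run on its own.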

Next, print the names of columns in the data frame:

See this content in the original post
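One way to print the column names, sketched with base R's names function. The data frame is rebuilt here with an assumed seed and sampling scheme so the snippet runs on its own.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

col_names <- names(fake_dat)  # colnames(fake_dat) gives the same result
print(col_names)
# [1] "ID"         "Gender"     "Low_Income" "HS_Grad"
```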

In this first section, I will focus on how to select a single column. The second part of the post will share options for selecting multiple columns by name.

 

Selecting One Column – Base R

Option #1: (Single) Square Bracket ([ ]) Operator

In Base R, the simplest way to extract a column by name is to place the name enclosed in double quotes within square brackets. For example, you could select the "Gender" column from the fake_dat data set, like so:

See this content in the original post

(For brevity, only the first five rows will be shown.)
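A minimal sketch of the single-bracket approach, assuming the hypothetical fake_dat built below (the seed and sampling are my assumptions). Note that single brackets return a one-column data frame rather than a bare vector.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Quoted column name inside single square brackets
gender_col <- fake_dat["Gender"]

head(gender_col, 5)  # show only the first five rows
```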

 

Option #2: subset

Although primarily used for filtering rows, the subset function can also select columns by name. All it takes is two arguments:

  1. x: an object to be subsetted and

  2. select: an expression where you can indicate which columns to select from a data frame.

So, we can rewrite the syntax from option 1 to select the Gender column like so:

See this content in the original post
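The subset version might look like the sketch below, again on a hypothetical fake_dat with an assumed seed. Inside select, the column name can be written bare; a quoted name such as "Gender" works as well.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# x is the data frame; select names the column to keep
gender_sub <- subset(x = fake_dat, select = Gender)

head(gender_sub, 5)  # first five rows only
```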

Selecting One Column – Tidyverse 

Option #3: select

In the tidyverse, the usual way to select a single column is the select function from dplyr. The select function extracts columns from a data frame by name or by their properties (e.g., type). To select a column by name:

  • reference the data frame (or tibble) containing the column, followed by the pipe operator ( %>% );

  • type select and, within parentheses, list the column's name.

So, if you wanted to select the Gender column from our created data set, you could write:

See this content in the original post
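Following those steps, a dplyr sketch could read as below. It loads dplyr directly (library(tidyverse) would also work) and rebuilds a hypothetical fake_dat with an assumed seed.

```r
library(dplyr)  # provides select() and the %>% pipe

set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Data frame, then pipe, then select() with the column name
gender_tidy <- fake_dat %>%
  select(Gender)

head(gender_tidy, 5)  # first five rows only
```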

But what if you want to select more than one column by name? Say we wanted to select both the Gender and the HS_Grad columns. Luckily, you have many options at your disposal.

 

Selecting Multiple Columns – Base R

Option #4: (Single) Square Bracket ([ ]) Operator

You can still use single square brackets to select multiple columns in Base R. However, you must enclose each name in quotation marks and combine the names with the c() function, separating each element with a comma.

For example, to select both Gender and HS_Grad, you can write:

See this content in the original post
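As an illustration, here is a minimal sketch of the bracket approach with two names, using a hypothetical fake_dat built with an assumed seed.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Quoted names combined with c() inside single square brackets
two_cols <- fake_dat[c("Gender", "HS_Grad")]

head(two_cols, 5)  # first five rows only
```

Writing fake_dat[, c("Gender", "HS_Grad")] with a leading comma selects the same two columns.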

Option #5: subset

The subset function can also select multiple columns by name. Using the same form (enclosing each name in quotation marks and combining the names with the c() function, separated by commas), we can accomplish the task using subset:

See this content in the original post
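A sketch of what that subset call could look like, assuming the same hypothetical fake_dat and seed as the other examples:

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Quoted names combined with c() and passed to the select argument
two_sub <- subset(fake_dat, select = c("Gender", "HS_Grad"))

head(two_sub, 5)  # first five rows only
```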

Option #6: vector of names + (Single) Square Bracket

You can also store the names of the columns you wish to select in a vector before extracting them from a data frame. You can then use this object to select your columns like so:

See this content in the original post
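For instance, a minimal sketch with the names stored first. The object name cols_to_keep is hypothetical, as are the seed and the way fake_dat is generated.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Store the column names in a character vector (hypothetical name)
cols_to_keep <- c("Gender", "HS_Grad")

# Use the stored vector inside single square brackets
kept <- fake_dat[cols_to_keep]

head(kept, 5)  # first five rows only
```

Keeping the names in one object is handy when the same set of columns is needed in several places.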

Option #7: which and %in%

If you are looking for a more creative, albeit more complex, solution, you can also turn to the which function and the %in% operator. The which function returns the positions of the elements in an object that meet a given condition. The %in% operator, in turn, checks whether each element of one vector appears in another vector and returns a logical vector (TRUE / FALSE) indicating whether a match was found.

Using both which and %in%, you can select the Gender and HS_Grad columns by:

  1. creating a vector of column names to select;

  2. writing a which call that returns the position of each of those columns in the data frame;

  3. referencing the name of the data frame followed by square brackets; and

  4. placing the positions returned by which within the square brackets:

See this content in the original post
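The steps above can be sketched as follows. The object names wanted and col_pos are hypothetical, and fake_dat is rebuilt with an assumed seed so the snippet stands alone.

```r
set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Step 1: vector of column names to select (hypothetical name)
wanted <- c("Gender", "HS_Grad")

# Step 2: %in% flags matching names; which() turns the flags into positions
col_pos <- which(names(fake_dat) %in% wanted)
print(col_pos)
# [1] 2 4

# Steps 3 and 4: index the data frame with those positions
picked <- fake_dat[col_pos]

head(picked, 5)  # first five rows only
```

One thing to keep in mind with this approach: the columns come back in the order they appear in the data frame, not the order listed in wanted.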

Selecting Multiple Columns – Tidyverse

Option #8: select

The select function from dplyr is still your best bet for extracting multiple columns from a data frame in the tidyverse. To use this method:

  • reference the data frame (or tibble) followed by the pipe operator ( %>% );

  • type the name select and, within parentheses, list the names of the columns separated by a comma.

See this content in the original post
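Put together, a dplyr sketch for two columns might look like this, with dplyr loaded directly and a hypothetical fake_dat built from an assumed seed.

```r
library(dplyr)  # provides select() and the %>% pipe

set.seed(123)  # assumed seed, for reproducibility
fake_dat <- data.frame(
  ID         = 1:100,
  Gender     = sample(c("Woman", "Non-Binary", "Man"), 100, replace = TRUE),
  Low_Income = sample(0:1, 100, replace = TRUE),
  HS_Grad    = sample(0:1, 100, replace = TRUE)
)

# Column names listed inside select(), separated by a comma
two_tidy <- fake_dat %>%
  select(Gender, HS_Grad)

head(two_tidy, 5)  # first five rows only
```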

There are so many ways to select columns by name in R, and this post has only covered a few of the most common. What is your favorite method? Share your code in the comments below.
