Analytics Made Accessible

Select Columns of a Specific Type in R

When analyzing a data set, sometimes you only want to include columns (or variables) that meet certain criteria. For example, your analysis may require only numeric variables, or perhaps only text data are of interest. This post examines two techniques for selecting columns of a specific type from a data frame in R, one in Base R and the other in the tidyverse.

SET UP

First, load the tidyverse packages and import the data set. For this example, we are going to use a fictitious educational data set with 4 variables and 1276 observations:

  1. ID: An identification variable (numbers range from 1 to 1276)

  2. FT or PT: A categorical variable indicating whether a student applied for full-time (FT) or part-time (PT) admission to the graduate degree program

  3. GRE Verbal: The applicant's GRE Verbal score (scored on a 130-170 scale).

  4. Low Income: A nominal variable indicating whether a participant was raised in a low-income household (Y) or not (N).
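
The post's import code and data file are not reproduced here. As a stand-in, here is a minimal sketch that simulates a data frame with the same shape as the one described above. The object name grad_data, the underscore-style column names, and every value are assumptions made for illustration; none of them come from the original data.

```r
# The post loads the full tidyverse; dplyr alone covers select() and where().
library(dplyr)

# Simulated stand-in for the post's data set (random values, NOT the real data).
set.seed(42)  # make the random draws reproducible
n <- 1276

grad_data <- data.frame(
  ID = 1:n,                                             # identifier, 1 to 1276
  FT_or_PT = sample(c("FT", "PT"), n, replace = TRUE),  # full-time or part-time
  GRE_Verbal = sample(130:170, n, replace = TRUE),      # 130-170 scale
  Low_Income = sample(c("Y", "N"), n, replace = TRUE),  # Y or N
  stringsAsFactors = FALSE                              # keep text as character
)

# Inspect the column types
str(grad_data)
```

With real data you would replace the data.frame() call with whatever import function matches your file format.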

Base R

In Base R, an easy way to select columns based on data type (e.g., numeric vs. character) is to combine the which function with sapply. The which function returns the positions of the elements in an object that meet a given condition. Say you create a vector called "x" with three elements: the numbers one, two, and three. Using which, we can return the position of the element equal to one by writing:
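
The original snippet is not available here, so the following is a minimal sketch of what such a call might look like, assuming the vector is built with c() and the condition tested is equality with one:

```r
# A vector with three elements
x <- c(1, 2, 3)

# which() returns the positions where the condition is TRUE
pos <- which(x == 1)
pos
# [1] 1
```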

In this case, a one was returned because the number one sits in the first position of "x".

sapply (from the family of apply functions) is used to apply a function to each element of a vector. So, if we wanted to add one to each number of the vector x we previously created, we can write:
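
As a rough illustration (the original snippet is not reproduced here, and the anonymous function below is just one way to express "add one"):

```r
x <- c(1, 2, 3)

# Apply an anonymous function to each element of x
x_plus_one <- sapply(x, function(i) i + 1)
x_plus_one
# [1] 2 3 4
```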

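A data frame is a list of columns, so sapply can run a test such as is.numeric on every column, and which then turns the TRUE/FALSE results into column positions. Therefore, by combining sapply and which, we can write an expression that returns the indices of the columns that meet a specific condition. So, say we wanted to extract only those columns that are numeric, we can write: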
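
Here is one way this could look. It is a sketch rather than the post's exact code: grad_data is a small simulated stand-in with hypothetical column names and made-up values.

```r
# Small simulated stand-in for the post's data (hypothetical names and values)
grad_data <- data.frame(
  ID = 1:6,
  FT_or_PT = c("FT", "PT", "FT", "FT", "PT", "PT"),
  GRE_Verbal = c(150, 162, 145, 158, 170, 133),
  Low_Income = c("Y", "N", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Test every column, then convert the TRUE/FALSE results to column positions
numeric_idx <- which(sapply(grad_data, is.numeric))

# drop = FALSE keeps the result a data frame even if only one column matches
numeric_cols <- grad_data[, numeric_idx, drop = FALSE]
head(numeric_cols, 5)
```

The drop = FALSE argument is an addition of this sketch. Without it, a selection that matches a single column comes back as a plain vector instead of a data frame.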

(Note, for brevity, only the first five rows are shown.)

On the other hand, to extract columns that are character class, replace is.numeric with is.character:
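
A sketch of the character version, again using a small simulated data frame whose name, column names, and values are assumptions for illustration:

```r
# Small simulated stand-in for the post's data (hypothetical names and values)
grad_data <- data.frame(
  ID = 1:6,
  FT_or_PT = c("FT", "PT", "FT", "FT", "PT", "PT"),
  GRE_Verbal = c(150, 162, 145, 158, 170, 133),
  Low_Income = c("Y", "N", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE  # keep text columns as character, not factor
)

# Same pattern as before, with is.character as the test
char_idx <- which(sapply(grad_data, is.character))
char_cols <- grad_data[, char_idx, drop = FALSE]
head(char_cols, 5)
```

If your text columns were read in as factors (the data.frame() default before R 4.0.0), is.character returns FALSE for them and is.factor would be the test to use.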

Tidyverse

Performing this operation is much simpler in the tidyverse. All you need is two lines of code. Two functions power this method: select and where. The select function extracts columns from a data frame by their name or their properties (e.g., type). The where function is a selection helper that, when paired with select, keeps only the columns for which a test such as is.numeric returns TRUE.

So, to extract only those columns that are numeric, we can write:
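
A minimal sketch, assuming dplyr 1.0.0 or later (where is not available in earlier versions) and the same simulated stand-in data with hypothetical names and values:

```r
library(dplyr)

# Small simulated stand-in for the post's data (hypothetical names and values)
grad_data <- data.frame(
  ID = 1:6,
  FT_or_PT = c("FT", "PT", "FT", "FT", "PT", "PT"),
  GRE_Verbal = c(150, 162, 145, 158, 170, 133),
  Low_Income = c("Y", "N", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# where() applies is.numeric to each column; select() keeps the ones that pass
numeric_cols <- grad_data %>%
  select(where(is.numeric))

head(numeric_cols, 5)
```

Unlike the bracket-indexing approach, select always returns a data frame, even when only one column matches.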

Replace is.numeric with is.character to return columns that are character class:
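
Sketched with the same assumptions as before (dplyr 1.0.0 or later, simulated data with hypothetical names and values):

```r
library(dplyr)

# Small simulated stand-in for the post's data (hypothetical names and values)
grad_data <- data.frame(
  ID = 1:6,
  FT_or_PT = c("FT", "PT", "FT", "FT", "PT", "PT"),
  GRE_Verbal = c(150, 162, 145, 158, 170, 133),
  Low_Income = c("Y", "N", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Keep only the columns for which is.character returns TRUE
char_cols <- grad_data %>%
  select(where(is.character))

head(char_cols, 5)
```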

And there you have it! Pretty simple, right?

The next time you need to select columns with specific data types, try one of these methods.

Do you have a preferred approach? Let me know in the comments.
