Analytics Made Accessible


The Fundamentals of Filtering in R

** Download the R syntax and data used in this post **

(Data) Filtering is the process of selecting rows or columns from a data set based on specific criteria. One reason for "filtering" data is to remove variables that are no longer needed. Other reasons may include dropping a subset of observations with missing values or selecting cases where a variable contains a specific string or is equal to a particular value. No matter how you slice it, "filtering" is a fundamental task that allows you to dig deeper into your data and discover new insights.

There are many ways to filter data in R. This post will focus exclusively on how to select observations based on one condition. My next post will offer techniques on how to choose cases for analysis based on multiple exclusion or inclusion criteria.

--

But before we jump into how to apply filtering techniques in R, let's review several operators and functions that are helpful to know.

Relational & Logical Operators

Relational and logical operators are used to evaluate expressions or compare two (or more) conditions or statements.

R has several built-in relational and logical operators. The most frequently used operators include:

See this content in the original post
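As a rough sketch, the operators most R users reach for look like the following. The values of x and y are illustrative and not taken from the original post:

```r
# Illustrative values (assumptions, not from the original post)
x <- 4
y <- 10

# Relational operators compare two values and return TRUE or FALSE
x == y   # equal to
x != y   # not equal to
x < y    # less than
x > y    # greater than
x <= y   # less than or equal to
x >= y   # greater than or equal to

# Logical operators combine or negate conditions
(x < y) & (y > 5)   # AND: TRUE only if both conditions are TRUE
(x > y) | (y > 5)   # OR: TRUE if at least one condition is TRUE
!(x == y)           # NOT: flips TRUE to FALSE and vice versa
```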

Open R and enter the following expression in the console:

See this content in the original post

Of course, it will return TRUE because 5 is greater than 3. On the other hand, the following expression:

See this content in the original post

will return FALSE because 7 is greater than 2. For fun, try out the examples below:

See this content in the original post
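A minimal sketch of expressions matching the description above. The first two follow the text; the rest are illustrative examples of my own choosing:

```r
5 > 3    # TRUE: 5 is greater than 3
7 < 2    # FALSE: 7 is not less than 2

# A few more to try (illustrative, not from the original post)
10 == 10            # TRUE
"cat" != "dog"      # TRUE
(5 > 3) & (7 < 2)   # FALSE: both sides must be TRUE
(5 > 3) | (7 < 2)   # TRUE: one side is enough
!(5 > 3)            # FALSE
```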

%in% operator

The %in% operator evaluates whether elements of a vector exist in another vector. The operator returns a logical vector (TRUE / FALSE), indicating whether a match was found. As an example, consider the following: You are interested in whether the number 3 exists within the vector t, which is an ordered sequence of numbers from 1 to 15 increasing in increments of 2.

See this content in the original post

Running the code will return a logical vector of TRUE, as the number 3 exists within the vector "t."
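One way this check might be written, assuming the vector is built with seq():

```r
# Odd numbers from 1 to 15. Naming a variable t is allowed,
# although it shares its name with the base function t().
t <- seq(from = 1, to = 15, by = 2)
t
# 1 3 5 7 9 11 13 15

3 %in% t   # TRUE
4 %in% t   # FALSE: even numbers are not in the sequence
```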

A more complex example:

See this content in the original post

returns:

See this content in the original post

because the first, third, and ninth elements of the sequence of numbers 1 through 9 match the values 1, 3, and 9 from the variable xy.

Whereas:

See this content in the original post

returns:

See this content in the original post

As all three elements of the variable xy exist within the sequence of numbers 1 through 9.
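Both directions can be sketched as follows, assuming xy is a numeric vector holding 1, 3, and 9. Note that the result always has the same length as the vector on the left-hand side of %in%:

```r
xy <- c(1, 3, 9)

# Which elements of 1:9 appear in xy? (result has 9 elements)
1:9 %in% xy
# TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE

# Which elements of xy appear in 1:9? (result has 3 elements)
xy %in% 1:9
# TRUE TRUE TRUE
```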

grep & grepl

grep and grepl are Base R functions used for pattern matching. Both functions have two main arguments: (1) a character string to match (pattern) and (2) a vector (x) in which matches are sought. grep returns the indices of the elements of x that yielded a (pattern) match. grepl, on the other hand, returns a logical vector.

So, applying the grep function in the example below:

See this content in the original post

would return:

See this content in the original post

because the first, second, third, and fifth elements of the character vector c("mary", "had","a","little","lamb") contain the letter a.

Whereas:

See this content in the original post

returns:

See this content in the original post

as the fourth element of the vector is the only component that does not contain the letter a.
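A short sketch of both calls on the vector described above. The variable name words is my own choice:

```r
words <- c("mary", "had", "a", "little", "lamb")

# grep returns the positions of the elements that contain the pattern
grep("a", words)
# 1 2 3 5

# grepl returns one TRUE/FALSE per element
grepl("a", words)
# TRUE TRUE TRUE FALSE TRUE

# value = TRUE makes grep return the matching elements themselves
grep("a", words, value = TRUE)
# "mary" "had" "a" "lamb"
```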

str_detect & str_which

str_detect and str_which from the stringr package are grepl and grep's more upscale cousins. Like grep, str_which returns the position of a (matched) pattern, whereas str_detect, like grepl, returns a logical vector indicating whether there is a match (TRUE) or not (FALSE).

So, using str_which, searching for any occurrences of the letter a in the character vector c("mary", "had","a","little","lamb"):

See this content in the original post

would return:

See this content in the original post

And using str_detect:

See this content in the original post

would produce a logical vector showing TRUE or FALSE where there is a match:

See this content in the original post
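The stringr versions might look like this, assuming the stringr package is installed. Unlike grep and grepl, the string comes first and the pattern second:

```r
library(stringr)

words <- c("mary", "had", "a", "little", "lamb")

# Positions of the elements that contain "a"
str_which(words, "a")
# 1 2 3 5

# One TRUE/FALSE per element
str_detect(words, "a")
# TRUE TRUE TRUE FALSE TRUE
```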

Of course, you can describe, match, and compare more complex patterns in strings. However, that is beyond the scope of this post. Wickham & Grolemund have a great primer in Chapter 14 of their R for Data Science book for those interested in matching patterns with regular expressions.

Okay, now we are ready to jump in!

Filtering by a single condition

Set Up

First, let's load the tidyverse packages and create some fake data to play with. We are going to create a data frame with 4 variables and 80 observations:

  1. ID: An identification variable (numbers range from 1 to 80).

  2. Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man.

  3. Low_Income: A binary variable indicating whether a participant's household is low-income (1) or not (0).

  4. College_Grad: A binary variable indicating whether a participant graduated from college (1) or not (0).

See this content in the original post
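A data set like the one described above could be simulated as follows. This is a sketch, not the original code: the seed and the use of sample() are assumptions, so the values will differ from those in the original post.

```r
library(tidyverse)

set.seed(2022)  # arbitrary seed so the fake data are reproducible

fake_dat <- data.frame(
  ID = 1:80,
  Gender = sample(c("Woman", "Non-Binary", "Man"), size = 80, replace = TRUE),
  Low_Income = sample(c(0, 1), size = 80, replace = TRUE),
  College_Grad = sample(c(0, 1), size = 80, replace = TRUE)
)

# Peek at the first few rows
head(fake_dat)
```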

Dollar Sign ($) Operator

You probably learned about the $ operator when you first started in R. The $ operator is one way to extract (or change) elements of an object.

For instance, you could extract the "Gender" column from the fake_dat data frame like so:

See this content in the original post

But what if you want to know: which rows contain participants who identify as women?

One solution is to use the which function:

See this content in the original post

which would return the indices of the rows that contain the string "Woman" in the Gender column. Of course, alone, these numbers are not particularly helpful. To select the observations that meet our condition, the row indices must be used within a larger function. This is where the (single) square bracket operator becomes useful.
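A sketch of both steps. So that the snippet runs on its own, it rebuilds a deterministic stand-in for fake_dat; the stand-in values and the name women_rows are assumptions of mine:

```r
# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

# Extract the Gender column as a vector
gender <- fake_dat$Gender
head(gender)

# Row indices where Gender equals "Woman"
women_rows <- which(fake_dat$Gender == "Woman")
head(women_rows)
# 1 4 7 10 13 16
```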

 

(Single) Square Bracket ([ ]) Operator

The square bracket operator is another way to access (or modify) elements of an object. For example, you could subset the fake_dat data frame by the "Gender" column by name, like so:

See this content in the original post

or by column index (i.e., location of the column in the data frame), like so:

See this content in the original post
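Both forms can be sketched like so, again using a stand-in for fake_dat (an assumption so the snippet runs on its own). With a base data frame, selecting a single column this way returns a vector unless drop = FALSE is set:

```r
# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

by_name  <- fake_dat[, "Gender"]   # select by column name
by_index <- fake_dat[, 2]          # Gender is the second column

identical(by_name, by_index)
# TRUE

# Keep the result as a one-column data frame instead of a vector
gender_df <- fake_dat[, "Gender", drop = FALSE]
```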

Drawing from our earlier example: say you want to retain all available information for participants who identify as "Women." One possibility is to use the which function with the square bracket operator:

See this content in the original post

Another is to use the grep function:

See this content in the original post

You could even save the indices generated using the which or grep functions to a variable and use that variable to select the desired observations.

See this content in the original post
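The three approaches might be written as below. The stand-in data and the object names (women_which, women_grep, idx, women_idx) are hypothetical. Row indices go before the comma; leaving the column position empty keeps every column:

```r
# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

# which() inside the brackets
women_which <- fake_dat[which(fake_dat$Gender == "Woman"), ]

# grep() inside the brackets (matching is case-sensitive,
# so the pattern "Woman" does not match "Man")
women_grep <- fake_dat[grep("Woman", fake_dat$Gender), ]

# Save the indices first, then reuse them
idx <- which(fake_dat$Gender == "Woman")
women_idx <- fake_dat[idx, ]

nrow(women_which)
# 27
```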

str_which also works, as does which combined with grepl, str_detect, or %in%.

See this content in the original post
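A sketch of these four variants, assuming the stringr package is installed. The stand-in data and the names w1 through w4 are hypothetical:

```r
library(stringr)

# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

w1 <- fake_dat[str_which(fake_dat$Gender, "Woman"), ]
w2 <- fake_dat[which(grepl("Woman", fake_dat$Gender)), ]
w3 <- fake_dat[which(str_detect(fake_dat$Gender, "Woman")), ]
w4 <- fake_dat[which(fake_dat$Gender %in% "Woman"), ]

# All four select the same rows
identical(w1, w2) && identical(w2, w3) && identical(w3, w4)
# TRUE
```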

While all the options I shared are viable ways to subset your data by a single condition, two functions make filtering tasks much easier to implement in the R environment: subset from Base R and filter from the dplyr package.

 

Subset Function

As its name suggests, subset is a nifty little function that allows you to select variables and filter observations based on certain conditions. The function has three main arguments: (1) an object to be subsetted; (2) an expression specifying which elements (in our case rows) to keep; and (3) an expression specifying which columns from a data frame to retain (by default, all columns are kept).

So, using subset, you could select observations where the value of the Gender column is equal to the string “Woman” like so:

See this content in the original post

Alternatively, you could incorporate other functions like grepl or str_detect to select these observations:

See this content in the original post
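A sketch of subset in use, with a stand-in for fake_dat and hypothetical object names. Inside subset, columns can be referred to by name without the fake_dat$ prefix. The str_detect line assumes stringr is installed:

```r
library(stringr)

# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

# Exact match on the Gender column
women_subset <- subset(fake_dat, Gender == "Woman")

# Pattern matching with grepl or str_detect
women_grepl  <- subset(fake_dat, grepl("Woman", Gender))
women_detect <- subset(fake_dat, str_detect(Gender, "Woman"))

# The third argument, select, limits which columns are kept
women_cols <- subset(fake_dat, Gender == "Woman", select = c(ID, Gender))
```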

filter

filter from the dplyr package is a preferred method for many who use R for data analytics. filter has two main arguments: (1) a data frame and (2) a conditional expression that evaluates to TRUE or FALSE. Using the filter function, you could select observations where the value of the Gender column is equal to the string “Woman” like so:

See this content in the original post

Like subset, functions and operators like grepl, str_detect, or %in% can easily be incorporated into filter to select observations that meet a specific criterion:

See this content in the original post
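The filter calls might look like this, assuming dplyr and stringr are installed. The stand-in data and object names are hypothetical, and the pipe (%>%) is simply one common way to write the call:

```r
library(dplyr)
library(stringr)

# Stand-in for fake_dat (illustrative values, not the original data)
fake_dat <- data.frame(
  ID = 1:80,
  Gender = rep(c("Woman", "Non-Binary", "Man"), length.out = 80),
  Low_Income = rep(c(0, 1), times = 40),
  College_Grad = rep(c(1, 1, 0, 0), times = 20)
)

# Exact match
women_filter <- filter(fake_dat, Gender == "Woman")

# The same selection with grepl, str_detect, and %in%
women_grepl  <- fake_dat %>% filter(grepl("Woman", Gender))
women_detect <- fake_dat %>% filter(str_detect(Gender, "Woman"))
women_in     <- fake_dat %>% filter(Gender %in% "Woman")

nrow(women_filter)
# 27
```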

There is an (almost) endless number of ways to filter data in R, and this post has only scratched the surface. What is your favorite method for filtering data in R? Share your code in the comments below.

Need help defining more complex filtering tasks? Stay tuned for my next post.
