The Fundamentals of Filtering in R
(Data) Filtering is the process of selecting rows or columns from a data set based on specific criteria. One reason for "filtering" data is to remove variables that are no longer needed. Other reasons may include dropping a subset of observations with missing values or selecting cases where a variable contains a specific string or is equal to a particular value. No matter how you slice it, "filtering" is a fundamental task that allows you to dig deeper into your data and discover new insights.
There are many ways to filter data in R. This post will focus exclusively on how to select observations based on one condition. My next post will offer techniques on how to choose cases for analysis based on multiple exclusion or inclusion criteria.
But before we jump into how to apply filtering techniques in R, let's review several operators and functions that are helpful to know.
Relational & Logical Operators
Relational and logical operators are used to evaluate expressions or compare two (or more) conditions or statements.
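R has several built-in relational and logical operators. The most frequently used include the relational operators < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to), and the logical operators ! (not), & (and), and | (or).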
Open R and enter the expression 5 > 3 in the console. It will return TRUE because 5 is greater than 3. On the other hand, the expression 7 < 2 will return FALSE because 7 is greater than 2, not less than it. For fun, try out a few more comparisons of your own.
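A minimal sketch of the kinds of comparisons described above. The specific expressions beyond 5 > 3 are illustrative choices, and the `checks` vector is only a convenient way to collect the results:

```r
# Relational operators compare two values; logical operators combine or
# negate the results. Each expression evaluates to TRUE or FALSE.
checks <- c(
  greater_than = 5 > 3,                # TRUE
  less_than    = 7 < 2,                # FALSE
  equal_to     = 4 == 4,               # TRUE
  not_equal    = 4 != 4,               # FALSE
  and          = (5 > 3) & (7 < 2),    # FALSE: both sides must be TRUE
  or           = (5 > 3) | (7 < 2),    # TRUE: one TRUE side is enough
  not          = !(7 < 2)              # TRUE: negation of FALSE
)
print(checks)
```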
%in% operator
The %in% operator evaluates whether elements of a vector exist in another vector. The operator returns a logical vector (TRUE / FALSE), indicating whether a match was found. As an example, consider the following: You are interested in whether the number 3 exists within the vector t, which is an ordered sequence of numbers from 1 to 15 increasing in increments of 2.
Running the code will return a logical vector of TRUE, as the number 3 exists within the vector "t."
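One way the example above might be written, assuming the vector is built with seq (the name `found` is illustrative):

```r
# An ordered sequence from 1 to 15 in increments of 2: 1 3 5 7 9 11 13 15
t <- seq(1, 15, by = 2)

# Is the number 3 one of the elements of t?
found <- 3 %in% t
print(found)  # TRUE
```

Note that t is also the name of the Base R transpose function, so a different variable name may be safer in real scripts.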
A more complex example:
returns:
As the first, third, and ninth elements of the sequence of numbers 1 through 9 match the values 1, 3, and 9 from the variable xy.
Whereas:
returns:
As all three elements of the variable xy exist within the sequence of numbers 1 through 9.
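The two comparisons above can be sketched as follows, assuming xy holds the values 1, 3, and 9 as the text describes (`seq_in_xy` and `xy_in_seq` are illustrative names). The result always has the same length as the vector on the left of %in%:

```r
xy <- c(1, 3, 9)

# Which elements of 1:9 appear in xy? One TRUE/FALSE per element of 1:9.
seq_in_xy <- seq(1, 9) %in% xy
print(seq_in_xy)  # TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE

# Which elements of xy appear in 1:9? One TRUE/FALSE per element of xy.
xy_in_seq <- xy %in% seq(1, 9)
print(xy_in_seq)  # TRUE TRUE TRUE
```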
grep & grepl
grep and grepl are Base R functions used for pattern matching. Both functions have two main arguments: (1) the pattern to search for (pattern), given as a character string or regular expression, and (2) a character vector (x) where matches are sought. grep returns the indices of the elements of x that matched the pattern. grepl, on the other hand, returns a logical vector with one TRUE or FALSE per element of x.
So, applying the grep function in the example below:
would return:
because the first, second, third, and fifth elements of the character vector c("mary", "had","a","little","lamb") contain the letter a.
Whereas:
returns:
as the fourth element of the vector is the only component that does not contain the letter a.
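A short sketch of both calls on the vector from the text (the names `words`, `idx`, and `hits` are illustrative):

```r
words <- c("mary", "had", "a", "little", "lamb")

# grep returns the positions of the elements that contain the pattern
idx <- grep("a", words)
print(idx)   # 1 2 3 5

# grepl returns TRUE/FALSE for every element
hits <- grepl("a", words)
print(hits)  # TRUE TRUE TRUE FALSE TRUE
```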
str_detect & str_which
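str_which and str_detect from the stringr package are grep and grepl's more upscale cousins. Like grep, str_which returns the positions of the elements that match a pattern, whereas str_detect, like grepl, returns a logical vector indicating whether there is a match (TRUE) or not (FALSE). Note that the stringr functions take the character vector first and the pattern second, the reverse of grep and grepl.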
So, using str_which, searching for any occurrences of the letter a in the character vector c("mary", "had","a","little","lamb"):
would return:
And using str_detect:
would produce a logical vector showing TRUE or FALSE where there is a match:
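The stringr versions might look like this, assuming the stringr package is installed (the names `words`, `positions`, and `detected` are illustrative):

```r
library(stringr)

words <- c("mary", "had", "a", "little", "lamb")

# str_which: positions of the elements that contain "a" (like grep)
positions <- str_which(words, "a")
print(positions)  # 1 2 3 5

# str_detect: TRUE/FALSE for every element (like grepl)
detected <- str_detect(words, "a")
print(detected)   # TRUE TRUE TRUE FALSE TRUE
```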
Of course, you can describe, match, and compare more complex patterns in strings. However, that is beyond the scope of this post. Wickham & Grolemund have a great primer in Chapter 14 of their R for Data Science book for those interested in matching patterns with regular expressions.
Okay, now we are ready to jump in!
Filtering by a single condition
Set Up
First, let's load the tidyverse packages and create some fake data to play with. We are going to create a data frame with 4 variables and 80 observations:
ID: An identification variable (numbers range from 1 to 80).
Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man.
Low_Income: A binary variable indicating whether a participant's household is low-income (1) or not (0).
College_Grad: A binary variable indicating whether a participant graduated from college (1) or not (0).
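A minimal sketch of how such a data frame could be built. The seed, the use of sample, and the equal sampling probabilities are assumptions; only the name fake_dat and the four columns come from this post. Base R's data.frame is used here so the sketch runs without extra packages, although loading the tidyverse first (library(tidyverse)) works just as well:

```r
# Assumed seed so the random draws are reproducible
set.seed(123)

fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# Inspect the structure: 80 observations of 4 variables
str(fake_dat)
```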
Dollar Sign ($) Operator
You probably learned about the $ operator when you first started in R. The $ operator is one way to extract (or change) elements of an object.
For instance, you could extract the "Gender" column from the fake_dat data frame like so:
But what if you want to know: which rows contain participants who identify as women?
One solution is to use the which function:
which would return the indices of the rows that contain the string "Woman" in the Gender column. Of course, alone, these numbers are not particularly helpful. To select the observations that meet our condition, the row indices must be used within a larger function. This is where the (single) square bracket operator becomes useful.
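The two steps above can be sketched as follows. The data are rebuilt so the block runs on its own; the seed and sampling scheme are assumptions, and `genders` and `woman_rows` are illustrative names:

```r
# Rebuild the fake data (assumed seed and sampling scheme)
set.seed(123)
fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# Extract the Gender column as a vector
genders <- fake_dat$Gender
head(genders)

# Row indices where Gender is equal to "Woman"
woman_rows <- which(fake_dat$Gender == "Woman")
print(woman_rows)
```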
(Single) Square Bracket ([ ]) Operator
The square bracket operator is another way to access (or modify) elements of an object. For example, you could subset the fake_dat data frame by the "Gender" column by name, like so:
or by column index (i.e., location of the column in the data frame), like so:
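A sketch of both forms, assuming Gender is the second column as in the variable list above (the data setup repeats the assumed seed and sampling scheme; `by_name` and `by_index` are illustrative names):

```r
set.seed(123)
fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# Select the column by name; the result is a one-column data frame
by_name <- fake_dat["Gender"]

# Select the same column by its position in the data frame
by_index <- fake_dat[2]

identical(by_name, by_index)  # TRUE
```

With a single index inside the brackets, the result stays a data frame, whereas fake_dat$Gender returns a plain vector.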
Drawing from our earlier example: say you want to retain all available information for participants who identify as "Women." One possibility is to use the which function with the square bracket operator:
Another is to use the grep function:
You could even save the indices generated using the which or grep functions to a variable and use that variable to select the desired observations.
Other combinations also work: str_which on its own, or the which function wrapped around grepl, str_detect, or %in%.
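The Base R options above can be sketched as follows (the data setup repeats the assumed seed and sampling scheme, and the object names are illustrative). The row indices go before the comma, and leaving the space after the comma empty keeps every column:

```r
set.seed(123)
fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# which + square brackets
women_which <- fake_dat[which(fake_dat$Gender == "Woman"), ]

# grep + square brackets
women_grep <- fake_dat[grep("Woman", fake_dat$Gender), ]

# Save the indices first, then use them to select rows
woman_idx <- which(fake_dat$Gender == "Woman")
women_saved <- fake_dat[woman_idx, ]

# which wrapped around grepl or %in%
women_grepl <- fake_dat[which(grepl("Woman", fake_dat$Gender)), ]
women_in    <- fake_dat[which(fake_dat$Gender %in% "Woman"), ]

head(women_which)
```

The stringr functions slot in the same way, for example str_which(fake_dat$Gender, "Woman") in place of the grep call.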
While all the options I shared are viable ways to subset your data by a single condition, two functions make filtering tasks much easier to implement in the R environment: subset from Base R and filter from the dplyr package.
Subset Function
As its name suggests, subset is a nifty little function that allows you to select variables and filter observations based on certain conditions. The function has three main arguments: (1) an object to be subsetted; (2) an expression specifying which elements (in our case rows) to keep; and (3) an expression specifying which columns from a data frame to retain (by default, all columns are kept).
So, using subset, you could select observations where the value of the Gender column is equal to the string “Woman” like so:
Alternatively, you could incorporate other functions like grepl or str_detect to select these observations:
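A sketch of these subset calls (the data setup repeats the assumed seed and sampling scheme, and the object names are illustrative). The last call also uses the select argument to keep only two columns:

```r
set.seed(123)
fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# Keep rows where Gender equals "Woman"; columns are referenced by bare name
women_eq <- subset(fake_dat, Gender == "Woman")

# The same selection using grepl inside the condition
women_grepl <- subset(fake_dat, grepl("Woman", Gender))

# Filter rows and keep only the ID and Gender columns
women_slim <- subset(fake_dat, Gender == "Woman", select = c(ID, Gender))

head(women_eq)
```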
Filter Function
filter from the dplyr package is a preferred method for many who use R for data analytics. filter has two main arguments: (1) a data frame and (2) a conditional expression that evaluates to TRUE or FALSE for each row. Using the filter function, you could select observations where the value of the Gender column is equal to the string “Woman” like so:
Like subset, functions and operators like grepl, str_detect, or %in% can easily be incorporated into filter to select observations that meet a specific criterion:
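A sketch of the filter calls described above, assuming the dplyr package is installed (the data setup repeats the assumed seed and sampling scheme, and the object names are illustrative):

```r
suppressPackageStartupMessages(library(dplyr))

set.seed(123)
fake_dat <- data.frame(
  ID           = 1:80,
  Gender       = sample(c("Woman", "Non-Binary", "Man"), 80, replace = TRUE),
  Low_Income   = sample(0:1, 80, replace = TRUE),
  College_Grad = sample(0:1, 80, replace = TRUE)
)

# Keep rows where Gender equals "Woman"
women_eq <- filter(fake_dat, Gender == "Woman")

# The same selection with grepl and with %in%
women_grepl <- filter(fake_dat, grepl("Woman", Gender))
women_in    <- filter(fake_dat, Gender %in% c("Woman"))

head(women_eq)
```

Because %in% accepts a vector on the right-hand side, the last form extends naturally to several accepted values, for example Gender %in% c("Woman", "Non-Binary").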
There is an (almost) endless number of ways to filter data in R, and this post has only scratched the surface. What is your favorite method for filtering data in R? Share your code in the comments below.
Need help defining more complex filtering tasks? Stay tuned for my next post.