Filtering by Multiple Conditions in R
** Download the R syntax and data used in this post **
In my last post, I shared the fundamentals of filtering data by a single condition in R. This post examines several approaches to selecting observations from a data frame using multiple conditions.
Most data practitioners engage in complex filtering tasks involving multiple conditions (and often, several variables). Expressions containing logical operators, like AND, OR, or NOT, must be used to implement these filtering procedures. Importantly, combining multiple conditions in an expression is possible using these operators.
AND ( & ) vs. OR ( | )
AND and OR are two (logical) operators used to combine multiple filter conditions. When applied, the AND operator returns only those observations that satisfy all filter conditions. On the other hand, the OR operator returns only those cases that satisfy at least one condition.
For example, say you were interested in creating a subset of your data that only includes participants who identify as low-income women. The AND operator would be the appropriate option, as two conditions must be met: low-income (condition 1) women (condition 2).
In contrast, say you wanted to curate a subset of your data that only includes male participants who have earned either a bachelor's degree or a graduate degree. Here, the OR operator would be best because you do not want to cull a sample of all men. Instead, the interest is in creating a data set that includes men who have earned a bachelor's degree (condition 1) OR men who have earned a graduate degree (condition 2).
NOT ( ! )
NOT is another logical operator. Characterized by the exclamation point ( ! ), NOT negates an expression. So, applying the operator to an expression that normally would return TRUE would cause it to return FALSE and vice versa.
So, say you wanted to create a subset of your data that excludes missing values for all variables. In other words: you want to curate a data set with complete cases. The NOT operator can be used to select rows with complete cases.
SET UP
First, let's load the tidyverse packages and create some fake data to play with. For this example, we are going to create a data frame with 4 variables and 120 observations:
ID: An identification variable (numbers range from 1 to 120)
Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man
Low_Income: A binary variable indicating whether a participants' household is low-income:
Education: A nominal variable with five categories indicating the highest level of education a participant completed.
### PACKAGES ###
# create notin operator
`%notin%` <- Negate(`%in%`)
# Check if the required packages are installed
# And if they are not installed, install them
pckgs <- c("tidyverse","rvest")
if (any(pckgs %notin% rownames(installed.packages())==TRUE)){
install.packages(pckgs, repos = c(CRAN = "http://cloud.r-project.org"))}
sapply(pckgs, FUN = require, character.only = TRUE)
#install tidyverse if you have never used it before
install.packages("tidyverse")
library(tidyverse)
set.seed(0721)
### CREATE FAKE DATA SET ###
fake_dat <-
tibble(ID = 1:120, Gender = sample(x = c("Woman","Non-Binary","Man",NA_character_), size = 120, replace = TRUE, prob = c(0.35,0.30,0.20,0.15)), Low_Income = sample(x = c(0,1,NA_real_), size = 120, replace = TRUE, prob = c(0.60,0.25,0.15)),Education = sample(x = c("less than high school", "high school graduate", "associate's degree/some college", "bachelor's degree", "graduate degree",NA_real_), size = 120, replace = TRUE))
Let's try some examples.
SEVERAL EXAMPLES
Example 1: Low-Income Women: Say you wanted to create a subset of your data that only includes participants who identify as low-income women.
Option 1: (Single) Square Bracket with which
One option is to use the square bracket operator with the which function.
## Option 1: (Single) Square Bracket & which
fake_dat[which(fake_dat$Low_Income == 1 & fake_dat$Gender == "Woman"),]
Note: the ampersand symbol ( & ) combines the two conditions.
Option 2: subset
subset is another possibility.
## Option 2: subset
subset(fake_dat, Low_Income == 1 & Gender == "Woman")
## Option 3: filter
filter(fake_dat, Low_Income == 1 & Gender == "Woman")
# could also be written as
# fake_dat %>% filter(Low_Income == 1 & Gender == "Woman")
Example 2: Men with a bachelor's degree or graduate degree: As another example, say you wanted to extract a data subset that only includes male participants who have earned either a bachelor's degree or a graduate degree.
Option 1: (Single) Square Bracket with which
Like the previous example, one option is to use the square bracket operator with the which function.
## Option 1: (Single) Square Bracket & which
fake_dat[which(fake_dat$Gender == "Man" & fake_dat$Education == "bachelor's degree" | fake_dat$Gender == "Man" & fake_dat$Education == "graduate degree"),]
## Option 2: filter
filter(fake_dat, Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")
# could also be written as
# fake_dat %>% filter(Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")
Option 3: subset
But if Base R is more your speed, then subset is yet another possibility.
## Option 3: subset
subset(fake_dat, Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")
Options 1 through 3 are practical ways to filter a data set when at least one condition must be met. However, using these methods can also produce lengthy code blocks. (Imagine if there were three or more education categories?)
Option 4: subset / filter & %in%
You can make your code easier to read by creating a vector containing some of the repetitive filter criteria. Here, that translates into creating a character vector with the two education categories "bachelor's degree" and "graduate degree" and using the object in a function like subset or filter:
# create the character vector
edu_cat1 <- c("bachelor's degree","graduate degree")
## Option 4A: filter & %in%
fake_dat %>% filter(Gender == "Man" & Education %in% edu_cat1)
## Option 4B: subset & %in%
subset(fake_dat, Gender == "Man" & Education %in% edu_cat1)
Option 5: subset / filter & grepl, or str_detect
Another approach is to embed the character vector in another function, like grepl or str_detect. However, these functions only accept a vector with a single element for the pattern argument by default. A way around this hurdle is to use regular expressions.
So when paired with grepl or str_detect, extracting the selected rows requires an extra step. For our example, this means separating the two education categories by the OR symbol ( | ):
# create the character vector
edu_cat2 <- paste(c("bachelor's degree","graduate degree"), collapse = "|")
## Option 5A: filter & grepl
fake_dat %>% filter(Gender == "Man" & grepl(edu_cat2,Education))
## Option 5B: filter & str_detect
fake_dat %>% filter(Gender == "Man" & str_detect(Education, edu_cat2))
Example 3: Create a data subset with complete cases. The final example is all about extracting all non-missing observations from the data set.
Option 1: (Single) Square Bracket w/complete.cases
One option is to use the square bracket operator with the complete.cases function. (The complete.cases function produces a logical vector that indicates which observations are NOT missing. So, using the function with the square bracket operator will create a subset without missing observations.)
## Option 1: (Single) Square Bracket & complete.cases
fake_dat[complete.cases(fake_dat),]
## Option 2A: subset
subset(fake_dat, !is.na(Gender) & !is.na(Low_Income) & !is.na(Education))
## Option 2B: filter
filter(fake_dat, !is.na(Gender) & !is.na(Low_Income) & !is.na(Education))
But using these methods can lead to lengthy code blocks if your data set has many variables.
Option 3: filter + across
across is a new-ish function in the tidyverse. across makes it easy to apply a function (or series of functions) across multiple columns. When it comes to creating a subset without missing values, across is a lifesaver. across has two main arguments: (1) columns to transform and (2) functions to apply to each column.
### Option 4: filter + across
fake_dat %>% filter(across(.cols = everything(), ~ !is.na(.)))
# could also be written where variable names are explicitly referenced
# fake_dat %>% filter(across(Gender:Education, ~ !is.na(.)))
Have a favorite filtering method? Use the comments to share your code.