Filtering by Multiple Conditions in R

Apr 13

** Download the R syntax and data used in this post **

An image of a Venn diagram with two circles. Both circles are a light grey color and the where the circles overlap is shaded in a dark grey-almost black color

In my last post, I shared the fundamentals of filtering data by a single condition in R. This post examines several approaches to selecting observations from a data frame using multiple conditions.

Most data practitioners engage in complex filtering tasks involving multiple conditions (and often, several variables). Expressions containing logical operators, like AND, OR, or NOT, must be used to implement these filtering procedures. Importantly, combining multiple conditions in an expression is possible using these operators.

AND ( & ) vs. OR ( | )

AND and OR are two (logical) operators used to combine multiple filter conditions. When applied, the AND operator returns only those observations that satisfy all filter conditions. On the other hand, the OR operator returns only those cases that satisfy at least one condition.

For example, say you were interested in creating a subset of your data that only includes participants who identify as low-income women. The AND operator would be the appropriate option, as two conditions must be met: low-income (condition 1) women (condition 2).

In contrast, say you wanted to curate a subset of your data that only includes male participants who have earned either a bachelor's degree or a graduate degree. Here, the OR operator would be best because you do not want to cull a sample of all men. Instead, the interest is in creating a data set that includes men who have earned a bachelor's degree (condition 1) OR men who have earned a graduate degree (condition 2).

NOT ( ! )

NOT is another logical operator. Characterized by the exclamation point ( ! ), NOT negates an expression. So, applying the operator to an expression that normally would return TRUE would cause it to return FALSE and vice versa.

So, say you wanted to create a subset of your data that excludes missing values for all variables. In other words: you want to curate a data set with complete cases. The NOT operator can be used to select rows with complete cases.

SET UP

First, let's load the tidyverse packages and create some fake data to play with. For this example, we are going to create a data frame with 4 variables and 120 observations:

ID: An identification variable (numbers range from 1 to 120)
Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man
Low_Income: A binary variable indicating whether a participants' household is low-income:
Education: A nominal variable with five categories indicating the highest level of education a participant completed.

### PACKAGES ###
# create notin operator
`%notin%` <- Negate(`%in%`)

# Check if the required packages are installed
# And if they are not installed, install them
pckgs <- c("tidyverse","rvest")

if (any(pckgs %notin% rownames(installed.packages())==TRUE)){
  install.packages(pckgs, repos = c(CRAN = "http://cloud.r-project.org"))}
  
sapply(pckgs, FUN = require, character.only = TRUE)

#install tidyverse if you have never used it before
install.packages("tidyverse")
library(tidyverse)

set.seed(0721)

### CREATE FAKE DATA SET ###
fake_dat <- 
tibble(ID  = 1:120, Gender = sample(x = c("Woman","Non-Binary","Man",NA_character_), size = 120, replace = TRUE, prob = c(0.35,0.30,0.20,0.15)), Low_Income = sample(x = c(0,1,NA_real_), size = 120, replace = TRUE, prob = c(0.60,0.25,0.15)),Education = sample(x = c("less than high school", "high school graduate", "associate's degree/some college", "bachelor's degree", "graduate degree",NA_real_), size = 120, replace = TRUE))

Let's try some examples.

SEVERAL EXAMPLES

Example 1: Low-Income Women: Say you wanted to create a subset of your data that only includes participants who identify as low-income women.

Option 1: (Single) Square Bracket with which

One option is to use the square bracket operator with the which function.

## Option 1: (Single) Square Bracket & which
fake_dat[which(fake_dat$Low_Income == 1 & fake_dat$Gender == "Woman"),]

Note: the ampersand symbol ( & ) combines the two conditions.

Option 2: subset

subset is another possibility.

## Option 2: subset
subset(fake_dat, Low_Income == 1 & Gender == "Woman")

Option 3: filter

filter from the dplyr package is yet another option.

## Option 3: filter
filter(fake_dat, Low_Income == 1 & Gender == "Woman")
# could also be written as
# fake_dat %>% filter(Low_Income == 1 & Gender == "Woman")

Example 2: Men with a bachelor's degree or graduate degree: As another example, say you wanted to extract a data subset that only includes male participants who have earned either a bachelor's degree or a graduate degree.

Option 1: (Single) Square Bracket with which

Like the previous example, one option is to use the square bracket operator with the which function.

## Option 1: (Single) Square Bracket & which
fake_dat[which(fake_dat$Gender == "Man" & fake_dat$Education == "bachelor's degree" | fake_dat$Gender == "Man" & fake_dat$Education == "graduate degree"),]

Note: the OR operator ( | ) separates the two filter criteria.

Option 2: filter

filter from the dplyr package is also a popular choice.

## Option 2: filter
filter(fake_dat, Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")
# could also be written as
# fake_dat %>% filter(Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")

Option 3: subset

But if Base R is more your speed, then subset is yet another possibility.

## Option 3: subset
subset(fake_dat, Gender == "Man" & Education == "bachelor's degree" | Gender == "Man" & Education == "graduate degree")

Options 1 through 3 are practical ways to filter a data set when at least one condition must be met. However, using these methods can also produce lengthy code blocks. (Imagine if there were three or more education categories?)

Option 4: subset / filter & %in%

You can make your code easier to read by creating a vector containing some of the repetitive filter criteria. Here, that translates into creating a character vector with the two education categories "bachelor's degree" and "graduate degree" and using the object in a function like subset or filter:

# create the character vector
edu_cat1 <- c("bachelor's degree","graduate degree")

## Option 4A: filter & %in%
fake_dat %>% filter(Gender == "Man" & Education %in%  edu_cat1)

## Option 4B: subset & %in%
subset(fake_dat, Gender == "Man" & Education %in%  edu_cat1)

Option 5: subset / filter & grepl, or str_detect

Another approach is to embed the character vector in another function, like grepl or str_detect. However, these functions only accept a vector with a single element for the pattern argument by default. A way around this hurdle is to use regular expressions.

So when paired with grepl or str_detect, extracting the selected rows requires an extra step. For our example, this means separating the two education categories by the OR symbol ( | ):

# create the character vector
edu_cat2 <- paste(c("bachelor's degree","graduate degree"), collapse = "|")

## Option 5A: filter & grepl
fake_dat %>% filter(Gender == "Man" & grepl(edu_cat2,Education))

## Option 5B: filter & str_detect
fake_dat %>% filter(Gender == "Man" & str_detect(Education, edu_cat2))

Example 3: Create a data subset with complete cases. The final example is all about extracting all non-missing observations from the data set.

Option 1: (Single) Square Bracket w/complete.cases

One option is to use the square bracket operator with the complete.cases function. (The complete.cases function produces a logical vector that indicates which observations are NOT missing. So, using the function with the square bracket operator will create a subset without missing observations.)

## Option 1: (Single) Square Bracket & complete.cases
fake_dat[complete.cases(fake_dat),]

Option 2: subset or filter

subset and filter are also options.

## Option 2A: subset 
subset(fake_dat, !is.na(Gender) & !is.na(Low_Income) & !is.na(Education))

## Option 2B: filter 
filter(fake_dat, !is.na(Gender) & !is.na(Low_Income) & !is.na(Education))

But using these methods can lead to lengthy code blocks if your data set has many variables.

Option 3: filter + across

across is a new-ish function in the tidyverse. across makes it easy to apply a function (or series of functions) across multiple columns. When it comes to creating a subset without missing values, across is a lifesaver. across has two main arguments: (1) columns to transform and (2) functions to apply to each column.

### Option 4: filter + across
fake_dat %>% filter(across(.cols = everything(), ~ !is.na(.)))
# could also be written where variable names are explicitly referenced
# fake_dat %>% filter(across(Gender:Education, ~ !is.na(.)))

Have a favorite filtering method? Use the comments to share your code.

p2p1DataAnalysis

Ama Nyame-Mensah https://www.anyamemensah.com

Filtering by Multiple Conditions in R

Apply a Process to Multiple Data Sets

The Fundamentals of Filtering in R