Filtering by Multiple Conditions in R
** Download the R syntax and data used in this post **
In my last post, I shared the fundamentals of filtering data by a single condition in R. This post examines several approaches to selecting observations from a data frame using multiple conditions.
Most data practitioners engage in complex filtering tasks involving multiple conditions (and often, several variables). Expressions containing logical operators, like AND, OR, or NOT, must be used to implement these filtering procedures. Importantly, combining multiple conditions in an expression is possible using these operators.
AND ( & ) vs. OR ( | )
AND and OR are two (logical) operators used to combine multiple filter conditions. When applied, the AND operator returns only those observations that satisfy all filter conditions. On the other hand, the OR operator returns only those cases that satisfy at least one condition.
For example, say you were interested in creating a subset of your data that only includes participants who identify as low-income women. The AND operator would be the appropriate option, as two conditions must be met: low-income (condition 1) women (condition 2).
In contrast, say you wanted to curate a subset of your data that only includes male participants who have earned either a bachelor's degree or a graduate degree. Here, the OR operator would be best because you do not want to cull a sample of all men. Instead, the interest is in creating a data set that includes men who have earned a bachelor's degree (condition 1) OR men who have earned a graduate degree (condition 2).
NOT ( ! )
NOT is another logical operator. Characterized by the exclamation point ( ! ), NOT negates an expression. So, applying the operator to an expression that normally would return TRUE would cause it to return FALSE and vice versa.
So, say you wanted to create a subset of your data that excludes missing values for all variables. In other words: you want to curate a data set with complete cases. The NOT operator can be used to select rows with complete cases.
SET UP
First, let's load the tidyverse packages and create some fake data to play with. For this example, we are going to create a data frame with 4 variables and 120 observations:
ID: An identification variable (numbers range from 1 to 120)
Gender: A nominal variable indicating whether a participant identifies as a Woman, Non-Binary, or Man
Low_Income: A binary variable indicating whether a participants' household is low-income:
Education: A nominal variable with five categories indicating the highest level of education a participant completed.
Let's try some examples.
SEVERAL EXAMPLES
Example 1: Low-Income Women: Say you wanted to create a subset of your data that only includes participants who identify as low-income women.
Option 1: (Single) Square Bracket with which
One option is to use the square bracket operator with the which function.
Note: the ampersand symbol ( & ) combines the two conditions.
Option 2: subset
subset is another possibility.
Example 2: Men with a bachelor's degree or graduate degree: As another example, say you wanted to extract a data subset that only includes male participants who have earned either a bachelor's degree or a graduate degree.
Option 1: (Single) Square Bracket with which
Like the previous example, one option is to use the square bracket operator with the which function.
Note: the OR operator ( | ) separates the two filter criteria.
Option 2: filter
Option 3: subset
But if Base R is more your speed, then subset is yet another possibility.
Options 1 through 3 are practical ways to filter a data set when at least one condition must be met. However, using these methods can also produce lengthy code blocks. (Imagine if there were three or more education categories?)
Option 4: subset / filter & %in%
You can make your code easier to read by creating a vector containing some of the repetitive filter criteria. Here, that translates into creating a character vector with the two education categories "bachelor's degree" and "graduate degree" and using the object in a function like subset or filter:
Option 5: subset / filter & grepl, or str_detect
Another approach is to embed the character vector in another function, like grepl or str_detect. However, these functions only accept a vector with a single element for the pattern argument by default. A way around this hurdle is to use regular expressions.
So when paired with grepl or str_detect, extracting the selected rows requires an extra step. For our example, this means separating the two education categories by the OR symbol ( | ):
Example 3: Create a data subset with complete cases. The final example is all about extracting all non-missing observations from the data set.
Option 1: (Single) Square Bracket w/complete.cases
One option is to use the square bracket operator with the complete.cases function. (The complete.cases function produces a logical vector that indicates which observations are NOT missing. So, using the function with the square bracket operator will create a subset without missing observations.)
But using these methods can lead to lengthy code blocks if your data set has many variables.
Option 3: filter + across
across is a new-ish function in the tidyverse. across makes it easy to apply a function (or series of functions) across multiple columns. When it comes to creating a subset without missing values, across is a lifesaver. across has two main arguments: (1) columns to transform and (2) functions to apply to each column.
Have a favorite filtering method? Use the comments to share your code.