Apply a Process to Multiple Data Sets
I spend a fair amount of time convincing people to streamline their data processes and workflows. One question I routinely get asked is, "How do you apply the same function to multiple data files?" This question comes up most often when I talk with folks who have multiple data sets with the same variables. Let me tell you what I mean.
Sometimes, information retrieved from a database can only be downloaded in batches, say 1000 records at a time. Other times, data from the same source are stored in multiple files. Either way, the data contained within the resulting files have the same variables but have different respondents.
And because these files all have the same structure (e.g., the same column names, the same number of columns), the same procedure can be applied to all data sets.
In this post, I will share my favorite methods for applying the same process to multiple data frames in R. And yes, two of those methods use functions from the apply family!
Set Up
About the Data
This example relies on a database of 5000 cases representing applicants for admission to all graduate degree programs at a fictitious university. The database was divided into four data sets. Each data set has 14 variables and more than one thousand observations:
ID: An identification variable (numbers range from 1 to 5000).
Email: Email address the applicant used to apply to their degree program.
Email (preferred): The applicant's preferred email address.
Decision: A nominal variable indicating whether the program admitted (A) or rejected (R) the applicant, or if the application was withdrawn before a decision was rendered (W).
Application Response: A nominal variable indicating whether the applicant Accepted (A) or Declined (D) the program's offer of enrollment.
FT or PT: A nominal variable indicating whether the student applied for Full-Time (ft) or Part-time (pt) enrollment, or if they are a submatriculate (sub).
GRE Verbal: The applicant's GRE Verbal score (scored on a 130–170 scale).
GRE Quant: The applicant's GRE Quantitative score (scored on a 130–170 scale).
GRE Analytical: The applicant's GRE Analytical Writing score (scored on a scale of 0–6 in half-point increments).
Applied Dual Degree: A nominal variable indicating whether the applicant applied for dual enrollment (YES) or not (NODUAL).
Birth Date: A variable indicating the Month, Day, and Year the applicant was born (MM/DD/YYYY).
Gender: A nominal variable indicating whether an applicant identifies as a Man (M), Non-Binary (NB), or Woman (W).
Ethnicity: A nominal variable indicating the ethnicity and/or race of the applicant: American Indian or Native Alaskan (AN), Asian or Pacific Islander (AP), Black or African American (B), Hispanic (HL), White (W), or Middle Eastern (ME).
Low Income: A nominal variable indicating whether an applicant's household is low-income (Y) or not (N).
Load Packages + Import Data
Next, load the data.table, mgsub, and tidyverse packages, then import the four data sets. During this step, create a character vector containing four names; in the final step, those names are assigned to the elements of the list that holds the four imported data sets.
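As a minimal sketch of the import step, the code below reads four files into a named list. The file names, the object names (dat_names, dat), the toy columns (ID, ft_pt), and the use of base read.csv are all assumptions made for illustration. To run on its own, the sketch first writes four tiny stand-in files to a temporary folder.

```r
# Packages named in the post; this base R sketch does not strictly need them.
# library(data.table); library(mgsub); library(tidyverse)

# Hypothetical stand-in files: four tiny CSVs written to a temporary folder.
in_dir <- file.path(tempdir(), "admissions_raw")
dir.create(in_dir, showWarnings = FALSE)
files <- file.path(in_dir, paste0("applicants_", 1:4, ".csv"))
for (k in seq_along(files)) {
  toy <- data.frame(ID = (k - 1) * 3 + 1:3, ft_pt = c("ft", "pt", "sub"))
  write.csv(toy, files[k], row.names = FALSE)
}

# Character vector of names for the imported data sets (hypothetical names).
dat_names <- c("dat1", "dat2", "dat3", "dat4")

# Import every file with the same call, then name the list elements.
dat <- lapply(files, read.csv)
names(dat) <- dat_names

# One-line summary of each list element.
str(dat, max.level = 1)
```

Keeping the data sets in one named list, rather than four separate objects, is what makes every option later in the post a short call.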
Apply a Process to Multiple Data Sets
For this example, we will apply a simple data processing technique: filtering. (If you want to learn the basics of data filtering in R, check out my post, The Fundamentals of Filtering in R.)
Here, our goal is simple: retain students who applied for Full-Time (ft) enrollment across the four data sets. Importantly, we do not want to assign the same list names to our newly processed data sets, nor do we want to write over the existing data frames.
Before diving in, create a character object that contains the new names that will be applied to the filtered data sets.
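One way to build those names, assuming the imported data sets are called dat1 through dat4 (hypothetical names) and that a "_ft" suffix is an acceptable convention:

```r
# Hypothetical names of the imported data sets.
dat_names <- c("dat1", "dat2", "dat3", "dat4")

# New names for the filtered copies, so the originals are not overwritten.
new_names <- paste0(dat_names, "_ft")
print(new_names)
```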
Ok, onwards.
Option # 1: for loop
Control flow is a fundamental programming concept. It refers to how statements are executed in a program. A for loop is a type of control flow statement that executes a set of statements over elements of a vector.
Hadley Wickham notes that the basic form of a for loop is:
for (item in vector) do_something
This roughly translates to: for each item in vector, do_something is called once, and the value of item is updated each time.
Below is a simple example of a for loop. Note how i updates to the next letter in the sequence after each iteration.
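A minimal sketch of such a loop, using the first three elements of R's built-in letters vector. The seen object is an addition for illustration; it simply records each value that i takes.

```r
# Loop over the first three letters of the built-in `letters` vector.
seen <- character(0)
for (i in letters[1:3]) {
  print(i)            # i is "a", then "b", then "c"
  seen <- c(seen, i)  # keep a record of each value i took
}
```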
Given that a for loop can iterate over elements of a vector, we can use this control flow statement to accomplish our task:
Note: the placeholder object (datR) is created before executing the loop.
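Here is a sketch of how the loop might look. The toy data, the ft_pt column name, and the new_names object are assumptions that stand in for the imported data sets; only the placeholder name datR comes from the note above.

```r
# Toy stand-ins for the four imported data sets (hypothetical names and columns).
dat <- lapply(1:4, function(k) {
  data.frame(ID = (k - 1) * 3 + 1:3, ft_pt = c("ft", "pt", "sub"))
})
names(dat) <- paste0("dat", 1:4)
new_names <- paste0(names(dat), "_ft")

# Placeholder list, created before the loop runs.
datR <- vector("list", length(dat))
names(datR) <- new_names

# Filter each data set in turn and store the result in the placeholder.
for (i in seq_along(dat)) {
  d <- dat[[i]]
  datR[[i]] <- d[d$ft_pt %in% "ft", ]  # %in% treats missing values as non-matches
}
```

Because the results are written to datR under new names, the original list dat is left untouched.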
Option # 2: lapply
lapply is another option. The apply family of functions (of which lapply is a part) is a backbone of base R. These functions let you apply a function over a list or vector. lapply has two main arguments: X, a vector (atomic or list), and FUN, the function to apply to each element of X. It always returns a list of the same length as X.
The example for loop using letters can be rewritten using lapply:
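A minimal sketch of both steps: first the letters loop expressed with lapply, then the same idea applied to the filtering task. The toy data, the ft_pt column, and the new_names object are assumptions for illustration.

```r
# The letters loop as an lapply call: print() is applied to each element.
res <- lapply(letters[1:3], print)

# Toy stand-ins for the four imported data sets (hypothetical names and columns).
dat <- lapply(1:4, function(k) {
  data.frame(ID = (k - 1) * 3 + 1:3, ft_pt = c("ft", "pt", "sub"))
})
names(dat) <- paste0("dat", 1:4)
new_names <- paste0(names(dat), "_ft")

# Filter every data set; lapply returns a new list, so `dat` is untouched.
datR <- lapply(dat, function(d) d[d$ft_pt %in% "ft", ])
names(datR) <- new_names
```

Unlike the for loop, no placeholder object is needed, because lapply builds the output list itself.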
Option # 3: sapply
Another possibility is sapply. Like lapply, sapply applies a function to each element of a vector. Where they differ is that sapply tries to simplify the output to a vector or matrix when possible, whereas lapply always returns a list. Using sapply, the code to accomplish the task could be written as:
Note: a user-defined function is being used within sapply.
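A sketch of one way to write this, assuming a hypothetical helper called keep_ft and the same toy data as a stand-in for the imported data sets. Note the simplify = FALSE argument: with the default simplification, a list of data frames with the same number of columns can be collapsed into a matrix, which is not what we want here.

```r
# Toy stand-ins for the four imported data sets (hypothetical names and columns).
dat <- lapply(1:4, function(k) {
  data.frame(ID = (k - 1) * 3 + 1:3, ft_pt = c("ft", "pt", "sub"))
})
names(dat) <- paste0("dat", 1:4)
new_names <- paste0(names(dat), "_ft")

# User-defined function (hypothetical name): keep full-time applicants only.
keep_ft <- function(d) d[d$ft_pt %in% "ft", ]

# simplify = FALSE keeps each result as its own data frame in a list.
datR <- sapply(dat, keep_ft, simplify = FALSE)
names(datR) <- new_names
```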
Option # 4: map
map from the purrr package is yet another choice. map is very similar to lapply. That is, it is used to apply a function to each element of a vector. The function has two main arguments: .x, a list or atomic vector; and .f, which is typically a function or formula.
Using map, the task can be accomplished like so:
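A minimal sketch with map, assuming the purrr package is installed. The toy data, the ft_pt column, and the new_names object are stand-ins for illustration. The function is written as a formula, where .x refers to the current data set.

```r
library(purrr)

# Toy stand-ins for the four imported data sets (hypothetical names and columns).
dat <- lapply(1:4, function(k) {
  data.frame(ID = (k - 1) * 3 + 1:3, ft_pt = c("ft", "pt", "sub"))
})
names(dat) <- paste0("dat", 1:4)
new_names <- paste0(names(dat), "_ft")

# Apply the same filter to each data set; map always returns a list.
datR <- map(dat, ~ .x[.x$ft_pt %in% "ft", ])
names(datR) <- new_names
```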
Putting it All Together
After processing, do not forget to export a copy of the filtered data sets to your directory for safe keeping!
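The export can reuse the same idea of looping over the list. This is a sketch only: the output folder, the file names, and the choice of CSV are assumptions, and the toy datR list stands in for the filtered data sets.

```r
# Toy stand-in for the list of filtered data sets (hypothetical names and columns).
datR <- lapply(1:4, function(k) data.frame(ID = (k - 1) * 3 + 1, ft_pt = "ft"))
names(datR) <- paste0("dat", 1:4, "_ft")

# Hypothetical output folder; a temporary directory is used so the sketch is safe to run.
out_dir <- file.path(tempdir(), "filtered")
dir.create(out_dir, showWarnings = FALSE)

# Write one CSV per filtered data set, named after its list element.
for (nm in names(datR)) {
  write.csv(datR[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}

list.files(out_dir)
```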
And VOILA! Pretty easy, right?
The next time you find yourself having to process multiple data sets that contain the same variables, use one of my methods to streamline the task.
Do you have a preferred method? Let me know in the comments.