Analytics Made Accessible


Variable Renaming in R

** Download the R syntax and data files referenced in the post **

Over the years, I have developed thousands of data cleaning scripts in R. And without a doubt, recoding (i.e., recoding variable names, recoding variable values) is a task that I keep coming back to year after year.

There are many avenues for recoding variables and values in R. Some solutions are easy to implement, while others account for unknown or changing parts of your process.


MAKING RECODING EASY

Nearly a decade ago, I worked with one of the most brilliant women I have ever met. She taught me the most valuable data lesson I still carry with me today: executing a successful workflow requires meticulous preparation. To her, and for me, that means developing a codebook.

The "Codebook Method"

Not all codebooks are created equal. But most would agree that the value of a codebook lies in its ability to provide a full accounting of the variables in a data set. Successful codebooks typically contain the following information about each variable in the data set:

  • Variable Name: The name assigned to each variable

  • New Variable Name: A new name you wish to assign to each variable

  • Variable Label: A brief description outlining what the variable is measuring.

  • Level of Measurement: How the variable was measured (e.g., nominal, ordinal, interval, or ratio scale)

  • Variable Values: List of the values a variable can take

  • Value Labels: Descriptions for each unique variable value

While many opt to generate a codebook using an R package, I always create mine manually. That way, I can control what information is extracted and used during the data preprocessing phase.

In this post and the next, I will share some of my favorite methods for recoding variable names and values in the R environment (without too much of a hassle).

This post will focus on variable renaming. The next post will cover my favorite options for recoding variable values.

--

For this post, I generated a fake data set and created a sample codebook, much like one I would use during a typical data cleaning procedure. The codebook has six columns:

  • new_name: A new name I want to assign to each variable in the data set

  • old_name: The original name assigned to each variable in the data set

  • variable label: A description of the variable

  • type: Identifies what level of measurement the variable has

  • values: List of the values each variable can have

  • value.labels: Descriptions for each unique variable value
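As a rough illustration, a codebook with these six columns might look like the following in R. Every variable name, label, and value below is hypothetical, made up for this sketch rather than taken from the post's downloadable files.

```r
# Hypothetical codebook with the six columns described above.
# check.names = FALSE keeps the space in "variable label".
codebook <- data.frame(
  new_name = c("age", "gender", "satisfaction"),
  old_name = c("Q1", "Q2", "Q3"),
  `variable label` = c(
    "Respondent age in years",
    "Respondent gender",
    "Overall satisfaction rating"
  ),
  type = c("ratio", "nominal", "ordinal"),
  values = c("18 to 99", "1, 2", "1, 2, 3, 4, 5"),
  value.labels = c(
    NA,
    "1 = Female; 2 = Male",
    "1 = Very dissatisfied to 5 = Very satisfied"
  ),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One row per variable, six columns of metadata
dim(codebook)
```

Only the first two columns (new_name and old_name) are needed for renaming; the rest come into play when recoding values.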

VARIABLE RENAMING

Back in the day, I remember praying to the data deities that the variables I had to rename were in exactly the same order as the columns in the data set (or imported data frame) I was transforming. Of course, that is completely crazy. Seriously, do not count on the variables being in the same order.

Despite this (almost) certainty, I am sure when you learned R, you renamed variables like this:

See this content in the original post
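The embedded code is not reproduced in this copy of the post. A minimal sketch of the position-based approach being described, using a hypothetical three-column data frame called survey, would be:

```r
# Hypothetical data frame standing in for the post's fake data set
survey <- data.frame(
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)

# Rename by position: the i-th new name goes to the i-th column,
# whatever that column happens to be.
names(survey) <- c("age", "gender", "satisfaction")

names(survey)
```

Nothing in this assignment checks that the first column really is the age variable, which is exactly the risk raised next.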

Don't get me wrong: that is a viable way to rename columns in a data frame.

 **BUT**

 What happens when the data set's columns are not in the same order?

 What is more, what if you have to rename HUNDREDS of columns?

 I don't know about you, but I am NOT taking any chances on accidentally recoding variable names incorrectly, NOR would I spend the time to sit and type each new column name enclosed in quotation marks and followed by a comma. Seriously, who has the time to do that?!

Well, using my trusty "codebook method" you can cut down on the amount of time you spend developing R scripts and go on your merry data way.

So, what are some better options for renaming variables in R?  Read on to learn more about my favorite variable renaming methods.

 

Setup

First, load the packages and import the fake data set and codebook into your R environment.

See this content in the original post
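The original setup code is not reproduced here. The sketch below shows one plausible shape for it, assuming the data and codebook are CSV files. Because the real file names are unknown, it first writes two small hypothetical files to temporary paths so that it runs on its own; the object names survey and codebook are also assumptions.

```r
# Packages used by the renaming options in this post
library(data.table)
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the downloadable files
data_path <- tempfile(fileext = ".csv")
codebook_path <- tempfile(fileext = ".csv")

write.csv(
  data.frame(
    id = 1:3,
    Q1 = c(25, 31, 47),
    Q2 = c("female", "male", "female"),
    Q3 = c(3, 5, 4)
  ),
  data_path,
  row.names = FALSE
)
write.csv(
  data.frame(
    new_name = c("satisfaction", "age", "gender"),
    old_name = c("Q3", "Q1", "Q2"),
    type = c("ordinal", "ratio", "nominal")
  ),
  codebook_path,
  row.names = FALSE
)

# Import the data set
survey <- read.csv(data_path)

# Import the codebook and keep only its first two columns
codebook <- read.csv(codebook_path) %>% select(1:2)
```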

Note: the last line of code imports (and selects) the first two columns of the codebook into the R environment. If you remember, this is where the new and old variable names are stored. Once imported, we can use this information for various tasks, such as renaming our data frame columns.

Alrighty, let's get to the fun stuff.

Option 1: Match (Packages used: Base R)

For those still skeptical of using the Tidyverse packages, fear not! The match function, from Base R, will get the job done.

The match function is pretty nifty. For each value or string you search for (in our case, a column name), it returns the index position of the first match in a second vector (in our case, the old names listed in the codebook). The best part? You can still use the assignment operator to assign the new names to the variables in our data set as shown below:

See this content in the original post
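Since the embedded code is not shown here, the following is a minimal sketch of a match-based rename under stated assumptions: survey and codebook are hypothetical toy objects, and the guard for columns that have no codebook entry (the id column below) is an addition for safety, not something the post describes.

```r
# Hypothetical data and codebook; note the codebook rows are in a
# different order from the data frame's columns.
survey <- data.frame(
  id = 1:3,
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)
codebook <- data.frame(
  new_name = c("satisfaction", "age", "gender"),
  old_name = c("Q3", "Q1", "Q2")
)

# Work on a copy so the original names stay available
renamed <- survey

# For each current column name, find its row in the codebook
# (NA when the column is not listed there)
idx <- match(names(renamed), codebook$old_name)

# Rename only the columns that were found in the codebook
found <- !is.na(idx)
names(renamed)[found] <- codebook$new_name[idx[found]]

names(renamed)
```

Because the lookup is done by name rather than by position, the order of the codebook rows does not matter.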

There is, however, *one* downside to using this method: it overwrites the existing names of the referenced data frame. That is why I tend to create a copy of the data frame before using this procedure.

 

Option 2: setnames (Packages used: data.table)

Another option is the setnames function from the data.table package (not to be confused with the setNames function from Base R). data.table is probably one of my favorite R packages. It has a function for everything, including renaming things. The setnames function has three main arguments: x (a data frame or data table), old (in our case, a vector containing the old variable names), and new (a vector containing the new variable names). A less commonly used argument, skip_absent, is also worth setting: with skip_absent = TRUE, old names that are missing from x are skipped instead of triggering an error.

See this content in the original post
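The embedded code is missing from this copy, so here is a minimal sketch of the setnames call, assuming the same kind of hypothetical survey and codebook objects. The extra codebook row for Q9, a column that does not exist in the data, is included only to show what skip_absent does.

```r
library(data.table)

# Hypothetical data and codebook
survey <- data.frame(
  id = 1:3,
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)
codebook <- data.frame(
  new_name = c("satisfaction", "age", "gender", "income"),
  old_name = c("Q3", "Q1", "Q2", "Q9")
)

# setnames changes names in place, so take an explicit copy first
# to keep the original data frame untouched.
renamed <- copy(survey)

setnames(
  renamed,
  old = codebook$old_name,
  new = codebook$new_name,
  skip_absent = TRUE  # Q9 is not in the data; skip it rather than error
)

names(renamed)
```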

Like Option 1, the setnames method overwrites the existing names of the referenced data frame. It does so by reference (in place), without making a copy first. So, beware.

 

Option 3: rename + deframe

(Packages used: dplyr and tibble; both can be loaded from the tidyverse)

Yet another popular option is the rename function from the dplyr package coupled with the deframe function from the tibble package. The rename function changes the names of individual variables, each written as a new_name = old_name pair. Typing those pairs by hand is a problem if you want to make many changes, and this is where deframe comes in. deframe converts a two-column data frame into a named vector, using the first column for the names and the second for the values. (Note: Because deframe returns a single named vector, do not forget to apply the bang-bang-bang (!!!) operator, which splices its elements into rename as separate arguments!)

See this content in the original post
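With the embedded code unavailable, the sketch below shows how rename and deframe can fit together, assuming a hypothetical codebook whose first column holds the new names and whose second column holds the old names.

```r
library(dplyr)
library(tibble)

# Hypothetical data and codebook
survey <- data.frame(
  id = 1:3,
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)
codebook <- data.frame(
  new_name = c("satisfaction", "age", "gender"),
  old_name = c("Q3", "Q1", "Q2")
)

# First column becomes the names, second column the values:
# c(satisfaction = "Q3", age = "Q1", gender = "Q2")
lookup <- deframe(codebook[, c("new_name", "old_name")])

# !!! splices the vector into rename() as new = "old" pairs
renamed <- rename(survey, !!!lookup)

names(renamed)
```

This form raises an error if the codebook lists an old name that is not in the data. One alternative that tolerates missing columns is rename(survey, any_of(lookup)).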

Option 4: rename + setNames / set_names

(Packages used: dplyr (can be loaded from the tidyverse), Base R, and rlang)

Option 4 is almost identical to Option 3. Like the deframe function, setNames creates a named vector. With setNames, however, you pass the two pieces explicitly: the vector of old names (which become the values) and the vector of new names (which become the names). (The same is true for set_names, which works like setNames but has more bells and whistles.)

(Note: Both setNames and set_names create named vectors, so do not forget to apply the bang-bang-bang (!!!) operator!)

See this content in the original post
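The embedded code is not shown in this copy. A minimal sketch of both variants, again using hypothetical survey and codebook objects, might look like this:

```r
library(dplyr)

# Hypothetical data and codebook
survey <- data.frame(
  id = 1:3,
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)
codebook <- data.frame(
  new_name = c("satisfaction", "age", "gender"),
  old_name = c("Q3", "Q1", "Q2")
)

# Base R: values are the old names, names are the new names
lookup_base <- setNames(codebook$old_name, codebook$new_name)

# rlang equivalent, same argument order (object first, names second)
lookup_rlang <- rlang::set_names(codebook$old_name, codebook$new_name)

# Either vector can be spliced into rename() with !!!
renamed_base <- rename(survey, !!!lookup_base)
renamed_rlang <- rename(survey, !!!lookup_rlang)

names(renamed_base)
```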

Option 5: rename_with + matches

(Packages used: dplyr and tidyselect from the tidyverse)

Last, but certainly not least, is rename_with (rename's cooler, less restrictive cousin). The major benefit of using rename_with over rename is that variable names can be changed using a function. What helps to power this method is the matches function from the tidyselect package. As its name suggests, matches selects variables whose names match a pattern (a regular expression).

See this content in the original post
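Because the embedded code is missing, the following is one possible sketch of rename_with driven by the codebook, not necessarily the post's exact code. The data are hypothetical, and two choices here are assumptions of this sketch: the pattern is anchored with ^ and $ so that Q1 cannot also match a column such as Q10, and case-insensitive matching is switched off.

```r
library(dplyr)

# Hypothetical data and codebook
survey <- data.frame(
  id = 1:3,
  Q1 = c(25, 31, 47),
  Q2 = c("female", "male", "female"),
  Q3 = c(3, 5, 4)
)
codebook <- data.frame(
  new_name = c("satisfaction", "age", "gender"),
  old_name = c("Q3", "Q1", "Q2")
)

# One anchored regular expression covering every old name: "^(Q3|Q1|Q2)$"
pattern <- paste0("^(", paste(codebook$old_name, collapse = "|"), ")$")

# The function receives the selected column names and returns
# their replacements, looked up in the codebook by name.
renamed <- rename_with(
  survey,
  .fn = function(old) codebook$new_name[match(old, codebook$old_name)],
  .cols = matches(pattern, ignore.case = FALSE)
)

names(renamed)
```

If the old names contain characters with a special meaning in regular expressions (a period, for instance), they would need escaping before being pasted into the pattern.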

See, a few lines of code are all you need to rename your variables. But of course, that means putting the time in to create a codebook.

But trust me, it is a small price to pay for flawlessly executing a workflow.

What is your favorite method for renaming variables in R? Share your code in the comments below.

Need one-on-one data support? Get in Touch!
