Renaming Columns in Python
**Download the Python script(s) and data referenced in this post.**
A while back, I received a message from someone asking why I do not have any posts on how to work with data in Python. The short answer is that I'm lazy. The long answer is that most of my clients use R as their programming language of choice, so most of my data content centers around this tool.
But in the spirit of diversifying, I thought I would answer one of the most popular Python-related questions I am often asked:
How do you efficiently rename columns in a DataFrame in Python?
My simple answer is to use a dictionary.
The importance of column names
Renaming columns may seem like a simple task. However, column names in tabular data are important. Using clear, consistent, and meaningful column names makes your data easier to understand, supports data integrity and quality, and helps end users understand the data content. (Not to mention, there are few things more annoying than encountering unexpected pipeline errors due to non-existent columns.)
There are many ways to rename columns of a DataFrame in Python. This post will demonstrate how to rename columns of a DataFrame using the Pandas (version 2.2.2) and Polars (version 1.12.0) libraries.
Note: All mentions of variables in this post refer to columns in a tabular data set or DataFrame, not objects. Python scripts used in this post were written in Visual Studio Code for MacOS (version 1.94.2).
Setup
For this post, I am using my favorite (fake) fruit data set (fruit_data.csv). This data set contains data from a survey study that asked adults in a large urban area to describe how much they liked 20 different fruits. The data set has 1,064 observations and 21 columns. Current column names are:
- i_d: Personal identification number.
- a: How much do you like apples?
- b: How much do you like blueberries?
- c: How much do you like cranberries?
- d: How much do you like dates?
- e: How much do you like elderberries?
- f: How much do you like figs?
- g: How much do you like grapes?
- h: How much do you like honeydew melons?
- i: How much do you like indian gooseberries?
- j: How much do you like jackfruits?
- k: How much do you like kiwis?
- l: How much do you like lemons?
- m: How much do you like mangos?
- n: How much do you like nectarines?
- o: How much do you like oranges?
- p: How much do you like papayas?
- q: How much do you like quinces?
- r: How much do you like raspberries?
- s: How much do you like strawberries?
- t: How much do you like tangerines?
Notice how the column names tell you next to nothing about the measured variable?
Accompanying the data is an abridged codebook (fruit_codebook.csv). (You can read more about my "Codebook Method" in my post Variable Renaming in R.) The codebook has six columns:
- old_name: The original name assigned to each variable in the data set.
- new_name: The new name you want to assign to each variable in the data set.
- variable_label: A brief description of what the variable measures.
- type: How the variable was measured (e.g., nominal, ordinal, interval, or ratio scale).
- values: List of the values a variable can take.
- value_labels: Descriptions for each unique variable value.
Below is a table showing the old column names (left column) and the new names (right column) they will be replaced with:
Renaming columns in a Pandas DataFrame
The easiest way to rename columns in a Pandas DataFrame is to use the .rename method. To replace some or all of the column names, you can use a dictionary with old column names as keys and new column names as values and then pass this dictionary to the columns parameter.
If you are a Python programmer/developer who is unfamiliar with dictionaries, I encourage you to learn more about them by reading through the dictionaries documentation. They are powerful and useful tools that can help you solve a wide range of problems.
Here's one way to implement a dictionary solution using Pandas:
### import library
import pandas as pd
### Set up
## codebook
# load codebook
fruit_names_cb = pd.read_csv("fruit_codebook.csv")
# create dictionary: keys are old names, values are new names
fruit_names_dict = dict(zip(fruit_names_cb['old_name'],
                            fruit_names_cb['new_name']))
Here is what the dictionary looks like:
{'i_d': 'person_id', 'a': 'apple', 'b': 'blueberry', 'c': 'cranberry', 'd': 'dates', 'e': 'elderberry', 'f': 'figs', 'g': 'grapes', 'h': 'honeydew', 'i': 'indian gooseberry', 'j': 'jackfruit', 'k': 'kiwi', 'l': 'lemon', 'm': 'mango', 'n': 'nectarine', 'o': 'orange', 'p': 'papaya', 'q': 'quince', 'r': 'raspberries', 's': 'strawberries', 't': 'tangerine'}
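Before renaming anything, it can be worth sanity-checking a dictionary built from a codebook. The sketch below is not part of the scripts for this post; `check_codebook`, `toy_columns`, and `toy_dict` are hypothetical names used only for illustration. It assumes you want to flag two things: columns the codebook does not cover, and new names that are assigned more than once.

```python
from collections import Counter


def check_codebook(columns, name_dict):
    """Return (unmapped columns, duplicated new names) for a rename dictionary.

    Hypothetical helper for illustration only.
    """
    # Columns in the data that have no entry in the dictionary
    unmapped = [col for col in columns if col not in name_dict]
    # New names that would be assigned to more than one column
    counts = Counter(name_dict.values())
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unmapped, duplicated


# Toy stand-ins for a DataFrame's column names and a codebook dictionary
toy_columns = ["i_d", "a", "b", "z"]
toy_dict = {"i_d": "person_id", "a": "apple", "b": "apple"}

unmapped, duplicated = check_codebook(toy_columns, toy_dict)
print(unmapped)    # columns without a new name
print(duplicated)  # new names used more than once
```

Because the helper only needs a list of column names, the same check should work whether the names come from a Pandas or a Polars DataFrame.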
The next step is to import the data:
## data set
fruit_data = pd.read_csv("fruit_data.csv")
Now, if you wanted to rename only one column, you could create a dictionary using curly braces {}, setting the old column name as the key and the new column name as the value (note: only partial output is shown):
### Example: renaming one column at a time
fruit_data.rename(columns = {"p":"papaya"})
i_d papaya i j l m k s t d g c a o b q f h r n e
1 1 0 0 0 2 2 1 2 1 2 0 2 1 0 0 1 2 0 2 2
2 1 1 2 0 2 2 2 0 1 0 0 2 0 2 2 2 0 2 1 2
3 0 1 1 2 1 0 1 0 0 0 2 2 0 1 1 1 0 2 2 1
4 2 1 1 2 2 0 0 1 2 0 0 1 2 1 0 2 1 1 0 0
5 1 2 0 2 1 2 1 0 2 0 1 0 2 1 0 0 0 2 0 1
But who has time for that? Instead, use the dictionary you created and pass it to the columns parameter:
### Example: renaming multiple columns in one go
fruit_data.rename(columns = fruit_names_dict)
person_id papaya indian gooseberry jackfruit lemon mango
1 1 0 0 0 2
2 1 1 2 0 2
3 0 1 1 2 1
4 2 1 1 2 2
5 1 2 0 2 1
Pretty simple, right?
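One detail worth keeping in mind: by default, Pandas' .rename returns a new DataFrame and leaves the original untouched, so the result needs to be assigned to a name if you want to keep it. Here is a minimal sketch with a made-up two-column DataFrame (`toy_data` and `toy_renamed` are illustrative names, not part of the fruit data set):

```python
import pandas as pd

# Toy stand-in for the fruit data: two columns, two rows
toy_data = pd.DataFrame({"i_d": [1, 2], "p": [1, 0]})

# rename() returns a renamed copy; toy_data itself is not modified
toy_renamed = toy_data.rename(columns={"i_d": "person_id", "p": "papaya"})

print(list(toy_data.columns))     # original names are still in place
print(list(toy_renamed.columns))  # new names live on the returned copy
```

Reassigning to the same name (for example, `toy_data = toy_data.rename(...)`) is a common way to keep only the renamed version.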
What if you selected a subset of the fruit_data data set and attempted to rename the columns in the new DataFrame using the dictionary? Per the pandas.DataFrame.rename documentation, keys that exist in a dictionary but do not have a corresponding column in a DataFrame will simply be ignored (note: only partial output is shown).
## rename a select few columns using the larger dictionary
# pull a subset of columns
fruit_data_sub = fruit_data.iloc[:,11:20]
c a o b q f h r n
0 2 1 0 0 1 2 0 2
0 2 0 2 2 2 0 2 1
2 2 0 1 1 1 0 2 2
0 1 2 1 0 2 1 1 0
1 0 2 1 0 0 0 2 0
# try to rename (no issues renaming)
fruit_data_sub.rename(columns = fruit_names_dict)
cranberry apple orange blueberry quince figs honeydew
0 2 1 0 0 1 2
0 2 0 2 2 2 0
2 2 0 1 1 1 0
0 1 2 1 0 2 1
1 0 2 1 0 0 0
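If you would rather be told when a dictionary key has no matching column, Pandas' .rename also accepts an `errors` argument. The sketch below uses a small made-up DataFrame and dictionary (`toy_sub` and `toy_names` are illustrative names) to contrast the default behavior with `errors="raise"`:

```python
import pandas as pd

# Toy subset with two columns; the dictionary also has a key ("p") that is absent
toy_sub = pd.DataFrame({"a": [1, 2], "b": [0, 1]})
toy_names = {"a": "apple", "b": "blueberry", "p": "papaya"}

# Default (errors="ignore"): the unmatched key "p" is skipped silently
lenient = toy_sub.rename(columns=toy_names)
print(list(lenient.columns))

# errors="raise": an unmatched key raises a KeyError instead
try:
    toy_sub.rename(columns=toy_names, errors="raise")
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

Which behavior is preferable depends on the pipeline: ignoring is convenient for subsets, while raising can surface a stale codebook early.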
Renaming columns in a Polars DataFrame
Much like when using Pandas, the easiest way to rename columns in a Polars DataFrame is to use the .rename method. To replace some or all of the column names, you can use a dictionary with old column names as keys and new column names as values and then pass this dictionary to the mapping parameter.
Let's walk through it:
First, load the codebook and convert it into a dictionary, where old names are keys and new names are values:
### import library
import polars as pl
### Set up
## load codebook
fruit_names_cb = pl.read_csv("fruit_codebook.csv")
# create dictionary (for replacing old names with new ones)
fruit_names_dict = dict(fruit_names_cb.select(['old_name','new_name']).iter_rows())
Here is what the dictionary looks like:
{'i_d': 'person_id', 'a': 'apple', 'b': 'blueberry', 'c': 'cranberry', 'd': 'dates', 'e': 'elderberry', 'f': 'figs', 'g': 'grapes', 'h': 'honeydew', 'i': 'indian gooseberry', 'j': 'jackfruit', 'k': 'kiwi', 'l': 'lemon', 'm': 'mango', 'n': 'nectarine', 'o': 'orange', 'p': 'papaya', 'q': 'quince', 'r': 'raspberries', 's': 'strawberries', 't': 'tangerine'}
The next step is to load the data set:
## load data set
fruit_data = pl.read_csv("fruit_data.csv")
Like before, you could rename columns one by one (note: only partial output is shown):
### Example 1
## rename columns one by one
fruit_data.rename(mapping = {"p":"papaya"})
shape: (1_064, 21)
┌──────┬────────┬─────┬─────┬───┬─────┬─────┬─────┬─────┐
│ i_d ┆ papaya ┆ i ┆ j ┆ … ┆ h ┆ r ┆ n ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪════════╪═════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 0 ┆ 0 ┆ … ┆ 2 ┆ 0 ┆ 2 ┆ 2 │
│ 2 ┆ 1 ┆ 1 ┆ 2 ┆ … ┆ 0 ┆ 2 ┆ 1 ┆ 2 │
│ 3 ┆ 0 ┆ 1 ┆ 1 ┆ … ┆ 0 ┆ 2 ┆ 2 ┆ 1 │
│ 4 ┆ 2 ┆ 1 ┆ 1 ┆ … ┆ 1 ┆ 1 ┆ 0 ┆ 0 │
│ 5 ┆ 1 ┆ 2 ┆ 0 ┆ … ┆ 0 ┆ 2 ┆ 0 ┆ 1 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 1060 ┆ 2 ┆ 2 ┆ 2 ┆ … ┆ 0 ┆ 1 ┆ 0 ┆ 2 │
│ 1061 ┆ 2 ┆ 1 ┆ 1 ┆ … ┆ 2 ┆ 1 ┆ 1 ┆ 0 │
│ 1062 ┆ 1 ┆ 1 ┆ 2 ┆ … ┆ 0 ┆ 1 ┆ 2 ┆ 1 │
│ 1063 ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 2 ┆ 0 ┆ 2 ┆ 0 │
│ 1064 ┆ 2 ┆ 1 ┆ 1 ┆ … ┆ 2 ┆ 1 ┆ 0 ┆ 0 │
└──────┴────────┴─────┴─────┴───┴─────┴─────┴─────┴─────┘
Then again, why not use our dictionary and rename all columns in one go (note: only partial output is shown):
### Example 2
## rename all columns using the dictionary
fruit_data.rename(mapping = fruit_names_dict)
shape: (1_064, 4)
┌───────────┬────────┬───────────────────┬───────────┐
│ person_id ┆ papaya ┆ indian gooseberry ┆ jackfruit │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪════════╪═══════════════════╪═══════════╡
│ 1 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1 ┆ 1 ┆ 2 │
│ 3 ┆ 0 ┆ 1 ┆ 1 │
│ 4 ┆ 2 ┆ 1 ┆ 1 │
│ 5 ┆ 1 ┆ 2 ┆ 0 │
│ … ┆ … ┆ … ┆ … │
│ 1060 ┆ 2 ┆ 2 ┆ 2 │
│ 1061 ┆ 2 ┆ 1 ┆ 1 │
│ 1062 ┆ 1 ┆ 1 ┆ 2 │
│ 1063 ┆ 0 ┆ 0 ┆ 0 │
│ 1064 ┆ 2 ┆ 1 ┆ 1 │
└───────────┴────────┴───────────────────┴───────────┘
But what if you took a subset of the fruit_data data set:
## try to rename a select few columns using the larger dictionary
# pull a subset of columns
fruit_data_sub = fruit_data.select(pl.nth(range(11, 20)))
shape: (1_064, 9)
┌─────┬─────┬─────┬─────┬───┬─────┬─────┬─────┬─────┐
│ c ┆ a ┆ o ┆ b ┆ … ┆ f ┆ h ┆ r ┆ n │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 0 ┆ 2 ┆ 1 ┆ 0 ┆ … ┆ 1 ┆ 2 ┆ 0 ┆ 2 │
│ 0 ┆ 2 ┆ 0 ┆ 2 ┆ … ┆ 2 ┆ 0 ┆ 2 ┆ 1 │
│ 2 ┆ 2 ┆ 0 ┆ 1 ┆ … ┆ 1 ┆ 0 ┆ 2 ┆ 2 │
│ 0 ┆ 1 ┆ 2 ┆ 1 ┆ … ┆ 2 ┆ 1 ┆ 1 ┆ 0 │
│ 1 ┆ 0 ┆ 2 ┆ 1 ┆ … ┆ 0 ┆ 0 ┆ 2 ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 1 ┆ 0 ┆ 1 ┆ 1 ┆ … ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
│ 2 ┆ 1 ┆ 0 ┆ 1 ┆ … ┆ 0 ┆ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 1 ┆ 0 ┆ 2 ┆ … ┆ 2 ┆ 0 ┆ 1 ┆ 2 │
│ 2 ┆ 1 ┆ 2 ┆ 0 ┆ … ┆ 1 ┆ 2 ┆ 0 ┆ 2 │
│ 0 ┆ 2 ┆ 2 ┆ 1 ┆ … ┆ 0 ┆ 2 ┆ 1 ┆ 0 │
└─────┴─────┴─────┴─────┴───┴─────┴─────┴─────┴─────┘
and tried using the dictionary to rename columns in that new DataFrame?
Well, you may get an error that looks a little something like this:
Schema at this point: Schema: name: c, field: Int64 name: a, field: Int64 name: o, field: Int64 name: b, field: Int64 name: q, field: Int64 name: f, field: Int64 name: h, field: Int64 name: r, field: Int64 name: n, field: Int64 Resolved plan until failure: ---> FAILED HERE RESOLVING THIS_NODE <--- DF ["c", "a", "o", "b"]; PROJECT */9 COLUMNS; SELECTION: None
If you get an error when trying to rename columns or the strict parameter is not available to you when you attempt to use Polars' rename method, don't panic. You have several options for solving this problem. Here are my top three:
- Download and install the latest (stable) version of Polars (per the latest documentation, you can get Polars' rename method to ignore column names that exist in the dictionary but do not exist in the DataFrame by setting strict to False);
- Use a little dictionary comprehension magic to populate a new dictionary; or
- Use a regular for loop to populate a dictionary to rename columns.
You are good to go if you decide to download and install the latest version of Polars. All you need to do after that is pass the dictionary to the mapping parameter and set the strict parameter to False like so (note: only partial output is shown):
fruit_data_sub.rename(mapping = fruit_names_dict, strict = False)
┌───────────┬───────┬────────┬───────────┬────────┐
│ cranberry ┆ apple ┆ orange ┆ blueberry ┆ quince │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════╪════════╪═══════════╪════════╡
│ 0 ┆ 2 ┆ 1 ┆ 0 ┆ 0 │
│ 0 ┆ 2 ┆ 0 ┆ 2 ┆ 2 │
│ 2 ┆ 2 ┆ 0 ┆ 1 ┆ 1 │
│ 0 ┆ 1 ┆ 2 ┆ 1 ┆ 0 │
│ 1 ┆ 0 ┆ 2 ┆ 1 ┆ 0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 1 ┆ 0 ┆ 1 ┆ 1 ┆ 2 │
│ 2 ┆ 1 ┆ 0 ┆ 1 ┆ 0 │
│ 2 ┆ 1 ┆ 0 ┆ 2 ┆ 0 │
│ 2 ┆ 1 ┆ 2 ┆ 0 ┆ 2 │
│ 0 ┆ 2 ┆ 2 ┆ 1 ┆ 2 │
└───────────┴───────┴────────┴───────────┴────────┘
But if you are not quite ready to update your library, let me show how to use a dictionary comprehension or for loop with a conditional statement to solve this problem.
Dictionary Comprehensions
A dictionary comprehension is a shorthand way of creating dictionaries in Python. You can almost think of a dictionary comprehension as a special kind of for loop. Although they are super handy, they can be a little difficult to read and understand if you are not comfortable with for loops.
For this example, we will take our data subset (fruit_data_sub) and our larger dictionary (fruit_names_dict) containing all old and new names and construct a dictionary comprehension. The new dictionary will only contain key-value pairs for columns contained within fruit_data_sub:
{name: fruit_names_dict.get(name) for name in list(fruit_names_dict) if name in fruit_data_sub.columns}
{'a': 'apple', 'b': 'blueberry', 'c': 'cranberry', 'f': 'figs', 'h': 'honeydew', 'n': 'nectarine', 'o': 'orange', 'q': 'quince', 'r': 'raspberries'}
Dictionary comprehensions can look a little weird, being one line and all, so let's unravel it.
For loop
We can rewrite our dictionary comprehension as a regular old for loop combined with a conditional statement:
sample_dict = {}
for name in list(fruit_names_dict):
    if name in fruit_data_sub.columns:
        sample_dict[name] = fruit_names_dict.get(name)
Essentially, all we are asking Python to do is create a new dictionary (sample_dict) that is a subset of our fruit_names_dict dictionary where keys match columns that exist in the fruit_data_sub DataFrame.
That's it.
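To see that the two forms really do the same job, here is a small library-free sketch. It swaps in `.items()` to loop over keys and values together (a stylistic variation, not the code used elsewhere in this post) and uses a plain list, `toy_columns`, as a stand-in for a DataFrame's column names; all names in it are illustrative.

```python
# Toy stand-ins: a small rename dictionary and a subset's column names
toy_names = {"a": "apple", "b": "blueberry", "c": "cranberry", "p": "papaya"}
toy_columns = ["c", "a"]

# Dictionary comprehension: keep only pairs whose key is an existing column
from_comprehension = {
    old: new for old, new in toy_names.items() if old in toy_columns
}

# The same logic written as a for loop with a conditional statement
from_loop = {}
for old, new in toy_names.items():
    if old in toy_columns:
        from_loop[old] = new

print(from_comprehension)
print(from_comprehension == from_loop)
```

Both versions keep only the pairs whose key appears in the column list, so either can be passed on to a rename call.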
So, how can you use these solutions? Well, you have a lot of options. Here are three:
Solution 1: Use the dictionary comprehension as one very long one liner.
fruit_data_sub.rename(mapping = {name: fruit_names_dict.get(name) for name in list(fruit_names_dict) if name in fruit_data_sub.columns})
Solution 2: Create a dictionary with a regular for loop, and use the resulting dictionary to rename columns.
# for loop unraveled
sample_dict = {}
for name in list(fruit_names_dict):
    if name in fruit_data_sub.columns:
        sample_dict[name] = fruit_names_dict.get(name)
# using the outcome of the for loop
fruit_data_sub.rename(mapping = sample_dict)
Solution 3: Wrap the first option in a function.
# define function to create a new dictionary
def create_dict(polars_df: pl.DataFrame, name_dict: dict) -> dict:
    """Create a new dictionary from a larger one.
    Keyword arguments:
    polars_df -- a Polars DataFrame
    name_dict -- a dictionary containing key (old column name)
                 value (new column name) pairs.
    """
    return {name: name_dict.get(name) for name in list(name_dict) if name in polars_df.columns}
# test it
create_dict(polars_df = fruit_data_sub, name_dict = fruit_names_dict)
# use it to rename DataFrame columns
fruit_data_sub.rename(mapping = create_dict(polars_df = fruit_data_sub,
                                            name_dict = fruit_names_dict))
And there you have it.
There are so many ways to rename columns in Python.
How do you rename columns in Python? What libraries do you use?
Share your comments below.
Are you an R user? Check out my post on Renaming Variables in R.