Analytics Made Accessible


Removing Extra Spaces in a (Text) String

Increasingly, data professionals are expected to know how to wrangle text data and create awe-inspiring visualizations. And like other forms of data, text must be cleaned before it can be mined for insights.

Removing unwanted spaces is one of the most common tasks performed when cleaning text data. Many functions for importing data into R have an argument controlling whether extra spaces are removed from the beginning and end of a string.

Two notable examples are read.table from Base R:

See this content in the original post

and read_csv from readr:

See this content in the original post
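As a rough sketch of these two import options: read.table exposes the behavior through its strip.white argument (which only takes effect when sep is specified), and read_csv through trim_ws. The inline sample text and column names are made up for illustration, and the readr call assumes readr 2.0 or later.

```r
# Hypothetical two-row CSV with padding around the text field (illustrative only).
raw <- "id,line\n1,  Mary had a little lamb  \n2,   its fleece was white as snow "

# Base R: strip.white = TRUE trims unquoted character fields; it requires sep.
rhyme_base <- read.table(text = raw, header = TRUE, sep = ",",
                         strip.white = TRUE, stringsAsFactors = FALSE)
rhyme_base$line

# readr (only if installed): trim_ws = TRUE is already the default for read_csv.
if (requireNamespace("readr", quietly = TRUE)) {
  rhyme_readr <- readr::read_csv(I(raw), trim_ws = TRUE, show_col_types = FALSE)
  rhyme_readr$line
}
```

Both calls only trim the ends of each field; spaces between words are left alone.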

So, these days, most folks do not have to do too much work to remove leading or trailing spaces from their data during the import process. But what happens when there are extra spaces between words in your text? Let me show you an example.

Have you ever imported text data, only to find that the string output looks like this:

See this content in the original post

Yes, those are the first four lines of the nursery rhyme "Mary Had a Little Lamb" (Hale, 1830). But each element of the object contains more than a nursery rhyme line. Some lines have extra spaces between words, while others have special symbols scattered throughout. These symbols represent different types of space characters.

Many kinds of space characters can be present in strings. Three common space characters include:

  • horizontal tab ( \t )

  • carriage return ( \r )

  • new line ( \n )
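To make these characters concrete, here is a small made-up vector (not the data from the original post) that contains each of them. Printing the vector displays the escape sequences, and grepl with fixed = TRUE flags which elements contain which character.

```r
# Hypothetical messy lines; the exact contents are invented for illustration.
messy <- c("Mary had\ta little   lamb,",
           "Its fleece was\r\nwhite as snow;")

print(messy)  # escapes are displayed as \t, \r, and \n

# Flag which elements contain a tab, carriage return, or new line.
has_tab <- grepl("\t", messy, fixed = TRUE)
has_cr  <- grepl("\r", messy, fixed = TRUE)
has_nl  <- grepl("\n", messy, fixed = TRUE)
data.frame(has_tab, has_cr, has_nl)
```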

So, how do you remove extra spaces in a string?

There are several ways to manipulate strings or character vectors in R. When it comes to removing extra spaces, most of these methods rely on finding and replacing elements of a character vector that match a pattern. In this post, I will focus on two ways to remove extra spaces in a character vector: 

  • Base R solutions and

  • Using functions from the stringr and stringi packages.

Base R

gsub is a nifty Base R function used to search for and replace elements of a string. The function has three main arguments: (1) a pattern to be matched; (2) a replacement for the matched pattern; and (3) a character vector where matches are sought.
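A minimal illustration of the argument order, using a made-up input string. The call to sub is included only for contrast: it replaces the first match, whereas gsub replaces every match.

```r
x <- "little lamb"  # made-up input

# gsub(pattern, replacement, x): replace every match of pattern in x.
all_matches <- gsub("l", "L", x)
all_matches   # "LittLe Lamb"

# sub() takes the same arguments but stops after the first match.
first_match <- sub("l", "L", x)
first_match   # "Little lamb"
```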

gsub with specified characters

A cursory look at the text data revealed three distinct space characters: tab ( \t ), new line ( \n ), and carriage return ( \r ). One way to remove these characters is to create a user-defined function that extracts and replaces them with an empty string ( "" ).

See this content in the original post
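A sketch of what such a function might look like. The function name and the sample vector are hypothetical, and the sample deliberately includes a form feed ( \f ) to show that any space character missing from the list survives.

```r
# Hypothetical helper: remove tabs, new lines, and carriage returns only.
remove_listed <- function(x) {
  gsub("[\t\n\r]", "", x)
}

# Made-up input containing a tab, a form feed, and a carriage return + new line.
messy <- c("Mary had \ta little\flamb,",
           "Its fleece was \r\nwhite as snow;")

cleaned <- remove_listed(messy)
cleaned

# The form feed was not in the list, so it is still present.
grepl("\f", cleaned[1], fixed = TRUE)
```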

Ok, so clearly, the text data contain more than tabs, new lines, and carriage returns. You could spend hours poring over hundreds, thousands, or millions of sentences to identify all of the unique space characters in your text data…but that sounds a lot like torture.

Now, I want to clarify that gsub is not the problem. How gsub is being used is. Rather than manually identifying and specifying the types of space characters you wish to extract and replace, implement a more efficient solution using predefined regex classes!

Regex is short for regular expression. A regular expression is a character or sequence of characters that describes a search pattern. This should look familiar: in the previous gsub example, escaped t ( \t ), escaped n ( \n ), and escaped r ( \r ) are regex that match the tab, new line, and carriage return characters in the text data so they can be replaced. These backslash sequences are among the simplest match patterns that can be specified using regex. (And yes, that is a fancy way of saying you put a backslash before the letter.)

Character classes can also be specified as match patterns. These "classes" are simply a set of characters. Lucky for us, there are two predefined classes of space characters that we can use to find and replace spaces in text data (e.g., space, tab, and newline character): 

  1. backslash s ( \s ) and

  2. [:space:] (written as [[:space:]] inside a pattern).

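(This post will only focus on \s.) The important thing to remember here is that you cannot type a single backslash s ( \s ) directly into the pattern argument of gsub. In an R string, the backslash is itself an escape character, and \s is not an escape sequence that R recognizes, so the code fails with an error before the pattern ever reaches the regex engine.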

See this content in the original post

Instead, escape the backslash character (that is, type \\s) so that R will understand that you want to find and replace all space characters in your data.

See this content in the original post
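A short sketch of the difference, using a made-up input string. Replacing each match with a single space is an assumption made for illustration. The tryCatch call parses the unescaped version as text so the failure can be shown without stopping the script.

```r
x <- "Mary had\ta little\nlamb"  # made-up input

# In R source, the regex \s is typed as "\\s": the resulting string holds
# two characters, a backslash and an s.
nchar("\\s")

# A lone "\s" is rejected when the code is parsed, before gsub ever runs.
# The single-quoted text below contains gsub("\s", " ", x) with one backslash.
parse_ok <- tryCatch({
  parse(text = 'gsub("\\s", " ", x)')
  TRUE
}, error = function(e) FALSE)
parse_ok

# The escaped form swaps each space character for a plain space.
spaced <- gsub("\\s", " ", x)
spaced
```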

Now, the text still contains extra spaces. You need to use more than the space character to remove all unwanted spaces from a string. For our purposes, we can ensure all unwanted spaces are removed by:

  1. Using the plus sign (a regex metacharacter) and

  2. Wrapping gsub in trimws

The plus sign ( + ) is a regex metacharacter that indicates that the preceding element should be matched one or more times. So, when combined with the space character class, it matches a whole run of consecutive space characters at once, and each run can be replaced in a single step. trimws is a Base R function used for removing extra spaces from the beginning and end of a string. The function has two main arguments: (1) a character vector and (2) a character string specifying whether to remove leading, trailing, or both leading AND trailing spaces (which is the default). When used together (in a user-defined function), they accomplish our task of removing all unwanted spaces in a text:

See this content in the original post
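One way this combination could be written is sketched here. The function name and the sample vector are hypothetical. Each run of space characters is collapsed to a single space, and trimws then strips whatever is left at either end.

```r
# Hypothetical helper: collapse runs of space characters, then trim the ends.
clean_spaces <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Made-up input with leading, trailing, and repeated space characters.
messy <- c("  Mary had\t a little   lamb, ",
           "Its fleece was \r\n white as snow;\n")

cleaned <- clean_spaces(messy)
cleaned
```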

stringr and stringi solutions

There are also several ways to remove extra spaces from strings using functions from the stringr and stringi packages.

stri_trim and str_replace_all

One popular solution is to create a user-defined function with str_replace_all and stri_trim. Much like gsub, str_replace_all (from stringr) can be used to replace matched patterns in a string. The function has three arguments: (1) a character vector; (2) a pattern to look for; and (3) a replacement for the matched pattern. stri_trim (from stringi) plays the role of trimws, removing spaces from the beginning and end of a string. The example removing extra spaces using gsub and trimws can be rewritten as:

See this content in the original post
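A sketch of the same idea with these two functions, assuming both packages are installed. The function name and sample vector are hypothetical. Note that the character vector comes first in str_replace_all, unlike in gsub.

```r
library(stringr)
library(stringi)

# Hypothetical helper: collapse runs of space characters, then trim both ends.
clean_spaces_str <- function(x) {
  stri_trim(str_replace_all(x, "\\s+", " "))
}

# Made-up input with leading, trailing, and repeated space characters.
messy <- c("  Mary had\t a little   lamb, ",
           "Its fleece was \r\n white as snow;\n")

cleaned <- clean_spaces_str(messy)
cleaned
```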

str_squish

The str_squish function from the stringr package is a final, often overlooked option. str_squish is a nifty little function that removes leading and trailing spaces from character strings AND converts multiple spaces or space-like characters BETWEEN elements of a string into a single space. So, our user-defined functions can be collapsed into a single line of code:

See this content in the original post
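For instance, on a made-up vector (not the data from the original post), the call might look like this, assuming stringr is installed:

```r
library(stringr)

# Made-up input with leading, trailing, and repeated space characters.
messy <- c("  Mary had\t a little   lamb, ",
           "Its fleece was \r\n white as snow;\n")

# Trim both ends and collapse interior runs of space characters in one call.
squished <- str_squish(messy)
squished
```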

And there you have it! Removing extra spaces before, after, and within a string does not have to be complicated. The next time you find yourself having to clean text data with unwanted spaces, use one of my methods to streamline the task.

Do you have a preferred method? Let me know in the comments.
