Removing Extra Spaces in a (Text) String

Sep 6

Image of two squares. Square 1 (left) reads Extra Spaces with a large space between the two words extra and spaces. Square 2 (right) reads No Extra Spaces without any spaces between the words.

Increasingly, data professionals are expected to know how to wrangle text data and create awe-inspiring visualizations. And like other forms of data, before text can be mined for insights, it must be cleaned.

Removing unwanted spaces is one of the most common tasks performed when cleaning text data. Many functions for importing data into R have an argument controlling whether extra spaces are removed from the beginning and end of a string.

Two notable examples are read.table from Base R:

## DO NOT RUN ## 
read.table(file, 
  header = FALSE,
  sep = "", 
  quote = "\"'",  
  dec = ".", 
  numerals = c("allow.loss", "warn.loss", "no.loss"),
  row.names, 
  col.names, 
  as.is = !stringsAsFactors,
  na.strings = "NA", 
  colClasses = NA, 
  nrows = -1,
  skip = 0, 
  check.names = TRUE, 
  fill = !blank.lines.skip,
  strip.white = FALSE, 
  blank.lines.skip = TRUE,
  comment.char = "#", 
  allowEscapes = FALSE, 
  flush = FALSE,
  stringsAsFactors = FALSE, 
  fileEncoding = "", 
  encoding = "unknown", 
  text,
  skipNul = FALSE)

and read_csv from readr:

## DO NOT RUN ##
read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)  
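As a small illustration of what these arguments do, the sketch below reads the same comma-separated text twice with read.table, once with strip.white = FALSE and once with strip.white = TRUE. The csv_text object and its contents are made up for this example. Note that read.table only applies strip.white when sep has been specified.

```r
# Hypothetical data: one record padded with spaces around each field
csv_text <- "name,animal\n  Mary  ,  lamb  \n"

# Default behaviour: the padding is kept
raw <- read.table(text = csv_text, header = TRUE, sep = ",",
                  strip.white = FALSE, stringsAsFactors = FALSE)

# strip.white = TRUE trims leading and trailing spaces from unquoted fields
trimmed <- read.table(text = csv_text, header = TRUE, sep = ",",
                      strip.white = TRUE, stringsAsFactors = FALSE)

nchar(raw$name)      # counts the padding as characters
nchar(trimmed$name)  # counts only the letters in "Mary"
```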

So, these days, most folks do not have to do too much work to remove leading or trailing spaces from their data during the import process. But what happens when there are extra spaces between words in your text? Let me show you an example.

Have you ever imported text data, only for the string output to look like this?

dat_trim <- readLines("Mary's Lamb.txt", warn = FALSE)
dat_trim

[1] "  Mary had a \tlittle\t\t   lamb.\t"         
[2] " \tIts fleece    was      white as snow."    
[3] "And   \t\t everywhere     the \tchild  went."
[4] "\tThe  little \tlamb was     sure to     go."

Yes, those are the first four lines of the nursery rhyme "Mary Had a Little Lamb" (Hale, 1830). But each element of the object contains more than a nursery rhyme line. Some lines have extra spaces between words, while others have special symbols scattered throughout. These symbols represent different types of space characters.

Many kinds of space characters can be present in strings. Three common space characters include:

horizontal tab ( \t )
carriage return ( \r )
new line ( \n )
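To make these characters easier to spot, here is a minimal sketch using a made-up vector (the name samples is an assumption and is not part of the nursery rhyme data). Each escape sequence is stored as a single character, and grepl with fixed = TRUE can flag which strings contain it.

```r
samples <- c("tab\there", "return\rhere", "newline\nhere", "plain text")

# Each escape sequence counts as one character
nchar(c("\t", "\r", "\n"))

# Which strings contain each kind of space character?
has_tab     <- grepl("\t", samples, fixed = TRUE)
has_return  <- grepl("\r", samples, fixed = TRUE)
has_newline <- grepl("\n", samples, fixed = TRUE)

data.frame(has_tab, has_return, has_newline)
```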

So, how do you remove extra spaces in a string?

There are several ways to manipulate strings or character vectors in R. When it comes to removing extra spaces, most of these methods rely on finding and replacing elements of a character vector that match a pattern. In this post, I will focus on two ways to remove extra spaces in a character vector:

Base R solutions and
Using functions from the stringr and stringi packages.

Base R

gsub is a nifty Base R function used to search for and replace elements of a string. The function has three main arguments: (1) a pattern to be matched; (2) a replacement for the matched pattern; and (3) a character vector where matches are sought.
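Before turning to spaces, here is a quick sketch of those three arguments on ordinary words. The rhyme vector is made up for illustration. The second part contrasts gsub with its sibling sub, which replaces only the first match in each element.

```r
rhyme <- c("Mary had a little lamb.", "The little lamb was sure to go.")

# Replace every match of the pattern in each element of the vector
swapped <- gsub(pattern = "little", replacement = "big", x = rhyme)
swapped

# sub() replaces only the first match per element; gsub() replaces all of them
first_only <- sub(pattern = "l", replacement = "L", x = "little lamb")
all_matches <- gsub(pattern = "l", replacement = "L", x = "little lamb")
c(first_only, all_matches)
```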

gsub with specified characters

A cursory look at the text data revealed three distinct space characters: tab ( \t ), new line ( \n ), and carriage return ( \r ). One way to remove these characters is to create a user-defined function that extracts and replaces them with an empty string ( "" ).

trim_space_gsub <- function(x){
	x <- gsub(pattern =  "\t", replacement = "", x = x)
	x <- gsub(pattern =  "\n", replacement = "", x = x)
	x <- gsub(pattern =  "\r", replacement = "", x = x)
	x  # return the cleaned vector visibly
}

dat_trim1 <- trim_space_gsub(dat_trim)
dat_trim1

[1] "  Mary had a little   lamb."        
[2] " Its fleece    was      white as snow."  
[3] "And    everywhere     the child  went."  
[4] "The  little lamb was     sure to     go."

Ok, so clearly, the text data contain more than tabs, new lines, and carriage returns. You could spend hours poring over hundreds, thousands, or millions of sentences to identify all of the unique space characters in your text data…but that sounds a lot like torture.
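One way to avoid reading through the text by eye is to let R list the space characters that are actually present. The sketch below is only an illustration: sample_text is a made-up stand-in for your own data, and it relies on the [:space:] class that is introduced later in this post. It reports the code point of each distinct space character it finds (32 is a regular space, 9 is a tab).

```r
# Hypothetical sample; swap in your own character vector
sample_text <- c("  Mary had a \tlittle\t\t   lamb.\t",
                 "And   \t\t everywhere     the \tchild  went.")

# Pull out every character that belongs to the [:space:] class
matches <- regmatches(sample_text, gregexpr("[[:space:]]", sample_text))

# Keep one copy of each character and convert it to its code point
found <- unique(unlist(matches))
code_points <- unname(vapply(found, utf8ToInt, integer(1)))
code_points
```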

Now, I want to clarify that gsub is not the problem. How gsub is being used is. Rather than manually identifying and specifying the types of space characters you wish to extract and replace, implement a more efficient solution using predefined regex classes!

Regex is short for regular expression. A regular expression is a character or sequence of characters that describes a search pattern. This should look familiar because, in the previous gsub example, escaped t ( \t ), escaped n ( \n ), and escaped r ( \r ) are regular expressions that match the tab, new line, and carriage return characters in the text data. These backslash sequences are among the simplest match patterns that can be specified using regex. (And yes, that is a fancy way of saying you put a backslash before the letter.)

Character classes can also be specified as match patterns. These "classes" are simply sets of characters. Lucky for us, there are two predefined classes of space characters (covering, e.g., the space, tab, and newline characters) that we can use to find and replace spaces in text data:

backslash s ( \s ) and
[:space:] (written inside a bracket expression, as [[:space:]]).

(This post will only focus on \s.) The important thing to remember here is that you cannot type a single backslash s ( \s ) into the pattern argument of gsub. In an R string, the backslash is itself an escape character, and \s is not an escape sequence that R recognizes, so the code fails before gsub ever sees the pattern.

## DO NOT DO THIS
gsub(pattern =  "\s", replacement = " ", dat_trim)

Error: '\s' is an unrecognized escape in character string starting ""\s"

Instead, escape the backslash character so that R will understand that you want to find and replace all space characters in your data.

## DO THIS INSTEAD
gsub(pattern =  "\\s", replacement = " ", dat_trim)

[1] "  Mary had a  little     lamb. "
[2] "  Its fleece    was      white as snow."
[3] "And      everywhere     the  child  went."
[4] " The  little  lamb was     sure to     go."
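To see why the doubled backslash works, the following sketch (using a made-up string, x) checks what R actually stores for "\\s" and confirms that the other predefined class, written as [[:space:]], gives the same result on this input.

```r
x <- "a\tb c\nd"   # made-up string with a tab, a space, and a new line

# R stores "\\s" as two characters: a backslash followed by s
nchar("\\s")

# Both predefined classes match the tab, the space, and the new line
with_s     <- gsub(pattern = "\\s", replacement = "_", x = x)
with_posix <- gsub(pattern = "[[:space:]]", replacement = "_", x = x)

with_s
identical(with_s, with_posix)
```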

Now, the text still contains extra spaces because every space character, tabs included, was swapped for a single space instead of being removed. You need more than \s on its own to remove all unwanted spaces from a string. For our purposes, we can ensure all unwanted spaces are removed by:

Using the plus sign (a regex metacharacter) and
Wrapping gsub in trimws

The plus sign ( + ) is a regex metacharacter that indicates that the preceding character or class should be matched one or more times. So, when combined with \s, it matches each run of consecutive space characters as a single unit, and that whole run can be replaced with one space. trimws is a Base R function used for removing extra spaces from the beginning and end of a string. The function has two main arguments: (1) a character vector and (2) a character string specifying whether to remove leading, trailing, or both leading AND trailing spaces (which is the default). When used together (in a user-defined function), they accomplish our task of removing all unwanted spaces in a text:

trim_ws_gsub <- function(x){
	trimws(gsub(pattern =  "\\s+", replacement = " ", x))
}

trim_ws_gsub(dat_trim)

[1] "Mary had a little lamb."        
[2] "Its fleece was white as snow."  
[3] "And everywhere the child went." 
[4] "The little lamb was sure to go."
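To see what each piece contributes, here is a small sketch on a made-up string (messy is an assumption for illustration). It applies the pattern without the plus sign, then with it, and finally adds trimws.

```r
messy <- " \tlittle \t  lamb  "

# Without +, every space character becomes its own space
step1 <- gsub(pattern = "\\s", replacement = " ", x = messy)

# With +, each run of space characters collapses to one space
step2 <- gsub(pattern = "\\s+", replacement = " ", x = messy)

# trimws then drops the single leading and trailing space
step3 <- trimws(step2)

# The string gets shorter at each step
nchar(c(step1, step2, step3))
step3
```

Comparing the lengths makes the division of labor clear: the plus sign handles the spaces between words, and trimws handles the ends.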

stringr and stringi solutions

There are also several ways to remove extra spaces from strings using functions from the stringr and stringi packages.

stri_trim and str_replace_all

One popular solution is to create a user-defined function with str_replace_all (from stringr) and stri_trim (from stringi). Much like gsub, str_replace_all can be used to replace matched patterns in a string. The function has three arguments: (1) a character vector; (2) a pattern to look for; and (3) a replacement for the matched pattern. stri_trim plays the role of trimws: by default, it removes spaces from both the beginning and end of a string. The example removing extra spaces using gsub and trimws can be rewritten as:

library(stringr)  # provides str_replace_all
library(stringi)  # provides stri_trim

strim_ws <- function(x){
	stri_trim(str_replace_all(x, pattern =  "\\s+", replacement = " "))
}

strim_ws(dat_trim)

[1] "Mary had a little lamb."        
[2] "Its fleece was white as snow."  
[3] "And everywhere the child went." 
[4] "The little lamb was sure to go."

str_squish

The str_squish function from the stringr package is a final yet underused choice. str_squish is a nifty little function that removes leading and trailing spaces from character strings AND converts multiple spaces or space-like characters BETWEEN elements of a string into a single space. So, our user-defined functions can be collapsed into a single line of code:

str_squish(dat_trim)

[1] "Mary had a little lamb."        
[2] "Its fleece was white as snow."  
[3] "And everywhere the child went."
[4] "The little lamb was sure to go."
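If you want to confirm that the approaches agree on your own data, a quick check along these lines can help. This is a minimal sketch: rhyme_lines is a made-up stand-in for dat_trim, and the stringr comparison only runs if the package is installed.

```r
rhyme_lines <- c("  Mary had a \tlittle\t\t   lamb.\t",
                 " \tIts fleece    was      white as snow.")

# Base R approach from earlier in the post
base_result <- trimws(gsub(pattern = "\\s+", replacement = " ", x = rhyme_lines))
base_result

# Compare against str_squish when stringr is available
if (requireNamespace("stringr", quietly = TRUE)) {
  squish_result <- stringr::str_squish(rhyme_lines)
  print(identical(base_result, squish_result))
}
```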

And there you have it! Removing extra spaces before, after, and within a string does not have to be complicated. The next time you find yourself having to clean text data with unwanted spaces, use one of my methods to streamline the task.

Do you have a preferred method? Let me know in the comments.

References

Hale, S. J. (1830). Poems For Our Children: Including "Mary Had a Little Lamb." Boston, MA: Marsh, Capen, & Lyon. Available from https://archive.org/details/poemsforourchild00haleiala/page/n1/mode/2up.

Ama Nyame-Mensah https://www.anyamemensah.com
