Setting Missing Values When Importing Data

Image of a completed puzzle with one missing piece and in the space of the missing piece are the words 'missing values'.

**Download the R syntax and data used in this post **

Most real-world data sets contain missing values. The trick is to make sure whatever software or programming environment you use for data analysis knows which values should be set to missing.

Many functions for importing tabular data into R allow you to define values you want to set as missing. (Even those in Base R.) Check out the generic read.table function:

read.table(file, 
	header = FALSE, 
	sep = "", 
	quote = "\"'",
	dec = ".", 
	numerals = c("allow.loss", "warn.loss", "no.loss"),
	row.names, 
	col.names, 
	as.is = !stringsAsFactors,
	na.strings = "NA",
	colClasses = NA, 
	nrows = -1,
	skip = 0, 
	check.names = TRUE, 
	fill = !blank.lines.skip,
	strip.white = FALSE, 
	blank.lines.skip = TRUE,
	comment.char = "#", 
	allowEscapes = FALSE, 
	flush = FALSE, stringsAsFactors = FALSE,
	fileEncoding = "", 
	encoding = "unknown", 
	text, 
	skipNul = FALSE)

The argument na.strings (highlighted above) is where you can specify a character vector of stings you want R to interpret as missing values.

Even newer(ish) functions from the readr package like read_table have a similar argument:

read_table(
  file,
  col_names = TRUE,
  col_types = NULL,
  locale = default_locale(),
  na = "NA",
  skip = 0,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  progress = show_progress(),
  comment = "",
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE)

So, how can you make the best use of these arguments?

 Here's an example:

Say you have a comma-separated values (CSV) data set that has missing values. (You know the data set has missing values because the analyst who curated the data set provided a codebook!) 

According to the codebook, the missing values are labelled:

  • -999

  • <NA>

  • not available

  • missing

  • really missing

Create a character object with these five elements and then use the na.strings argument in read.csv (or the na argument in read_csv) to explicitly set these values as missing. R will interpret these values as missing once the data are imported.

# create na_strings object
na_strings <- c("-999","","not available","missing", "really missing")

# import data using read.csv (Base R)
dat_NA <- missing_df_NA <- read.csv(file = "missing_df.csv", header = TRUE, 
na.strings = na_strings)

# can also import data using read_csv (readr)
# dat_NA <- missing_df_NA <- read_csv(file = "missing_df.csv", header = TRUE, 
# na = na_strings)

And that is it! Check your data set to ensure missing values are appropriately labeled as NA (R's missing value). (For brevity, only the Gender variable is shown below.)

sapply(dat_NA, FUN = function(x) unique(x), simplify = TRUE)
$Gender
[1] "Non-Binary" "Woman"      NA           "Man"       

** Bonus **

Sometimes when I am not sure what values should be set to missing, I start with the most common labels:

  • empty string ( "" ) with a length of zero characters

  • "NA" (NA in quotes. R does not treat NA in quotes the same as its (internal) missing value of NA. So if you are 100% sure that "NA" is equal to missing (or not available), then include it in the list.)

  • NA (no quotes; sometimes I throw in R's missing value for good measure / as a just-in-case.)

Missing values are a fact of data life. So, when you know what values should be set to missing, make your life easier by specifying them (if you can) before importing your data.

 

How do you prefer to deal with missing values? Do you replace missing values after importing data or before? Let me know in the comments.

Previous
Previous

Select Columns of a Specific Type in R

Next
Next

Removing Extra Spaces in a (Text) String