Analytics Made Accessible

View Original

Setting Missing Values When Importing Data

**Download the R syntax and data used in this post **

Most real-world data sets contain missing values. The trick is to make sure whatever software or programming environment you use for data analysis knows which values should be set to missing.

Many functions for importing tabular data into R allow you to define values you want to set as missing. (Even those in Base R.) Check out the generic read.table function:

See this content in the original post

The argument na.strings (highlighted above) is where you can specify a character vector of stings you want R to interpret as missing values.

Even newer(ish) functions from the readr package like read_table have a similar argument:

See this content in the original post

So, how can you make the best use of these arguments?

 Here's an example:

Say you have a comma-separated values (CSV) data set that has missing values. (You know the data set has missing values because the analyst who curated the data set provided a codebook!) 

According to the codebook, the missing values are labelled:

  • -999

  • <NA>

  • not available

  • missing

  • really missing

Create a character object with these five elements and then use the na.strings argument in read.csv (or the na argument in read_csv) to explicitly set these values as missing. R will interpret these values as missing once the data are imported.

See this content in the original post

And that is it! Check your data set to ensure missing values are appropriately labeled as NA (R's missing value). (For brevity, only the Gender variable is shown below.)

See this content in the original post

** Bonus **

Sometimes when I am not sure what values should be set to missing, I start with the most common labels:

  • empty string ( "" ) with a length of zero characters

  • "NA" (NA in quotes. R does not treat NA in quotes the same as its (internal) missing value of NA. So if you are 100% sure that "NA" is equal to missing (or not available), then include it in the list.)

  • NA (no quotes; sometimes I throw in R's missing value for good measure / as a just-in-case.)

Missing values are a fact of data life. So, when you know what values should be set to missing, make your life easier by specifying them (if you can) before importing your data.

 

How do you prefer to deal with missing values? Do you replace missing values after importing data or before? Let me know in the comments.

See this content in the original post