4 Data Cleaning Lessons Learned

Image of three keyboard keys: (1) hyphen; (2) plus and equal sign; and (3) clean (instead of the delete key).

Data cleaning is essential to the data life cycle because poor-quality data can lead to inaccurate results and misguided decisions. However, because there are no standardized procedures for dealing with messy data, many people are reluctant to clean their data. So, what should you do? This post shares four lessons I have learned about data cleaning.

  1. Save. Save. Save. Always save a copy (or two) of the original raw dataset and store it in a safe place. Also, save an updated copy of the dataset with a new name after every major edit. That way, if you make a mistake or need to go back to an earlier version of the dataset for any reason, you do not have to start from scratch. 

    Regarding (data) filenames, some people use sequential numbering (e.g., dataset1, dataset2, dataset3) while others use the date the dataset was last modified (e.g., dataset_Jan1_2016, dataset_Jan4_2016, dataset_Jan7_2016).

  2. Document everything. Document every edit you make to a dataset in a codebook, a (well-commented!) syntax file, or a repository with version control. This serves two purposes. First, it will ensure that others can replicate your procedures exactly. Second, should you need to justify your actions in any way (e.g., why you used a particular procedure, deleted a variable, or excluded cases from the final dataset), you will have a record of what you did and why.

  3. Know when to quit. No matter how hard you try to catch every single error, your data will never be 100% clean. Instead, identify your dataset's critical data quality problems and determine what steps you can take to address those problems. Once you have executed your plan, exhale and recite the following (however many times you feel is necessary): 'There is no such thing as perfect data,' and then move on.

  4. Practice makes (almost) perfect. Unfortunately, there is no one 'correct' way to clean a dataset. The best way to learn the 'art' of data cleaning is to practice, practice, practice!
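For readers who clean data with scripts, lessons 1 and 2 can be combined in a few lines. The following is only a minimal Python sketch, not a standard procedure: the function name `save_version`, the log file name `cleaning_log.txt`, and the ISO date stamp (e.g., dataset_2016-01-04) are my own illustrative choices. It copies the dataset to a date-stamped file and appends a one-line note explaining the edit.

```python
import shutil
from datetime import date
from pathlib import Path


def save_version(dataset_path, note, today=None, log_name="cleaning_log.txt"):
    """Copy the dataset to a date-stamped file and record why in a log.

    All names here (save_version, cleaning_log.txt) are hypothetical.
    """
    src = Path(dataset_path)
    stamp = (today or date.today()).isoformat()  # e.g. 2016-01-04
    dest = src.with_name(f"{src.stem}_{stamp}{src.suffix}")

    # Never overwrite an earlier snapshot: add a counter if the name is taken.
    counter = 2
    while dest.exists():
        dest = src.with_name(f"{src.stem}_{stamp}_v{counter}{src.suffix}")
        counter += 1

    shutil.copy2(src, dest)  # copy2 also preserves file timestamps

    # Append one line per snapshot so the log doubles as an edit history.
    with open(src.with_name(log_name), "a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{dest.name}\t{note}\n")
    return dest
```

A call such as `save_version("dataset.csv", "Dropped duplicate respondent IDs")` after each major edit leaves a trail of snapshots plus a plain-text record of what changed and why. Writing the date as year-month-day keeps the files in chronological order when sorted by name.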

What lessons have you learned? Please share your thoughts in the comment section.
