Analytics Made Accessible


Be Careful With (Data) Binning

Before/After slide transformation. Before slide (left) shows a hex tile map of state minimum wages, where deep blue hues correspond to locations where the state minimum wage is greater than the federal minimum wage and (medium) grey hues correspond to all other states. The after slide (right) shows a hex tile map of state minimum wages, where states shaded in white do not have a minimum wage; medium grey hues correspond to states where the minimum wage is less than the federal minimum; azure (blue) shades correspond to states where the minimum wage is equal to the federal minimum; and dark blue hues correspond to states where the minimum wage is greater than the federal minimum.

Binning in data is not about throwing stuff in the trash. Though sometimes it can feel that way. "Binning" is a way of grouping data into distinct categories. Data practitioners often use binning to transform a continuous variable into a categorical one where each group represents a range (or bin) of numeric values.
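To make that concrete, here is a minimal sketch of range-based binning in plain Python. The cut points ($8 and $12) and the labels are made up for illustration; they are not from any official source, and a real analysis would pick edges that mean something for the question at hand.

```python
from bisect import bisect_right

# Hypothetical cut points (in dollars) and one label per resulting bin.
edges = [8.00, 12.00]
labels = ["under $8", "$8 to under $12", "$12 and over"]

def bin_label(value):
    """Return the label of the bin that a numeric value falls into."""
    # bisect_right counts how many edges are <= value,
    # which is exactly the index of the matching bin.
    return labels[bisect_right(edges, value)]

for wage in [7.25, 8.00, 11.99, 15.00]:
    print(wage, "->", bin_label(wage))
```

Note the design choice hiding in there: a value that sits exactly on an edge (like 8.00) lands in the higher bin. Whichever convention you choose, say so on the chart.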

 

An Example

Suppose you are a data analyst at an economic policy consulting firm. Your boss asks you to analyze data on state minimum wages.

Specifically, they want a chart that shows minimum wages in U.S. states. You produce the following hex(agonal) tile map showing the full range of (dollar) values. On the map:

  • States shaded in white do not have a minimum wage, 

  • Lighter grey hues correspond to lower minimum wages, and 

  • Darker grey hues correspond to higher minimum wages.

Image of a slide that shows a hex tile map of state minimum wages, where states shaded in white do not have a minimum wage; lighter grey hues correspond to lower minimum wages; and darker grey hues correspond to higher minimum wages.

You excitedly show your boss your creation. Your boss takes a look and decides they want you to group the data into meaningful categories.

Back to the drawing board. 

According to the Department of Labor, the federal minimum wage is $7.25. So, you decide to use that information and create a binary variable called greaterFedMinWage that equals "1" if the state's minimum wage is greater than the federal minimum wage and "0" otherwise.
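In code, that binary variable might look something like the sketch below. The state names and wages are placeholders, not real data, and representing "no state minimum wage" as None is an assumption made for this example.

```python
FEDERAL_MIN_WAGE = 7.25  # federal minimum wage, in dollars per hour

# Placeholder data: None means the state has no minimum wage of its own.
state_min_wage = {
    "State A": 12.00,
    "State B": 7.25,
    "State C": 5.15,
    "State D": None,
}

def greater_fed_min_wage(wage):
    """Return 1 if the state minimum is above the federal minimum, else 0."""
    return 1 if wage is not None and wage > FEDERAL_MIN_WAGE else 0

greaterFedMinWage = {
    state: greater_fed_min_wage(wage) for state, wage in state_min_wage.items()
}
print(greaterFedMinWage)
```

Only State A gets a 1. States B, C, and D all get a 0, even though they are in three very different situations.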

You (re)produce your tile map using the new greaterFedMinWage variable. On the map:

  • Deep blue hues correspond to locations where the state minimum wage is greater than the federal minimum wage.

  • (Medium) Grey hues correspond to all other states. 

Image of a slide that shows a hex tile map of state minimum wages, where deep blue hues correspond to locations where the state minimum wage is greater than the federal minimum wage and (medium) grey hues correspond to all other states.

Your boss takes another look and asks, "So, all other states have a minimum wage that is less than the federal minimum?"

You respond, "No, a few states do not have a minimum wage. And two states have a minimum wage that is less than the federal minimum."

Your boss looks at you puzzled, then asks, "Can you create a variable that shows all that?"

You say, "Sure!"

Back to the drawing board x2.

You create a (new) variable called minWageGrp that equals:

  • "1" if the state does not have a minimum wage.

  • "2" if the state minimum wage is less than the federal minimum.

  • "3" if the state minimum wage is equal to the federal minimum.

  • "4" if the state minimum wage is greater than the federal minimum.
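The four rules above could be sketched in Python like this. As before, the state names and wages are placeholders rather than real data, and None standing in for "no state minimum wage" is an assumption of the example.

```python
FEDERAL_MIN_WAGE = 7.25  # federal minimum wage, in dollars per hour

def min_wage_grp(wage):
    """Map a state minimum wage (or None) to a group code from 1 to 4."""
    if wage is None:
        return 1  # no state minimum wage
    if wage < FEDERAL_MIN_WAGE:
        return 2  # less than the federal minimum
    if wage == FEDERAL_MIN_WAGE:
        return 3  # equal to the federal minimum
    return 4      # greater than the federal minimum

# Placeholder data: None means the state has no minimum wage of its own.
state_min_wage = {
    "State A": 12.00,
    "State B": 7.25,
    "State C": 5.15,
    "State D": None,
}

minWageGrp = {state: min_wage_grp(wage) for state, wage in state_min_wage.items()}
print(minWageGrp)
```

The missing-value check comes first on purpose, so a state with no minimum wage never gets compared to a number. The exact equality test is fine for a value typed in as 7.25; if the wages were the result of arithmetic, comparing whole cents would be the safer choice.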

Image of a slide that shows a hex tile map of state minimum wages, where states shaded in white do not have a minimum wage; medium grey hues correspond to states where the minimum wage is less than the federal minimum; azure (blue) shades correspond to states where the minimum wage is equal to the federal minimum; and dark blue hues correspond to states where the minimum wage is greater than the federal minimum.

Your boss LOVES it.

It's simple, but the categories are meaningful, and BONUS, the chart is easy to interpret.

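Grouping data into bins or categories can make it easier to analyze AND visualize. But binning can hide trends in your data or lead to deceptive data (re)presentations. In the first binned map, for example, the grey "all other states" group lumped together states with no minimum wage, states below the federal minimum, and states equal to it.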

So, folks, bin responsibly.