Mind the Mean

The arithmetic mean is a commonly used measure for summarizing data. Simply put, the mean is the sum of the values in a set divided by the number of values in the set.

set of numbers: one through ten
Picture showing how to calculate the mean of the numbers one through ten: (1+2+3+4+5+6+7+8+9+10) / 10 = 5.5
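For readers who like to check the arithmetic, here is a minimal Python sketch of the same calculation. The variable names are my own and are only for illustration.

```python
# Mean of the numbers one through ten:
# the sum of the values divided by the number of values.
values = list(range(1, 11))   # 1, 2, ..., 10

total = sum(values)           # sum of the values
count = len(values)           # number of values
mean = total / count

print(f"sum = {total}, n = {count}, mean = {mean}")
# sum = 55, n = 10, mean = 5.5
```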

While the mean can be informative, there are certain situations when the mean is not a good representation of a set of data. In this post, I will share two cases where the mean is not an appropriate measure for summarizing data, along with suggestions for what you could report instead.

Case #1: When your data have extreme values
An extreme value is a value in a set of data that is either very large or very small in comparison to the rest (or majority) of the data. Extreme values in a set of data can significantly influence the mean and can give you a false impression of the data. Let’s look at an example:

The data below represent the number of miles traveled to and from school each day by a sample of 10 students. Most students reported traveling between 2 and 9 miles a day. One student, however, lives in a different city, so their daily commute to and from school is 50 miles!

Picture illustrating an extreme value of 50 in a set of data where most numbers are less than 10. The set of data includes the numbers 4, 8, 3, 2, 5, 9, 7, 5, 2, 50. These numbers represent the number of miles traveled to and from school each day by a sample of…

The results were analyzed, and the mean was obtained.

Picture showing the mean of the following set of numbers: 4, 8, 3, 2, 5, 9, 7, 5, 2, 50, which is equal to 9.5. These numbers represent the number of miles traveled to and from school each day by a sample of 10 students.

Because of the extreme value (50), the mean of the set of data is 9.5 miles, which is larger than 9 of the 10 values in the set! In this case, the mean would not be an appropriate measure to report, because the extreme value pulls it well above what a typical student travels.
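A short Python sketch, using only the standard library, reproduces this result and counts how many students fall below the mean. The names `miles`, `average`, and `below_average` are illustrative choices of mine.

```python
from statistics import mean

# Miles traveled to and from school by the 10 students in the example.
miles = [4, 8, 3, 2, 5, 9, 7, 5, 2, 50]

average = mean(miles)                               # 95 / 10
below_average = [m for m in miles if m < average]   # values smaller than the mean

print(average)               # 9.5
print(len(below_average))    # 9
```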
 
Possible Solution: Use the median
If you suspect you have extreme values in a set of data, first, plot the data to find the extreme value(s). Two plots that are useful for finding extreme values and other anomalies in a data set are scatter plots and box plots. 

Box plot

Box plot of the number of miles students traveled to and from school

Scatter plot

Scatter plot of the number of miles students traveled to and from school
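If you want a numeric check to go with the plots, one common box plot convention flags any point that lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below assumes that convention, which is my choice for illustration and not the only possible one. It also relies on the quartile method that Python's `statistics.quantiles` uses by default, and other software may compute slightly different quartiles.

```python
from statistics import quantiles

miles = [4, 8, 3, 2, 5, 9, 7, 5, 2, 50]

# Quartiles from statistics.quantiles (default "exclusive" method).
q1, q2, q3 = quantiles(miles, n=4)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

extreme_values = [m for m in miles if m < lower_fence or m > upper_fence]
print(extreme_values)   # [50]
```

Under these assumptions, the 50-mile commute is the only value flagged, which matches what the plots show.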

Next, calculate the median. The median is a value that divides a distribution of data in half so that half of all values in a set of data are above it, and half are below it. To find the median of a set of data:

1.  Arrange all the values in the set in ascending numerical order.

Picture of values in set arranged in ascending order.

2.  If there is an odd number of values in the set of data, then the median is equal to the middle value. (Note: the number ‘1’ was added to the set of data for demonstration purposes only.)

Picture of values in set arranged in ascending order (odd number of values).

3.  If there is an even number of values in the set of data, then the median is equal to the average of the two middle values.

Picture of values in set arranged in ascending order (even number of values)
Median of an even set of values: add the two middle values and divide the sum by two
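The three steps above can be sketched in Python as follows. The function name `find_median` is hypothetical and written for illustration. In practice, Python's built-in `statistics.median` does the same job.

```python
def find_median(values):
    """Return the median by following the three steps above."""
    ordered = sorted(values)          # Step 1: arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty set is undefined")
    middle = n // 2
    if n % 2 == 1:                    # Step 2: odd count, take the middle value
        return ordered[middle]
    # Step 3: even count, average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


miles = [4, 8, 3, 2, 5, 9, 7, 5, 2, 50]

print(find_median(miles))         # 5.0 (even count: (5 + 5) / 2)
print(find_median(miles + [1]))   # 5   (odd count: the 6th of 11 sorted values)
```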

The median here is 5 miles, which can be interpreted in the following way: half of the students in the sample travel 5 miles or fewer to and from school each day, and half travel 5 miles or more.
 
When your data contain extreme values, it is more useful to report the median instead of the mean, because the median is less affected by extreme values and will give you a more accurate representation of the data.
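To see how differently the two measures react, the sketch below computes both with and without the 50-mile commute. Dropping the value is done here only to illustrate the comparison. It is not a recommendation to delete extreme values from your data.

```python
from statistics import mean, median

miles = [4, 8, 3, 2, 5, 9, 7, 5, 2, 50]
without_extreme = [m for m in miles if m != 50]   # drop the 50-mile commute

for label, data in [("with 50", miles), ("without 50", without_extreme)]:
    print(f"{label:<11} mean = {mean(data):.1f}, median = {median(data):.1f}")
```

Removing the single extreme value moves the mean from 9.5 to 5.0, while the median stays at 5.0 in both cases.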
 
Case #2: When your data are nominal
A nominal variable is a variable that classifies observations into distinct categories that have no natural order. For example, marital status is a nominal variable that can be classified into five categories: 1 = Single (never married), 2 = Married, 3 = Separated, 4 = Widowed, and 5 = Divorced. These five categories are attributes of the variable marital status and cannot be quantified in a meaningful manner. Even if you rearrange the order in which the categories are listed (e.g., from 1 = Single, 2 = Married, 3 = Separated, 4 = Widowed, 5 = Divorced to 1 = Widowed, 2 = Married, 3 = Separated, 4 = Divorced, 5 = Single), the variable marital status is still interpreted the same way (i.e., as a nominal variable comprising five categories). Calculating the mean is therefore not appropriate, because the numeric values you assign to the categories are labels only and have no quantitative significance.
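To make the relabeling point concrete, here is a small Python sketch. The ten responses are hypothetical values I made up for illustration, and `coding_a` and `coding_b` mirror the two numbering schemes described above.

```python
from statistics import mean

# Hypothetical responses, made up for illustration only.
responses = ["Single", "Married", "Married", "Divorced", "Widowed",
             "Single", "Separated", "Married", "Divorced", "Single"]

# Two equally valid ways to number the same five categories.
coding_a = {"Single": 1, "Married": 2, "Separated": 3, "Widowed": 4, "Divorced": 5}
coding_b = {"Widowed": 1, "Married": 2, "Separated": 3, "Divorced": 4, "Single": 5}

mean_a = mean([coding_a[r] for r in responses])
mean_b = mean([coding_b[r] for r in responses])

print(mean_a)   # 2.6
print(mean_b)   # 3.3
```

The responses are identical in both cases, yet the "mean marital status" changes with the arbitrary numbering, which is why it carries no meaning.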

Let’s look at an example using the variable marital status:

Say you are a researcher at a nonprofit organization, and you are asked to develop a demographics survey to better understand the marital status breakdown of your clients. One question on the survey asks clients to answer the following:

Survey Question: Please indicate your marital status from the options below. Answer choices include: Single; Married; Separated; Widowed; Divorced; Other; or Prefer not to answer.

You assign value labels to each of the categories (e.g., Single = 1, Married = 2, Separated = 3, etc.). The data below represent the responses you received from the survey sample (30 clients).

Example set of data produced from survey question asking about marital status. Values have been assigned to each category (e.g., Single = 1; Married = 2). There are 30 fictitious values.

You analyze the results and obtain the mean, a value of 3.70. You begin writing up your results when a colleague asks you, “How would you interpret a mean marital status of 3.70?” You then realize that it does not make sense to compute the mean for this variable because there is no way to interpret an average of 3.70. Calculating the mean of a nominal variable produces nonsensical results because the categories cannot be quantified in a meaningful way.

Possible Solution(s): Use frequency counts or percentages
The best way to report nominal data is to use frequency counts or percentages. Frequency tables and bar graphs are two popular choices for displaying nominal data.

Frequency Table

Frequency table showing the number and percentage of individuals who selected each marital status category.

Bar Graph

Bar chart showing the percentage of individuals who selected each marital status category.
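As a rough sketch of how such a frequency table could be produced, the Python below counts each category and converts the counts to percentages. The responses are hypothetical and made up for illustration. They are not the 30 survey responses from the example.

```python
from collections import Counter

# Hypothetical responses, made up for illustration only.
responses = ["Single", "Married", "Married", "Divorced", "Widowed",
             "Single", "Separated", "Married", "Divorced", "Single"]

counts = Counter(responses)   # frequency count per category
total = len(responses)

# Percentage of respondents in each category.
percentages = {category: 100 * count / total for category, count in counts.items()}

print(f"{'Category':<10} {'Count':>5} {'Percent':>8}")
for category, count in counts.most_common():
    print(f"{category:<10} {count:>5} {percentages[category]:>7.1f}%")
```

Working from the category names, not from numeric codes, keeps the labels from being treated as quantities by accident.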

What are some other reasons not to use the mean as a measure for summarizing data? Please share your thoughts in the comments section below.
