The Flavours of Average

Consider the following dot diagram that displays the heights, measured in cm, of 9 Gen Z humans.

Question 1. What is the average height of these 9 individuals?

You might think that the answer is automatically obtained by adding all the heights and then dividing by 9. We will return to this kind of average later on. The problem is that the word “average” can take on multiple meanings.

One meaning asks: which height do the most people have? Clearly, the height with the largest number of dots is 167 cm.

This means that 167 cm is the height that occurs most often among this set of 9 persons. We call this the mode of the data set.

An element in a data set is called a data point.

Definition 1. The mode of a data set is the value taken on by the largest number of data points.

We could formalise this notion using more technical symbols, but I don’t think that is helpful for us, since we are not terribly interested in any rigorous analysis of the data set.

Example 1. Consider the pie chart below illustrating the favourite music artists in 2025.

Then the most popular artist, namely Drake, is the mode of the data set. I am surprised that Taylor Swift isn’t on this list. Don’t flame me.

Example 2. Consider the dot diagram we started with.

According to the diagram, the mode of the data set is 167 cm. If, however, we add another data point at 162 cm, we get the dot diagram below.

Then both 162 cm and 167 cm are modes of this data set. In this case, we call the data set bimodal.
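As a quick check of Example 2, here is a minimal Python sketch that counts how often each height occurs. The list `heights` is an assumption: it is read off the sorted sequence given later in this post, and the helper name `modes` is just illustrative.

```python
from collections import Counter

# Heights in cm, assumed to match the dot diagram
# (taken from the sorted sequence given later in the post).
heights = [160, 162, 162, 165, 165, 167, 167, 167, 170]


def modes(data):
    """Return every value that occurs the largest number of times, in increasing order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


print(modes(heights))          # a single mode
print(modes(heights + [162]))  # two modes, so the data set is bimodal
```

Returning a list rather than a single number makes the ambiguity explicit: a bimodal data set simply yields two entries.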

This ambiguity could be a problem. We would like the “average” question to have a unique answer.

To do that, we could interpret our data set as balancing on a beam with a pivot.

If the pivot is positioned at the 160 cm data point, then every other data point lies on one side of it and the beam would tip over. Likewise with the 170 cm data point.

Question 2. Where would we position the pivot to balance the beam?

Intuitively, the further a data point is positioned from the pivot, the greater its rotating effect (i.e. the moment). Assume each data point has equal “mass”. Then we would like to compute some “balance point” \bar x such that the sum of the signed distances (x - \bar x) over all data points x equals 0.

Denoting the data set by \{x_1,\dots, x_n\}, we require

\displaystyle (x_1 - \bar x) + (x_2 - \bar x) + \cdots + (x_n - \bar x) = 0.

Collecting the data points together,

\displaystyle (x_1 + x_2 + \cdots + x_n) - n\bar x = 0.

Solving for \bar x in terms of the data points,

\displaystyle \bar x = \frac{x_1 + x_2 + \cdots + x_n}{n} \equiv \frac{ \Sigma x }{n},

where the Greek letter \Sigma (read ‘Sigma’) denotes the sum

\Sigma x \equiv x_1 + x_2 + \cdots + x_n.

Definition 2. The mean of a data set \{x_1, \dots, x_n\} is defined by \bar x := (\Sigma x)/n.

Example 3. Consider the dot diagram we started with again.

By evaluating their sum, \Sigma x = 1485. You can compute this result either by manual addition or by using the spreadsheet function SUM(...). Hence, \bar x = 1485/9 = 165\ \text{cm}.
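The arithmetic in Example 3, together with the balance condition that motivated Definition 2, can be checked with a short Python sketch. The list `heights` is an assumption read off the sorted sequence given later in this post.

```python
# Heights in cm, assumed to match the dot diagram.
heights = [160, 162, 162, 165, 165, 167, 167, 167, 170]

total = sum(heights)         # plays the role of the spreadsheet SUM(...)
mean = total / len(heights)  # Definition 2: (sum of x) / n

# Balance condition: the signed distances from the mean add up to zero.
deviations = [x - mean for x in heights]
balance = sum(deviations)

print(total, mean, balance)
```

For this data set every quantity is a whole number, so `balance` comes out as exactly zero. With less tidy data, floating-point rounding may leave a tiny non-zero remainder.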

Example 4. Consider the same dot diagram, but now, a Gen Z human with height 140 cm is included.

By evaluating their sum, \Sigma x = 1485 + 140 = 1625. Hence, \bar x = 1625/10 = 162.5\ \text{cm}.

Clearly, however, the 140 cm human is an exceptional case (i.e. an outlier) among this group of humans. Yet since the mean incorporates every height in the data set, \bar x changed from 165 cm to 162.5 cm. The point of Example 4 is this: the mean is sensitive to outliers (though thankfully, there are various strategies to mitigate this effect).

Question 3. Can we obtain an average that is less sensitive to outliers?

Return to the original data set.

If we arranged the data points in non-decreasing order, we obtain the following non-decreasing sequence:

160 ≤ 162 ≤ 162 ≤ 165 ≤ 165 ≤ 167 ≤ 167 ≤ 167 ≤ 170

The middle (fifth) value of this sequence is 165 cm. In this ordered sense, the average height is 165 cm.

If we did include the outlier data point 140 cm, we get two middle values:

140 ≤ 160 ≤ 162 ≤ 162 ≤ 165 ≤ 165 ≤ 167 ≤ 167 ≤ 167 ≤ 170

In this latter case, the middle-of-the-middle is the simple average:

\displaystyle \frac {165 + 165}{2} = 165,

and the average height is unchanged at 165 cm. We call this value the median height.

Definition 3. The median Q_2 \equiv Q_2(x_1,\dots, x_n) of a sorted data set

x_1 \leq x_2 \leq \cdots \leq x_n

is defined by Q_2 := x_{(n+1)/2} if n is odd, and Q_2 := \frac 12 (x_{n/2} + x_{( n/2 ) + 1}) if n is even. We will explain the Q_2 notation in the next post.

Using our data set, the median height remains unchanged when given the extra data point 140 cm. However, the median height can change; if instead we had an additional data point 190 cm, then we get two new middle values:

160 ≤ 162 ≤ 162 ≤ 165 ≤ 165 ≤ 167 ≤ 167 ≤ 167 ≤ 170 ≤ 190.

In this case, our new median is Q_2 = \frac 12 (165 + 167) = 166\, \text{cm}.
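Definition 3 and the three cases above can be checked with a small Python sketch. The function name `median_q2` is hypothetical, and the list `heights` is assumed to match the sorted sequence given earlier. Note that the definition uses 1-based indices, while Python lists are 0-based.

```python
import statistics


def median_q2(data):
    """Median following Definition 3, which uses 1-based indices on sorted data."""
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # x_{(n+1)/2} in 1-based indexing is xs[(n - 1) // 2] in 0-based indexing.
        return xs[(n - 1) // 2]
    # x_{n/2} and x_{n/2 + 1} in 1-based indexing are xs[n // 2 - 1] and xs[n // 2].
    return (xs[n // 2 - 1] + xs[n // 2]) / 2


heights = [160, 162, 162, 165, 165, 167, 167, 167, 170]

print(median_q2(heights))          # 9 data points: the single middle value
print(median_q2(heights + [140]))  # 10 data points: both middle values are 165
print(median_q2(heights + [190]))  # 10 data points: middle values are 165 and 167

# Cross-check against the standard library.
print(statistics.median(heights + [190]))
```

In practice one would likely call `statistics.median` directly. The hand-written version is here only to mirror the two branches of the definition.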

Even so, we would often prefer the median to the mean for its relative resilience against outliers.
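To see this resilience side by side, the following Python sketch compares the mean and median for the original heights and for the two variants with an extra data point. The list `heights` and the labels in `cases` are assumptions made for illustration.

```python
import statistics

# Heights in cm, assumed to match the dot diagram.
heights = [160, 162, 162, 165, 165, 167, 167, 167, 170]

cases = {
    "original": heights,
    "with 140 cm": heights + [140],
    "with 190 cm": heights + [190],
}

# For each case, record the pair (mean, median).
summary = {
    label: (statistics.mean(data), statistics.median(data))
    for label, data in cases.items()
}

for label, (mean, median) in summary.items():
    print(f"{label}: mean = {mean} cm, median = {median} cm")
```

In both variants the mean moves by 2.5 cm, whereas the median moves by at most 1 cm.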

The mean and median have modifications that allow us to discuss the relative spread of data. We will explore this idea next time.

—Joel Kindiak, 9 Feb 26, 1346H
