Outliers

Topic summary

Outliers are data points that lie significantly outside the range of most other values in a dataset. Identifying outliers is important in statistics, as they can affect measures like the mean and standard deviation. There are two common methods to find outliers: using quartiles or using the mean and standard deviation.

1. Finding Outliers Using Quartiles:

Outliers can be identified using the interquartile range (IQR). The IQR is the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)), so it measures the spread of the middle 50% of the data.

Steps to Identify Outliers Using Quartiles:

  1. Find \(Q_1\) and \(Q_3\): Arrange the data in ascending order and calculate the lower quartile (\(Q_1\)) and upper quartile (\(Q_3\)).
  2. Calculate the IQR: Subtract \(Q_1\) from \(Q_3\): \[ \text{IQR} = Q_3 - Q_1 \]
  3. Determine the outlier thresholds: Use the following formulas to find the lower and upper thresholds:
    • Lower threshold: \(Q_1 - 1.5 \times \text{IQR}\)
    • Upper threshold: \(Q_3 + 1.5 \times \text{IQR}\)
  4. Identify outliers: Any data point below the lower threshold or above the upper threshold is an outlier.

Example: A dataset contains the following values: \(2, 4, 5, 7, 9, 12, 14, 18, 22\).

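  • Step 1: Find \(Q_1\) and \(Q_3\). The median is \(9\); here it is counted in both halves of the data (other quartile conventions can give slightly different values):
    • \(Q_1 = 5\) (lower quartile: the median of \(2, 4, 5, 7, 9\))
    • \(Q_3 = 14\) (upper quartile: the median of \(9, 12, 14, 18, 22\))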
  • Step 2: Calculate the IQR: \(Q_3 - Q_1 = 14 - 5 = 9\).
  • Step 3: Find the thresholds:
    • Lower threshold: \(5 - 1.5 \times 9 = -8.5\).
    • Upper threshold: \(14 + 1.5 \times 9 = 27.5\).
  • Step 4: Identify outliers: Any data point below \(-8.5\) or above \(27.5\) is an outlier. In this case, there are no outliers.
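The four steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers` and the parameter `k` are hypothetical, and the quartile convention is an assumption. It uses the standard library's `statistics.quantiles` with `method="inclusive"`, which reproduces \(Q_1 = 5\) and \(Q_3 = 14\) for this dataset.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the k * IQR rule.

    Assumption: quartiles follow the "inclusive" convention of
    statistics.quantiles; other conventions may shift the thresholds.
    """
    # Step 1: sort the data and find Q1 and Q3 (the middle cut point is the median).
    q1, _, q3 = quantiles(sorted(values), n=4, method="inclusive")
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: outlier thresholds.
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Step 4: anything strictly outside the thresholds is an outlier.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [2, 4, 5, 7, 9, 12, 14, 18, 22]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)  # -8.5 27.5 []
```

With the default `method="exclusive"`, the same call would give \(Q_1 = 4.5\) and \(Q_3 = 16\), so it is worth checking which quartile convention a course or exam board expects.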

2. Finding Outliers Using Mean and Standard Deviation:

Outliers can also be identified by comparing data points to the mean and standard deviation of the dataset. A common rule is that any data point more than 2 or 3 standard deviations away from the mean is considered an outlier.

Steps to Identify Outliers Using Mean and Standard Deviation:

  1. Find the mean (\(\mu\)) and standard deviation (\(\sigma\)): Calculate the average and standard deviation of the dataset.
  2. Determine the thresholds: Use the following formulas to find the lower and upper thresholds:
    • Lower threshold: \(\mu - 2\sigma\) (or \(\mu - 3\sigma\))
    • Upper threshold: \(\mu + 2\sigma\) (or \(\mu + 3\sigma\))
  3. Identify outliers: Any data point outside the thresholds is an outlier.

Example: A dataset contains the following values: \(10, 12, 15, 18, 20, 25, 50\).

  • Step 1: Find the mean and standard deviation:
    • \(\mu = \frac{10 + 12 + 15 + 18 + 20 + 25 + 50}{7} = 21.43\).
    • \(\sigma = \sqrt{\frac{1103.71}{7}} \approx 12.56\) (population standard deviation: the sum of squared deviations from the mean, \(1103.71\), divided by \(n = 7\), then square-rooted).
  • Step 2: Determine the thresholds (using \(2\sigma\)):
    • Lower threshold: \(21.43 - 2 \times 12.56 = -3.69\).
    • Upper threshold: \(21.43 + 2 \times 12.56 = 46.55\).
  • Step 3: Identify outliers: Any data point below \(-3.69\) or above \(46.55\) is an outlier. Here, \(50\) is an outlier.
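The same procedure can be sketched in Python. This is an illustrative sketch under stated assumptions: the function name `sigma_outliers` and the parameter `k` are hypothetical, and it uses the population standard deviation (`statistics.pstdev`, dividing by \(n\)) rather than the sample standard deviation (`statistics.stdev`, dividing by \(n - 1\)).

```python
from statistics import fmean, pstdev


def sigma_outliers(values, k=2):
    """Return (mu, sigma, outliers) using the mean +/- k * sigma rule.

    Assumption: sigma is the population standard deviation (divide by n).
    """
    # Step 1: mean and standard deviation.
    mu = fmean(values)
    sigma = pstdev(values)
    # Step 2: thresholds k standard deviations either side of the mean.
    lower = mu - k * sigma
    upper = mu + k * sigma
    # Step 3: anything strictly outside the thresholds is an outlier.
    outliers = [v for v in values if v < lower or v > upper]
    return mu, sigma, outliers


data = [10, 12, 15, 18, 20, 25, 50]
mu, sigma, outliers = sigma_outliers(data)
print(round(mu, 2), outliers)  # 21.43 [50]
```

The choice of `k` matters: with `k=3` the thresholds widen enough that \(50\) is no longer flagged, so the rule being used should always be stated alongside the result.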

3. Summary:

  • Outliers can be found using the IQR method or the mean and standard deviation method.
  • The IQR method uses quartiles and thresholds of \(1.5 \times \text{IQR}\) to identify outliers.
  • The mean and standard deviation method identifies data points outside \(2\sigma\) or \(3\sigma\) from the mean as outliers.
