## required packages/modules
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib import rcParams
import matplotlib.patches as patches
import matplotlib.patheffects as path_effects
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
import matplotlib.font_manager as fm
from IPython.display import display, HTML

## default font style
rcParams["font.family"] = "serif"

## format output
CSS = """
.output {
  margin-left: 30px;
}
"""

HTML('<style>{}</style>'.format(CSS))

def plot_text_ax(ax, foreground, **kwargs):
    """
    Function to plot text on axes.
    """
    text = ax.text(
        **kwargs
    )
    text.set_path_effects(
        [path_effects.withStroke(linewidth=3, foreground=foreground)]
    )

class FontManager:
    """Utility to load fun fonts from https://fonts.google.com/ for matplotlib.
    Find a nice font at https://fonts.google.com/, and then get its corresponding URL
    from https://github.com/google/fonts/. Lazily downloads the fonts.


    The FontManager is taken from the ridge_map package by Colin Carroll (@colindcarroll).

    Parameters
    ----------
    url : str, default is the url for Roboto-Regular.ttf
        Can really be any .ttf file, but probably looks like
        'https://github.com/google/fonts/blob/master/ofl/cinzel/static/Cinzel-Regular.ttf?raw=true'
        Note make sure the ?raw=true is at the end.

    Examples
    --------
    font_url = 'https://github.com/google/fonts/blob/master/ofl/abel/Abel-Regular.ttf?raw=true'
    fm = FontManager(url=font_url)
    fig, ax = plt.subplots()
    ax.text(0.5, 0.5, "Good content.", fontproperties=fm.prop, size=60)
    """

    def __init__(self,
                 url=('https://github.com/google/fonts/blob/master/'
                      'apache/roboto/static/Roboto-Regular.ttf?raw=true'),
                 style='normal'):
        self.url = url
        with NamedTemporaryFile(delete=False, suffix=".ttf") as temp_file:
            temp_file.write(urlopen(self.url).read())
            self._prop = fm.FontProperties(fname=temp_file.name, style=style)

    @property
    def prop(self):
        """Get matplotlib.font_manager.FontProperties object that sets the custom font."""
        return self._prop

    def __repr__(self):
        return f'{self.__class__.__name__}(font_url={self.url})'
    
fm = FontManager("https://github.com/google/fonts/blob/master/apache/roboto/static/Roboto-Regular.ttf?raw=true")

Introduction

  • A measure of central tendency is a summary statistic that represents the centre value of a dataset.

  • These measures indicate where most values in a dataset fall. In statistics, the three most common measures of central tendency are:

    1. Mean
    2. Median
    3. Mode
  • If we have a dataset of test scores for a particular class, the measures of central tendency can yield information such as the average test-score, the middle test-score, and the most frequently occurring test-score.

    Note: Measures of central tendency do not focus on the span of the dataset or how far values are from the middle numbers. The central tendency of a dataset represents only one characteristic of a dataset i.e. the center value.
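
As a quick illustration of the test-score example above, the sketch below uses Python's standard `statistics` module on a small set of scores. The scores and variable names are invented for illustration and do not come from a real class.

```python
import statistics

# Hypothetical test scores for one class (invented for illustration)
test_scores = [72, 85, 85, 90, 64, 78, 85, 91, 70]

# average test score
average_score = statistics.mean(test_scores)

# middle test score once the scores are ordered
middle_score = statistics.median(test_scores)

# most frequently occurring test score
most_common_score = statistics.mode(test_scores)

print(f"Average score: {average_score}")
print(f"Middle score: {middle_score}")
print(f"Most frequent score: {most_common_score}")
```

Each of the three numbers summarises the centre of the scores in a different way; none of them says anything about how spread out the scores are.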

Mean

Defining The Term

  • The mean is the arithmetic average of a group of observations (or numbers).

  • It is computed by summing all the observations and dividing by the total number of observations.

  • The population mean is represented by the Greek letter mu ($\mu$), and the sample mean is represented by $\overline{x}$.

  • The formulae for computing the population mean and the sample mean are given below:

    • Population Mean: $\mu = \frac{\Sigma x}{N} = \frac{x_{1} + x_{2} + x_{3} + \dots + x_{N}}{N}$

    • Sample Mean: $\overline{x} = \frac{\Sigma x}{n} = \frac{x_{1} + x_{2} + x_{3} + \dots + x_{n}}{n}$

  • Let's break down the formulae:

    • The capital Greek letter sigma ($\Sigma$) is commonly used in mathematics to represent a summation of all the numbers in a grouping.

    • N is the number of observations in the population, and n is the number of observations in the sample.

Example

Example 01: The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows.

| Company | Number of Cars in Service |
|---|---|
| Enterprise | 643,000 |
| Hertz | 327,000 |
| National/Alamo | 233,000 |
| Avis | 204,000 |
| Dollar/Thrifty | 167,000 |
| Budget | 144,000 |
| Advantage | 20,000 |
| U-Save | 12,000 |
| Payless | 10,000 |
| ACE | 9,000 |
| Fox | 9,000 |
| Rent-A-Wreck | 7,000 |
| Triangle | 6,000 |

Compute the mean.

  • Solution:

Here we have a total of 13 observations, so N=13.

$\mu = \frac{643000 + 327000 + 233000 + 204000 + 167000 + 144000 + 20000 + 12000 + 10000 + 9000 + 9000 + 7000 + 6000}{13}$

$\mu = \frac{1791000}{13} \approx 137769.23$

  • Let's look at how we can do the same in Python.

def calculate_mean(data):
    """
    Function to calculate mean.
    
    Args:
        data (list): containing numbers.
    
    Returns:
        float: the mean value.
    """
    # total number of observations
    N = len(data)
    
    # init total sum
    total_sum = 0
    
    for number in data:
        total_sum += number
    
    return total_sum / N

# dataset
data =[
    643000, 327000, 233000, 204000, 167000, 144000, 
    20000, 12000, 10000, 9000, 9000, 7000, 6000
]

# mean using user-defined function
mean_udf = calculate_mean(data)

# mean using numpy
mean_np = np.mean(data)

print(f"Mean (User Defined Function)= {round(mean_udf, 2)}")
print(f"Mean (NumPy)= {round(mean_np, 2)}")
Mean (User Defined Function)= 137769.23
Mean (NumPy)= 137769.23
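
The loop in `calculate_mean` can also be written more compactly. The sketch below is one possible variant, not a replacement for the function above. `mean_compact` is a hypothetical helper name, and the explicit check for an empty list is an added assumption about how such input should be handled (without it, dividing by a length of zero raises a `ZeroDivisionError`).

```python
from statistics import fmean


def mean_compact(values):
    """Hypothetical helper: arithmetic mean using the built-ins sum() and len()."""
    # Assumption: an empty dataset is treated as an error
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# same car rental dataset as in Example 01
cars_in_service = [
    643000, 327000, 233000, 204000, 167000, 144000,
    20000, 12000, 10000, 9000, 9000, 7000, 6000
]

print(f"Mean (sum/len)= {round(mean_compact(cars_in_service), 2)}")

# statistics.fmean (Python 3.8+) converts the data to floats and returns their mean
print(f"Mean (statistics.fmean)= {round(fmean(cars_in_service), 2)}")
```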

The Outlier Problem

  • What is an outlier?

    • Outliers are the data points that are far from the other data points, i.e. they're unusual/unexpected values in a dataset.

    • e.g. in the scores 10, 25, 27, 29, 32, 34, 50 both 10 and 50 are "outliers".

# make subplots
fig, ax = plt.subplots(figsize=(26, 6), facecolor="black", dpi=600)
ax.set_facecolor("black")

# scatter points
ax.scatter(
    x=[25, 29, 32, 34, 27],
    y=[5]*5, s=1000, lw=1.5,
    fc="none", ec="#F2F2F2"
)

# scatter outlier
ax.scatter(
    [10, 50], [5, 5], s=1000, fc="none", ec="#F2F2F2", hatch=5*'/', zorder=1, lw=1.5,
)
    
# style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

# plot arrow for outlier
a1 = patches.FancyArrowPatch(
    (12.75, 4.915), (9.8, 4.985), connectionstyle="arc3,rad=-.7", **kw
)
ax.add_patch(a1)

# plot arrow for outlier
a1 = patches.FancyArrowPatch(
    (47.8, 4.915), (49.95, 4.985), connectionstyle="arc3,rad=.7", **kw
)
ax.add_patch(a1)

# add text
plot_text_ax(
    ax, "black", x=13.3, y=4.905, s="Value=10\nAn Outlier", color="#F2F2F2", size=18,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1"), 
    zorder=2, fontproperties=fm.prop
)

# add text
plot_text_ax(
    ax, "black", x=44.85, y=4.905, s="Value=50\nAn Outlier", color="#F2F2F2", size=18,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1"), 
    zorder=2, fontproperties=fm.prop
)

# add title
plot_text_ax(
    ax, "black", x=29, y=5.075, s="Out-Lier!!!", 
    size=40, color="#F2F2F2",
    ha="center", fontproperties=fm.prop
)

# add text
for i in [25, 29, 32, 34, 27]:
    plot_text_ax(
        ax, "black", x=i, y=5, s=i, color="#F2F2F2", 
        ha="center", va="center", 
        size=18, zorder=3, fontproperties=fm.prop
    )

# set axis
ax.set(ylim=(4.87, 5.1))

plt.show()
  • The main disadvantage of the mean is its susceptibility to the influence of outliers.

  • When the data contains outliers, the mean loses its ability to represent the central location of the data, because the outliers drag it away from the typical value.

  • Let's take an example to understand this. Let's say we have a dataset with following numbers: 50, 49, 55, 52, 53, 48, 49, 55, 56, 55, 50.

  • The mean($\mu$) of the dataset is:

    • $\mu = \frac{50 + 49 + 55 + 52 + 53 + 48 + 49 + 55 + 56 + 55 + 50}{11} = 52$
  • Let's now visualise the data and mean value.

# make subplots
fig, ax = plt.subplots(figsize=(12, 6), facecolor="black", dpi=600)
ax.set_facecolor("black")

# scatter points
ax.scatter(
    x=[50.1, 48, 49.1, 55.1, 52, 53, 49, 54.9, 56, 55, 50],
    y=[5]*11, s=1000, lw=1.5,
    fc="black", ec="#F2F2F2"
)

# add text
for i in [48, 49, 50, 52, 53, 55, 56]:
    plot_text_ax(
        ax, "black", x=i, y=5, s=i, color="#F2F2F2", 
        ha="center", va="center", 
        size=18, zorder=3, fontproperties=fm.prop
    )
    
# scatter the mean
ax.scatter(
    x=52, y=5, s=1000, lw=1.5, 
    fc="black", ec="#F2F2F2", hatch=5*'/'
)

# add mean text
plot_text_ax(
    ax, "black", x=52, y=4.93, s="Mean μ = 52", color="#F2F2F2", 
    ha="center", va="center", 
    size=20, zorder=3, fontproperties=fm.prop,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1")
)

# style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

# plot arrow pointing to the mean
a1 = patches.FancyArrowPatch(
    (52, 4.95), (52, 4.988), **kw
)
ax.add_patch(a1)
    
# set axis
ax.set(ylim=(4.87, 5.1))

plt.show()
  • From the above figure, we can see that the mean value represents the centre value of the dataset.

  • Now, adding outliers to this dataset will change the mean value, and the outliers will draw the mean further away from the centre.

  • Let's add 85 and 84 to this dataset.

  • The new mean ($\mu_{new}$) becomes:

    • $\mu_{new} = \frac{50 + 49 + 55 + 52 + 53 + 48 + 49 + 55 + 56 + 55 + 50 + 85 + 84}{13} = 57$

# make subplots
fig, ax = plt.subplots(figsize=(24, 6), facecolor="black", dpi=600)
ax.set_facecolor("black")

# scatter points
ax.scatter(
    x=[50.1, 48, 49.1, 55.1, 52, 53, 49, 54.9, 56, 55, 50, 85, 84],
    y=[5]*13, s=1000, lw=1.5,
    fc="black", ec="#F2F2F2"
)

# add text
for i in [48, 49, 50, 52, 53, 55, 56, 85, 84]:
    plot_text_ax(
        ax, "black", x=i, y=5, s=i, color="#F2F2F2", 
        ha="center", va="center", 
        size=18, zorder=3, fontproperties=fm.prop
    )
    
# scatter the mean
ax.scatter(
    x=57, y=5, s=1000, lw=1.5, 
    fc="black", ec="#F2F2F2", hatch=5*'/'
)

# add mean text
plot_text_ax(
    ax, "black", x=57, y=4.93, s="Mean μ = 57", color="#F2F2F2", 
    ha="center", va="center", 
    size=20, zorder=3, fontproperties=fm.prop,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1")
)

# style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

# plot arrow pointing to the mean
a1 = patches.FancyArrowPatch(
    (57, 4.95), (57, 4.988), **kw
)
ax.add_patch(a1)

# plot arrow for outlier
a1 = patches.FancyArrowPatch(
    (84.5, 4.95), (84.5, 4.988), **kw
)
ax.add_patch(a1)

# add outlier text
plot_text_ax(
    ax, "black", x=84.5, y=4.93, s="Outliers", color="#F2F2F2", 
    ha="center", va="center", 
    size=20, zorder=3, fontproperties=fm.prop,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1")
)
    
# set axis
ax.set(ylim=(4.87, 5.1))

plt.show()
  • The presence of outliers in the dataset has shifted the mean, and it no longer represents the center value of the dataset.

  • This problem occurs because the mean is affected by every value in the dataset. If the dataset contains extremely large or small values (i.e. outliers), they pull the mean towards themselves, as seen in the above figure.

  • So, if we have outliers in our dataset, the mean loses its ability to provide the best central location for the data because the outliers will drag the mean away from the typical value.
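
The shift described above can be checked numerically. The sketch below recomputes both means from the numbers used in this section and counts how many values lie below the new mean. The helper name `arithmetic_mean` and the variable names are chosen here for illustration.

```python
def arithmetic_mean(numbers):
    """Sum of the values divided by how many there are."""
    return sum(numbers) / len(numbers)


# dataset from this section, before and after adding the two outliers
values = [50, 49, 55, 52, 53, 48, 49, 55, 56, 55, 50]
outliers = [85, 84]
combined = values + outliers

mean_before = arithmetic_mean(values)
mean_after = arithmetic_mean(combined)

# how many observations lie below the new mean
below_new_mean = sum(1 for value in combined if value < mean_after)

print(f"Mean without outliers: {mean_before}")
print(f"Mean with outliers: {mean_after}")
print(f"Values below the new mean: {below_new_mean} of {len(combined)}")
```

Every one of the original 11 values falls below the new mean, which is why the mean no longer describes a typical observation.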

When To Use Mean

  • We use the mean when both of the following conditions are met:

    • Data is scaled:

      • Data measured on an interval or ratio scale (equal intervals between values), such as speed, weight, height, or temperature.
    • Data does not contain outliers:

      • The mean is sensitive to outliers, so we should use it only when the dataset does not contain them.

Median

Defining The Term

  • The median is the middle value in an ordered array of numbers. It is the value that splits the dataset in half.

  • For an array with an odd number of terms, the median is the middle number.

  • For an array with an even number of terms, the median is the average/mean of the two middle numbers.

  • One needs to perform the following steps to determine the median of a dataset:

    • Arrange the observations in an ordered data array.

    • For an odd number of terms, find the middle term of the ordered array. It is the median.

    • For an even number of terms, find the average of the middle two terms. This average/mean is the median.

  • Another way is to use the following formula:

    • $\text{Median} = \left(\frac{n + 1}{2}\right)^{th} \text{ term}$ of the ordered array

    • Here, n is the total number of observations in the dataset.
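
The position formula can be sketched in code as follows. `median_by_position` is a hypothetical helper written for this illustration, and the two small datasets are invented. When the position ends in .5, the helper averages the terms on either side of it.

```python
import math


def median_by_position(values):
    """Hypothetical helper: find the median via the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")

    # 1-based position of the median; ends in .5 when n is even
    position = (n + 1) / 2

    # terms just below and just above the position (identical when n is odd)
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]

    return position, (lower + upper) / 2


# invented datasets: one with an odd and one with an even number of terms
print(median_by_position([7, 3, 9, 1, 5]))
print(median_by_position([7, 3, 9, 1]))
```

For the five-term dataset the position is 3, so the median is the 3rd term of the ordered array. For the four-term dataset the position is 2.5, so the median is the average of the 2nd and 3rd terms.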

Example

Example 02: Determine the median of the following dataset: 15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 19, 8, 9, 20, 4.

Solution:

Using The First Approach:

   Arrange the numbers in an ordered array: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22

   Since the array contains 17 terms (an odd number of terms), the median is the middle value, i.e. 15.

Using formula:

   Here n = 17, so the median is the $\frac{17 + 1}{2} = \frac{18}{2} = 9^{th}$ term, which is 15.

Example 03: Determine the median of the following dataset: 15, 11, 14, 3, 21, 17, 16, 19, 16, 5, 7, 19, 8, 9, 20, 4.

Solution:

Using The First Approach:

   Arrange the numbers in an ordered array: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21

   Since the array contains 16 terms (an even number of terms), the median is the average/mean of the two middle values, i.e. 14 and 15.

   $Median = \frac{14 + 15}{2} = 14.5$

Using formula:

   Here n = 16, so the median is the $\frac{16 + 1}{2} = \frac{17}{2} = 8.5^{th}$ term.

   8.5th term means the median is located halfway between 8th and 9th term or the average/mean of 14 and 15 which is 14.5.

  • Let's look at how we can do the same in Python.

def calculate_median(data):
    """
    Function to calculate median.
    
    Args:
        data (list): containing numbers.
    
    Returns:
        float: the median value.
    """
    # total number of observations
    N = len(data)
    
    # arrange the data
    data = sorted(data)
    
    if N % 2 == 0:
        median = (data[(N // 2) - 1] + data[N // 2]) / 2
    else:
        median = data[N // 2]
        
    return median

# dataset: odd number of observations (17 values)
data_odd = [
    15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 19, 8, 9, 20, 4
]

# dataset: even number of observations (16 values)
data_even = [
    15, 11, 14, 3, 21, 17, 16, 19, 16, 5, 7, 19, 8, 9, 20, 4
]

print("⟿ For odd number of observations")
print(f"    Median (User Defined Function): {calculate_median(data_odd)}")
print(f"    Median (NumPy): {np.median(data_odd)}")
print()
print("⟿ For even number of observations")
print(f"    Median (User Defined Function): {calculate_median(data_even)}")
print(f"    Median (NumPy): {np.median(data_even)}")
⟿ For odd number of observations
    Median (User Defined Function): 15
    Median (NumPy): 15.0

⟿ For even number of observations
    Median (User Defined Function): 14.5
    Median (NumPy): 14.5

The Outlier Problem

  • The median is less affected by outliers than the mean. This property makes it a better measure of central tendency when the dataset contains outliers.

  • Let's look at the same example we used in the case of mean. We have the following dataset initially: 50, 49, 55, 52, 53, 48, 49, 55, 56, 55, 50.

  • When we arrange the values in an ordered array, it looks like this: 48, 49, 49, 50, 50, 52, 53, 55, 55, 55, 56.

  • Since we have 11 values in the dataset, the median is the middle value (the 6th term), which is 52, the same as the mean.

# make subplots
fig, ax = plt.subplots(figsize=(12, 6), facecolor="black", dpi=600)
ax.set_facecolor("black")

# scatter points
ax.scatter(
    x=[50.1, 48, 49.1, 55.1, 52, 53, 49, 54.9, 56, 55, 50],
    y=[5]*11, s=1000, lw=1.5,
    fc="black", ec="#F2F2F2"
)

# add text
for i in [48, 49, 50, 52, 53, 55, 56]:
    plot_text_ax(
        ax, "black", x=i, y=5, s=i, color="#F2F2F2", 
        ha="center", va="center", 
        size=18, zorder=3, fontproperties=fm.prop
    )
    
# scatter the median
ax.scatter(
    x=52, y=5, s=1000, lw=1.5, 
    fc="black", ec="#F2F2F2", hatch=5*'/'
)

# add median text
plot_text_ax(
    ax, "black", x=52, y=4.93, s="Median = 52", color="#F2F2F2", 
    ha="center", va="center", 
    size=20, zorder=3, fontproperties=fm.prop,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1")
)

# style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

# plot arrow pointing to the median
a1 = patches.FancyArrowPatch(
    (52, 4.95), (52, 4.988), **kw
)
ax.add_patch(a1)
    
# set axis
ax.set(ylim=(4.87, 5.1))

plt.show()
  • Let's add 85 and 84 to this dataset.

  • Now the new dataset (ordered) looks like this: 48, 49, 49, 50, 50, 52, 53, 55, 55, 55, 56, 84, 85.

  • Since we now have 13 observations, the new median is again the middle value (the 7th term), i.e. 53.

# make subplots
fig, ax = plt.subplots(figsize=(24, 6), facecolor="black", dpi=600)
ax.set_facecolor("black")

# scatter points
ax.scatter(
    x=[50.1, 48, 49.1, 55.1, 52, 53, 49, 54.9, 56, 55, 50, 84, 85],
    y=[5]*13, s=1000, lw=1.5,
    fc="black", ec="#F2F2F2"
)

# add text
for i in [48, 49, 50, 52, 53, 55, 56, 84, 85]:
    plot_text_ax(
        ax, "black", x=i, y=5, s=i, color="#F2F2F2", 
        ha="center", va="center", 
        size=18, zorder=3, fontproperties=fm.prop
    )
    
# scatter the median
ax.scatter(
    x=53, y=5, s=1000, lw=1.5, 
    fc="black", ec="#F2F2F2", hatch=5*'/'
)

# add median text
plot_text_ax(
    ax, "black", x=53, y=4.93, s="Median = 53", color="#F2F2F2", 
    ha="center", va="center", 
    size=20, zorder=3, fontproperties=fm.prop,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1")
)

# style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

# plot arrow pointing to the median
a1 = patches.FancyArrowPatch(
    (53, 4.95), (53, 4.988), **kw
)
ax.add_patch(a1)
    
# set axis
ax.set(ylim=(4.87, 5.1))

plt.show()
  • The presence of outliers in the dataset has shifted the median only slightly (from 52 to 53), and it still represents the center value of the dataset.

  • The median is far less sensitive to outliers than the mean. It is a better measure of central tendency when there are extremely large or small values in a dataset.
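
To put the two measures side by side, the sketch below computes how far each one moves when 85 and 84 are added to the dataset used in this section. It relies on `mean` and `median` from Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# dataset from this section, before and after adding the two outliers
original = [50, 49, 55, 52, 53, 48, 49, 55, 56, 55, 50]
with_outliers = original + [85, 84]

# how far each measure moves once the outliers are added
mean_shift = mean(with_outliers) - mean(original)
median_shift = median(with_outliers) - median(original)

print(f"Mean:   {mean(original)} -> {mean(with_outliers)} (shift of {mean_shift})")
print(f"Median: {median(original)} -> {median(with_outliers)} (shift of {median_shift})")
```

The mean moves by 5 while the median moves by only 1, which matches the figures above.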

When To Use Median

  • The median is used when either one of two conditions is met:

    • Data is ordinal.

    • Data contains outliers.

Disadvantages

  • To calculate the median, all the data must first be arranged in ascending or descending order. In the case of a large number of items, this becomes tedious and time-consuming.

  • It is a less representative average because it does not depend on all the items in the series, and hence is not used in many statistical tests.

Mode

Defining The Term

  • The mode is the most frequently occurring value in a set of data.

  • On a bar chart, the mode is the highest bar. If no value repeats, the data do not have a mode.

  • A dataset can also have more than one mode.

  • We can use the mode for qualitative as well as quantitative data. The mean, on the other hand, requires quantitative data, and the median requires data that can at least be ordered (ordinal or quantitative).

  • The mode has its limitations too. In some datasets, the mode may not reflect the center of the dataset very well. The presence of more than one mode can limit the ability of the mode to describe the center. The mode should only be used when we want to find the most frequently occurring value in our dataset.
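
The points above about qualitative data and multiple modes can be sketched with `statistics.multimode` from the standard library (available from Python 3.8). The datasets and variable names below are invented for illustration.

```python
from statistics import multimode

# qualitative data: hypothetical shirt sizes sold in a shop
shirt_sizes = ["M", "L", "S", "M", "XL", "L", "M"]

# quantitative data with two equally frequent values
two_modes = [1, 2, 2, 3, 4, 4]

# no value repeats
no_repeats = [1, 2, 3]

print(multimode(shirt_sizes))
print(multimode(two_modes))
print(multimode(no_repeats))
```

`multimode` returns every value that shares the highest frequency, in the order first encountered. When no value repeats, it returns all the values, which corresponds to the "no mode" case described above.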

Example

Example 04: Compute the mode of the following dataset: 1, 2, 2, 3, 4, 4, 5, 5, 5.

Solution: The value 5 occurs most frequently (3 times), so 5 is the mode of the given dataset.

  • Let's look at how we can do the same in Python.

def calculate_mode(data):
    """
    Function to calculate mode.
    
    Args:
        data (list): containing numbers.
    
    Returns:
        list or str: the most frequent value(s), or a message if no value repeats.
    """
    # count how often each value occurs
    mode_collect = {}
    
    for number in data:
        mode_collect[number] = mode_collect.get(number, 0) + 1
        
    # highest frequency, computed once (0 for an empty dataset)
    max_count = max(mode_collect.values(), default=0)
    
    # get most frequent values
    mode = [k for k, v in mode_collect.items() if v == max_count]
    
    # every value occurs exactly once, so nothing repeats
    if len(mode) == len(data):
        return "No mode found!"
    else:
        return mode

# data
data = [1, 2, 2, 3, 4, 4, 5, 5, 5]

print(f"Mode (User Defined Function): {calculate_mode(data)}")
print(f"Mode (SciPy): {stats.mode(data)[0]}")
Mode (User Defined Function): [5]
Mode (SciPy): [5]

Questionnaire

Ques 01: Determine the median and mode for the following numbers:

     2, 4, 8, 4, 6, 2, 7, 8, 4, 3, 8, 9, 4, 3, 5

Ques 02: Determine the median for the following numbers:

     213, 345, 609, 73, 167, 243, 444, 524, 199, 682

Ques 03: Compute the mean for the following numbers:

     17.3, 44.5, 31.6, 40.0, 52.8, 38.8, 30.1, 78.5

Ques 04: Compute the mean for the following numbers:

     7, 5, -2, 9, 0, -3, -6, -7, -4, -5, 2, -8

If you face any problems or have any feedback or suggestions, feel free to comment.