## required packages/modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import patches
from IPython.display import display, HTML

## default font style
rcParams["font.family"] = "serif"

## format output
CSS = """
.output {
  margin-left: 20px;
}
"""

HTML('<style>{}</style>'.format(CSS))

Overview

  • Graphs and charts are among the most effective ways of presenting data.

  • From a graph or chart, decision-makers can often get an overall picture of the data and reach useful conclusions merely by studying it.

  • We classify data graphs as quantitative or qualitative.

  • Quantitative data graphs are plotted along a numerical scale.

  • Qualitative data graphs are plotted using non-numerical categories.

What is a Histogram?

  • One of the more widely used types of graphs for quantitative data is the histogram.

  • A histogram is a series of contiguous bars or rectangles that represents the frequency of data in given class intervals.

Construction

  • The first step is to locate the class boundaries on the x-axis (horizontal axis) and frequencies on the y-axis (vertical axis).

  • Then construct a vertical rectangle on each line segment that represents a class interval, so that the height of the rectangle equals the frequency of that class interval.
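
As a rough sketch of the two steps above, the class boundaries and the frequency of each class can be computed with NumPy before anything is drawn. The sample values and variable names here are made up for illustration and are not part of the example that follows.

```python
import numpy as np

# hypothetical raw values, chosen only to illustrate the two steps
values = [31, 35, 42, 47, 48, 55]

# step 1: class boundaries to be located on the x-axis
boundaries = np.array([30, 40, 50, 60])

# step 2: frequency of each class interval, i.e. the height of each rectangle
frequencies, _ = np.histogram(values, bins=boundaries)

for left, right, freq in zip(boundaries[:-1], boundaries[1:], frequencies):
    print(f"{left} - under {right}: {freq}")
```

Note that `np.histogram` treats every class as "left - under right" except the last one, which also includes its right boundary.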

Example

  • Suppose we have the following frequency distribution:

Class Interval          Frequency
30.0 - under 40.0       1
40.0 - under 50.0       0
50.0 - under 60.0       5
60.0 - under 70.0       4
70.0 - under 80.0       15
80.0 - under 90.0       5
90.0 - under 100.0      7

  • Our first step is to locate class boundaries and frequencies.

def create_axis(
    xticks, yticks, xlim, ylim, 
    xlabel="Class Interval",
    ylabel="Frequency"
):
    """
    Function to create axis.
    
    Args:
        xticks (numpy.array): xtick values.
        yticks (numpy.array): ytick values.
        xlim (tuple): x-limit.
        ylim (tuple): y-limit.
        xlabel (str, optional): X label value.
        ylabel (str, optional): y label value.
    
    Returns:
        figure.Figure: figure object.
        axes.Axes: axes object.
    """
    ## create subplot
    fig, ax = plt.subplots(facecolor="#121212", figsize=(12,8))
    ax.set_facecolor("#121212")

    ## hide the top and right spines
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    ## change color
    ax.spines['bottom'].set_color("#F2F2F2")
    ax.spines['left'].set_color("#F2F2F2") 

    ## change color of tick params
    ax.tick_params(axis='x', colors="#F2F2F2")
    ax.tick_params(axis='y', colors="#F2F2F2")

    ## set ticks
    ax.set_xticks(np.round(xticks, 2))
    ax.set_yticks(np.round(yticks, 2))

    ## set labels
    ax.set_xlabel(xlabel, color="#F2F2F2", size=20)
    ax.set_ylabel(ylabel, color="#F2F2F2", size=20)

    ## setting the limit
    ax.set(xlim=xlim, ylim=ylim)

    ## credits
    fig.text(
        0.9, 0.02, "graphic: @slothfulwave612", 
        fontsize=10, fontstyle="italic", color="#F2F2F2",
        ha="right", va="center"
    )
    
    return fig, ax

fig, ax = create_axis(
    xticks=np.linspace(30,100,8), yticks=np.linspace(0,15,16), xlim=(30,101), ylim=(0,15)
)
plt.show()
  • Now that the class boundaries and frequencies are marked on the axes, we draw a bar for each class interval.

  • For the first class interval (30 - under 40), the frequency is 1, so the bar rises to the 1 mark on the y-axis:

## plot first bin
ax.hist(
    x=[33], bins=[30,40], edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)
fig
  • For the second class interval (40 - under 50), the frequency is 0, so no bar is drawn.

  • For the third class interval (50 - under 60), the frequency is 5, so the bar rises to the 5 mark on the y-axis:

## plot third bin
ax.hist(
    x=[52, 55, 53, 58, 57], bins=[50,60], edgecolor="#F2F2F2", 
    linewidth=1, color="#121212", hatch=1*'/'
)
fig
  • The process continues in the same way until the whole frequency table is plotted.

## all test scores
test_scores = [
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64, 
    74, 82, 94, 71, 79, 73, 94, 77, 53, 
    77, 87, 97, 57, 72, 89, 76, 91, 86, 
    99, 71, 73, 58, 76, 33, 78, 69
]

## create new plot
fig, ax = create_axis(
    xticks=np.linspace(30,100,8), yticks=np.linspace(0,15,16), xlim=(30,101), ylim=(0,15)
)

## plot whole histogram
ax.hist(
    x=test_scores, bins=np.linspace(30,100,8),
    edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)

plt.show()
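
Before reading the plot, it can help to confirm that the raw scores really reproduce the frequency table. The following is a minimal check, assuming the same scores and class boundaries as in the example; the variable names `class_boundaries` and `frequencies` are illustrative.

```python
import numpy as np

# test scores copied from the example
test_scores = [
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
    74, 82, 94, 71, 79, 73, 94, 77, 53,
    77, 87, 97, 57, 72, 89, 76, 91, 86,
    99, 71, 73, 58, 76, 33, 78, 69
]

# same class boundaries as the plot: 30, 40, ..., 100
class_boundaries = np.linspace(30, 100, 8)

# count how many scores fall in each class interval
frequencies, _ = np.histogram(test_scores, bins=class_boundaries)

for left, right, freq in zip(class_boundaries[:-1], class_boundaries[1:], frequencies):
    print(f"{left} - under {right}: {freq}")
```
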
  • A histogram is a useful tool for differentiating the frequencies of class intervals. A glance at a histogram reveals which class intervals produce the highest frequency totals.

    • In the figure above, the class interval 70 - under 80 has by far the highest frequency (15).
  • Examination of the histogram reveals where large increases or decreases occur between classes.

    • For example, the frequency rises by 5 from the 40 - under 50 class to the 50 - under 60 class, rises by 11 from the 60 - under 70 class to the 70 - under 80 class, and falls by 10 from the 70 - under 80 class to the 80 - under 90 class.
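
The same readings can be made numerically. This is a small sketch, assuming the frequencies from the table in the example; `lower_bounds`, `peak` and `changes` are illustrative names.

```python
import numpy as np

# frequencies from the frequency table in the example
frequencies = np.array([1, 0, 5, 4, 15, 5, 7])

# lower boundary of each class interval: 30, 40, ..., 90
lower_bounds = np.arange(30, 100, 10)

# class interval with the highest frequency
peak = int(np.argmax(frequencies))
print(f"highest: {lower_bounds[peak]} - under {lower_bounds[peak] + 10} "
      f"({frequencies[peak]})")

# change in frequency from each class interval to the next
changes = np.diff(frequencies)
for lower, change in zip(lower_bounds[1:], changes):
    print(f"into {lower} - under {lower + 10}: {change:+d}")
```
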
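
Note: If you use different class intervals or different scales for the x-axis and y-axis, the resulting histogram will look different from the one plotted above. The example below uses ten class intervals of width 7 and a coarser tick spacing on the y-axis: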

## create new plot
fig, ax = create_axis(
    xticks=np.linspace(30, 100, 11), yticks=np.linspace(0, 15, 4), xlim=(30,101), ylim=(0,15)
)

## plot whole histogram
ax.hist(
    x=test_scores, bins=np.linspace(30, 100, 11),
    edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)

plt.show()

Note: It is important that the user of the graph clearly understands the scales used for the axes of a histogram. Otherwise, a graph’s creator can “lie with statistics” by stretching or compressing a graph to make a point.
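
The re-drawn histogram above changes more than the axis ticks: its bins are ten classes of width 7 rather than seven classes of width 10, so the counts behind the bars change as well. The sketch below compares the two sets of counts; the scores are copied from the example and the variable names are illustrative.

```python
import numpy as np

# test scores copied from the example
test_scores = [
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
    74, 82, 94, 71, 79, 73, 94, 77, 53,
    77, 87, 97, 57, 72, 89, 76, 91, 86,
    99, 71, 73, 58, 76, 33, 78, 69
]

# seven classes of width 10 (first histogram)
counts_width_10, _ = np.histogram(test_scores, bins=np.linspace(30, 100, 8))

# ten classes of width 7 (second histogram)
counts_width_7, _ = np.histogram(test_scores, bins=np.linspace(30, 100, 11))

print("width 10:", counts_width_10.tolist())
print("width 7: ", counts_width_7.tolist())

# both binnings still account for every score
print(counts_width_10.sum(), counts_width_7.sum())
```

The total stays the same, but the tallest bar no longer holds 15 scores, which is one more reason to state the class intervals clearly alongside the axis scales.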

Histograms with non-uniform widths

  • The histograms we plotted above have equal class-widths.

  • What if the class widths are unequal? This section shows how to construct a histogram in that case.

  • Let's first see what happens if we plot a histogram with unequal class widths in the same way as one with equal class widths.

  • Suppose our data looks like this:

Class Interval      Frequency
0 - under 10        10
10 - under 20       20
20 - under 40       30

  • Here the class width of the third class interval (20) differs from that of the other two (10).

  • If we draw it in the same way, with the frequency as the bar height, the histogram looks like this:

data = [
    8, 6, 0, 4, 5, 3, 2, 4, 3, 5,
    10, 10, 17, 16, 13, 12, 18, 16, 10, 14, 18, 14, 14, 15, 15, 11, 16, 17, 10, 13,
    32, 39, 39, 30, 30, 23, 27, 37, 25, 23, 34, 38, 26, 28, 23, 39, 28,
    38, 20, 39, 20, 31, 29, 37, 38, 26, 20, 20, 21, 37
]

# create new plot
fig, ax = create_axis(
    xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 30, 4), xlim=(0,41), ylim=(0,31)
)

# plot whole histogram
ax.hist(
    x=data, bins=[0, 10, 20, 40],
    edgecolor="#F2F2F2", linewidth=1, color="#121212"
)

# annotate bars
ax.text(
    5, 5, "bar 1", size=15, ha="center", va="center", color="#F2F2F2"
)
ax.text(
    15, 10, "bar 2", size=15, ha="center", va="center", color="#F2F2F2"
)
ax.text(
    30, 15, "bar 3", size=15, ha="center", va="center", color="#F2F2F2"
)


plt.show()
  • The problem is that this plot misrepresents the data: a reader compares bars by their size, and bar 3 is both the tallest and the widest, so it looks far larger than its frequency warrants.

  • To see this, let us draw some guide lines and annotate the sizes of the resulting rectangles.

# create new plot
fig, ax = create_axis(
    xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 30, 4), xlim=(0,41), ylim=(0,31)
)

# plot whole histogram
ax.hist(
    x=data, bins=[0, 10, 20, 40],
    edgecolor="#F2F2F2", linewidth=1, color="#121212"
)

ax.plot(
    [30, 30], [0, 20], lw=1, color="#F2F2F2", ls="--"
)
ax.plot(
    [20, 40], [20, 20], lw=1, color="#F2F2F2", ls="--"
)

a1 = patches.FancyArrowPatch(
    (20,21), (40,21), 
    arrowstyle="<|-|>,head_length=5,head_width=5", color="#F2F2F2", alpha=0.7
)
ax.add_patch(a1)

a1 = patches.FancyArrowPatch(
    (29,0), (29,20), 
    arrowstyle="<|-|>,head_length=5,head_width=5", color="#F2F2F2", alpha=0.7
)
ax.add_patch(a1)

ax.text(
    28.2, 11, "20 units", color="#F2F2F2", size=12, rotation=90,
    va="top"
)
ax.text(
    30, 21.5, "20 units", color="#F2F2F2", size=12, ha="center"
)

ax.text(
    30, 25, "Same area as bar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
    35, 10, "Same area\nas\nbar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
    25, 10, "Same area\nas\nbar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
    15, 10, "bar 2", size=15, ha="center", color="#F2F2F2"
)

plt.show()
  • After breaking down bar 3, we can see that its area is three times the area of bar 2.

  • Visually, then, bar 3 stands for a number three times as large as the one bar 2 stands for.

  • Bar 2 represents a frequency of 20, so bar 3 should represent a frequency of 60. But its actual frequency is 30, which is only one and a half times 20.

  • So this is not the right way to draw a histogram whose class intervals have unequal widths.
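
The mismatch can also be checked with a few lines of arithmetic. This is a minimal sketch using the class boundaries and frequencies from the table above; the variable names are illustrative.

```python
import numpy as np

# class boundaries and frequencies from the table
bins = np.array([0, 10, 20, 40])
frequency = np.array([10, 20, 30])

# width of each class interval: 10, 10, 20
widths = np.diff(bins)

# area of each bar when the bar height is the raw frequency
areas = widths * frequency

# bar 3 compared with bar 2: by area versus by frequency
area_ratio = areas[2] / areas[1]
freq_ratio = frequency[2] / frequency[1]

print("areas:", areas.tolist())
print("area ratio (bar 3 / bar 2):", area_ratio)
print("frequency ratio (bar 3 / bar 2):", freq_ratio)
```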

  • To solve this issue, we calculate the frequency density, which is given by the following equation:

    $\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}}$

  • Frequency density gives the frequency per unit of class width, where the unit is the unit of measurement of the data.

  • When we add a frequency density column to our data, the table looks like this:

Class Interval      Frequency      Frequency Density
0 - under 10        10             10/10 = 1
10 - under 20       20             20/10 = 2
20 - under 40       30             30/20 = 1.5

  • Now we can plot the histogram, with class intervals on the x-axis and frequency density on the y-axis.

# given data
data = [
    8, 6, 0, 4, 5, 3, 2, 4, 3, 5,
    10, 10, 17, 16, 13, 12, 18, 16, 10, 14, 18, 14, 14, 15, 15, 11, 16, 17, 10, 13,
    32, 39, 39, 30, 30, 23, 27, 37, 25, 23, 34, 38, 26, 28, 23, 39, 28,
    38, 20, 39, 20, 31, 29, 37, 38, 26, 20, 20, 21, 37
]

# bins
bins = np.array([0, 10, 20, 40])

# class-widths
class_widths = bins[1:] - bins[:-1]

# frequency
frequency = np.histogram(data, bins=bins)[0]

# frequency-density
freq_dens = frequency / class_widths

# create new plot
fig, ax = create_axis(
    xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 3, 4), xlim=(0,41), ylim=(0,3),
    ylabel="Frequency Density"
)

# plot bars
ax.fill_between(bins.repeat(2)[1:-1], freq_dens.repeat(2),
                fc="#121212", ec="#F2F2F2", hatch=1*'/', lw=1, zorder=1)

# plot lines
for i in range(0, len(freq_dens) - 1):
    ax.plot(
        [bins[i + 1], bins[i + 1]], [0, freq_dens[i]], color="#F2F2F2", zorder=2, lw=1
    )


plt.show()
  • Here, the area of each bar equals the frequency of its class interval.

    • Area of bar 1 = 10 × 1 = 10

    • Area of bar 2 = 10 × 2 = 20

    • Area of bar 3 = 20 × 1.5 = 30

  • So, this is how we construct a histogram with unequal class widths.
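
As a cross-check, the sketch below recovers the frequencies from the bar areas and relates frequency density to NumPy's own `density=True` option, which divides by the total count as well as by the class width. The stand-in data (one repeated value per class) is an assumption made to keep the block short; only the class counts matter here.

```python
import numpy as np

# stand-in data: 10, 20 and 30 values placed inside the three classes
data = np.repeat([5, 15, 30], [10, 20, 30])
bins = np.array([0, 10, 20, 40])
class_widths = np.diff(bins)

# frequency and frequency density, as defined above
frequency, _ = np.histogram(data, bins=bins)
freq_dens = frequency / class_widths

# area of each bar (width x height) gives back the frequency
areas = freq_dens * class_widths
print("areas:", areas.tolist())

# density=True returns frequency / (total count x class width)
prob_dens, _ = np.histogram(data, bins=bins, density=True)

# so multiplying by the total count gives the frequency density
print(np.allclose(prob_dens * len(data), freq_dens))
```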

Conclusion

  • If the class intervals used along the horizontal axis are equal, then the height of the bars represents the frequency of values in a given class interval.

  • If the class intervals are unequal, then the areas of the bars are used for relative comparisons of class frequencies.
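
The two rules can be folded into one small function. `bar_heights` is a hypothetical helper written for this summary, not something used earlier in these notes: it returns the raw frequencies when all class widths are equal and the frequency densities otherwise.

```python
import numpy as np

def bar_heights(bins, frequency):
    """Hypothetical helper: return (heights, y-axis label) for a histogram."""
    bins = np.asarray(bins, dtype=float)
    frequency = np.asarray(frequency, dtype=float)
    widths = np.diff(bins)

    # equal class widths: bar height is the frequency itself
    if np.allclose(widths, widths[0]):
        return frequency, "Frequency"

    # unequal class widths: bar height is the frequency density
    return frequency / widths, "Frequency Density"

# equal widths (first three classes of the test-score example)
equal_heights, equal_label = bar_heights([30, 40, 50, 60], [1, 0, 5])
print(equal_label, equal_heights.tolist())

# unequal widths (the second example)
unequal_heights, unequal_label = bar_heights([0, 10, 20, 40], [10, 20, 30])
print(unequal_label, unequal_heights.tolist())
```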

Questionnaire

Ques 01: Construct a histogram for the following data:

Class Interval      Frequency
30 - under 32       5
32 - under 34       7
34 - under 36       15
36 - under 38       21
38 - under 40       34
40 - under 42       24
42 - under 44       17
44 - under 46       8

Ques 02: Construct a histogram for the following data:

Class Interval      Frequency
0 - under 10        5
10 - under 20       7
20 - under 25       15
25 - under 30       21
30 - under 40       34
40 - under 60       24
60 - under 90       17
90 - under 100      8

1. Notes are compiled from TLMaths and Business Statistics by Ken Black

2. If you face any problem or have any feedback/suggestions feel free to comment.