Tutorial 06: Histograms
Introduction to Quantitative Data Graphs with histograms.
- Overview
- What is a Histogram?
- Construction
- Histograms with non-uniform widths
- Conclusion
- Questionnaire
## required packages/modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import patches
from IPython.display import display, HTML
## default font style
rcParams["font.family"] = "serif"
## format output
CSS = """
.output {
margin-left: 20px;
}
"""
HTML('<style>{}</style>'.format(CSS))
Overview
-
One of the most effective mechanisms for presenting data is through graphs and charts.
-
Through graphs and charts, the decision-makers can often get an overall picture of the data and reach some useful conclusions merely by studying the chart or graph.
-
We classify data graphs as quantitative or qualitative.
-
Quantitative data graphs are plotted along a numerical scale.
-
Qualitative data graphs are plotted using non-numerical categories.
What is a Histogram?
-
One of the more widely used types of graphs for quantitative data is the histogram.
-
A histogram is a series of contiguous bars or rectangles that represents the frequency of data in given class intervals.
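To make the definition concrete, the short sketch below counts how many values fall into each class interval with `np.histogram`. The sample values and the names `values` and `edges` are made up for illustration and are not part of the tutorial's data.

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
values = [31, 35, 42, 47, 48, 55]

# class boundaries: 30 - under 40, 40 - under 50, 50 - under 60
edges = [30, 40, 50, 60]

# counts[i] is the number of values with edges[i] <= value < edges[i + 1]
# (NumPy additionally includes the right edge in the last interval)
counts, _ = np.histogram(values, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} - under {right}: {count}")
```

Each printed count is the height of the corresponding bar in the histogram.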
Construction
-
The first step is to locate the class boundaries on the x-axis (horizontal axis) and frequencies on the y-axis (vertical axis).
-
And then construct a vertical rectangle on each line segment representing a class interval such that the height of the rectangle represents the frequency of the class interval.
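The two construction steps can also be sketched directly from a frequency table, without the raw observations. The helper name `draw_histogram` is hypothetical, and the sketch uses plain `Axes.bar` with default styling rather than the styled axes used elsewhere in this notebook; the class boundaries and frequencies are those of the example in this tutorial.

```python
import numpy as np
import matplotlib.pyplot as plt


def draw_histogram(boundaries, frequencies):
    """Draw one rectangle per class interval (hypothetical helper)."""
    boundaries = np.asarray(boundaries, dtype=float)
    frequencies = np.asarray(frequencies, dtype=float)

    fig, ax = plt.subplots()
    # step 1: locate the class boundaries on the x-axis
    ax.set_xticks(boundaries)
    # step 2: one rectangle per interval, with height equal to the frequency
    ax.bar(
        boundaries[:-1], frequencies,
        width=np.diff(boundaries), align="edge", edgecolor="black"
    )
    ax.set_xlabel("Class Interval")
    ax.set_ylabel("Frequency")
    return fig, ax


fig, ax = draw_histogram(
    boundaries=[30, 40, 50, 60, 70, 80, 90, 100],
    frequencies=[1, 0, 5, 4, 15, 5, 7],
)
```

In a notebook, evaluating `fig` or calling `plt.show()` afterwards displays the result. Passing `align="edge"` makes each bar start at the lower class boundary, so adjacent bars touch.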
Example
- Suppose we have the following frequency distribution:
Class Intervals | Frequency |
---|---|
30.0 - under 40.0 | 1 |
40.0 - under 50.0 | 0 |
50.0 - under 60.0 | 5 |
60.0 - under 70.0 | 4 |
70.0 - under 80.0 | 15 |
80.0 - under 90.0 | 5 |
90.0 - under 100.0 | 7 |
- Our first step is to locate class boundaries and frequencies.
def create_axis(
    xticks, yticks, xlim, ylim,
    xlabel="Class Interval",
    ylabel="Frequency"
):
    """
    Function to create axis.
    Args:
        xticks (numpy.array): xtick values.
        yticks (numpy.array): ytick values.
        xlim (tuple): x-limit.
        ylim (tuple): y-limit.
        xlabel (str, optional): x label value.
        ylabel (str, optional): y label value.
    Returns:
        figure.Figure: figure object.
        axes.Axes: axes object.
    """
    ## create subplot
    fig, ax = plt.subplots(facecolor="#121212", figsize=(12,8))
    ax.set_facecolor("#121212")
    ## hide the top and right spines
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ## change color of the remaining spines
    ax.spines['bottom'].set_color("#F2F2F2")
    ax.spines['left'].set_color("#F2F2F2")
    ## change color of tick params
    ax.tick_params(axis='x', colors="#F2F2F2")
    ax.tick_params(axis='y', colors="#F2F2F2")
    ## set ticks
    ax.set_xticks(np.round(xticks, 2))
    ax.set_yticks(np.round(yticks, 2))
    ## set labels
    ax.set_xlabel(xlabel, color="#F2F2F2", size=20)
    ax.set_ylabel(ylabel, color="#F2F2F2", size=20)
    ## setting the limit
    ax.set(xlim=xlim, ylim=ylim)
    ## credits
    fig.text(
        0.9, 0.02, "graphic: @slothfulwave612",
        fontsize=10, fontstyle="italic", color="#F2F2F2",
        ha="right", va="center"
    )
    return fig, ax
fig, ax = create_axis(
xticks=np.linspace(30,100,8), yticks=np.linspace(0,15,16), xlim=(30,101), ylim=(0,15)
)
plt.show()
-
With the class boundaries and frequencies laid out on the axes, we now draw one bar for each class interval.
-
For the first class interval (30 - under 40), the frequency is 1, so the bar rises to the 1 mark on the y-axis:
## plot first bin
ax.hist(
x=[33], bins=[30,40], edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)
fig
-
For the second class interval (40 - under 50), the frequency is 0, so no bar is drawn.
-
For the third class interval (50 - under 60), the frequency is 5, so the bar rises to the 5 mark on the y-axis:
## plot third bin
ax.hist(
x=[52, 55, 53, 58, 57], bins=[50,60], edgecolor="#F2F2F2",
linewidth=1, color="#121212", hatch=1*'/'
)
fig
- The process continues in the same way until the whole frequency table is plotted.
## all test scores
test_scores = [
52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
74, 82, 94, 71, 79, 73, 94, 77, 53,
77, 87, 97, 57, 72, 89, 76, 91, 86,
99, 71, 73, 58, 76, 33, 78, 69
]
## create new plot
fig, ax = create_axis(
xticks=np.linspace(30,100,8), yticks=np.linspace(0,15,16), xlim=(30,101), ylim=(0,15)
)
## plot whole histogram
ax.hist(
x=test_scores, bins=np.linspace(30,100,8),
edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)
plt.show()
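As a quick sanity check, the raw test scores can be binned with `np.histogram` to confirm that they reproduce the frequency table from the example. This is only a sketch; the name `edges` is introduced here for illustration.

```python
import numpy as np

# the test scores used in this tutorial
test_scores = [
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
    74, 82, 94, 71, 79, 73, 94, 77, 53,
    77, 87, 97, 57, 72, 89, 76, 91, 86,
    99, 71, 73, 58, 76, 33, 78, 69
]

# class boundaries 30, 40, ..., 100 (same as the bins passed to ax.hist)
edges = np.linspace(30, 100, 8)

# frequency of each class interval
counts, _ = np.histogram(test_scores, bins=edges)
print(counts)  # [ 1  0  5  4 15  5  7]
```

The counts match the Frequency column of the table, so the bar heights in the plot are the tabulated frequencies.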
-
A histogram is a useful tool for comparing the frequencies of class intervals. A glance at a histogram reveals which class intervals have the highest frequencies.
- The figure above clearly shows that the class interval 70 - under 80 has by far the highest frequency (15).
-
Examination of the histogram reveals where large increases or decreases occur between classes.
- For example, from the 40 - under 50 class to the 50 - under 60 class there is an increase of 5; from the 60 - under 70 class to the 70 - under 80 class, an increase of 11; and from the 70 - under 80 class to the 80 - under 90 class, a decrease of 10.
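The changes between neighbouring classes can also be read off numerically. The sketch below applies `np.diff` to the frequencies from the example table; the variable names are chosen here for illustration.

```python
import numpy as np

# frequencies of the classes 30-40, 40-50, ..., 90-100 from the example table
frequencies = np.array([1, 0, 5, 4, 15, 5, 7])

# changes[i] is the change from class i to class i + 1
changes = np.diff(frequencies)
print(changes)  # [ -1   5  -1  11 -10   2]
```

The entries 5, 11 and -10 are the increases and the decrease mentioned above.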
Note: If you use different scales for the x-axis and y-axis, the resulting histogram will look different from the one plotted above. The following example also uses ten class intervals of width 7 instead of seven intervals of width 10:
## create new plot
fig, ax = create_axis(
xticks=np.linspace(30, 100, 11), yticks=np.linspace(0, 15, 4), xlim=(30,101), ylim=(0,15)
)
## plot whole histogram
ax.hist(
x=test_scores, bins=np.linspace(30, 100, 11),
edgecolor="#F2F2F2", linewidth=1, color="#121212", hatch=1*"/"
)
plt.show()
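In this second plot the class boundaries change too: `np.linspace(30, 100, 11)` produces ten intervals of width 7. The sketch below compares the two binnings of the same scores; the names `edges_10` and `edges_7` are hypothetical.

```python
import numpy as np

test_scores = [
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
    74, 82, 94, 71, 79, 73, 94, 77, 53,
    77, 87, 97, 57, 72, 89, 76, 91, 86,
    99, 71, 73, 58, 76, 33, 78, 69
]

edges_10 = np.linspace(30, 100, 8)   # seven intervals of width 10
edges_7 = np.linspace(30, 100, 11)   # ten intervals of width 7

counts_10, _ = np.histogram(test_scores, bins=edges_10)
counts_7, _ = np.histogram(test_scores, bins=edges_7)

print(np.diff(edges_10), counts_10)
print(np.diff(edges_7), counts_7)
```

Both binnings account for all 37 scores, but the individual bar heights differ, which is why the two histograms do not look alike.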
Histograms with non-uniform widths
-
The histograms we plotted above have equal class-widths.
-
What if the class widths are unequal? This section shows how to construct a histogram in that case.
-
Let's first see what happens if we plot a histogram with unequal class widths in the same way as one with equal class widths.
-
Suppose our data looks like this:
Class Interval | Frequency |
---|---|
0 - under 10 | 10 |
10 - under 20 | 20 |
20 - under 40 | 30 |
-
Here the third class interval has a width of 20, while the first two have a width of 10.
-
If we draw it in the same way as before, the histogram looks like this:
data = [
8, 6, 0, 4, 5, 3, 2, 4, 3, 5,
10, 10, 17, 16, 13, 12, 18, 16, 10, 14, 18, 14, 14, 15, 15, 11, 16, 17, 10, 13,
32, 39, 39, 30, 30, 23, 27, 37, 25, 23, 34, 38, 26, 28, 23, 39, 28,
38, 20, 39, 20, 31, 29, 37, 38, 26, 20, 20, 21, 37
]
# create new plot
fig, ax = create_axis(
xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 30, 4), xlim=(0,41), ylim=(0,31)
)
# plot whole histogram
ax.hist(
x=data, bins=[0, 10, 20, 40],
edgecolor="#F2F2F2", linewidth=1, color="#121212"
)
# annotate bars
ax.text(
5, 5, "bar 1", size=15, ha="center", va="center", color="#F2F2F2"
)
ax.text(
15, 10, "bar 2", size=15, ha="center", va="center", color="#F2F2F2"
)
ax.text(
30, 15, "bar 3", size=15, ha="center", va="center", color="#F2F2F2"
)
plt.show()
-
The problem is that this plot does not represent the data faithfully: the areas of the bars are not proportional to the frequencies.
-
To see why, let us draw some lines and annotate the sizes of the resulting rectangles.
# create new plot
fig, ax = create_axis(
xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 30, 4), xlim=(0,41), ylim=(0,31)
)
# plot whole histogram
ax.hist(
x=data, bins=[0, 10, 20, 40],
edgecolor="#F2F2F2", linewidth=1, color="#121212"
)
ax.plot(
[30, 30], [0, 20], lw=1, color="#F2F2F2", ls="--"
)
ax.plot(
[20, 40], [20, 20], lw=1, color="#F2F2F2", ls="--"
)
a1 = patches.FancyArrowPatch(
(20,21), (40,21),
arrowstyle="<|-|>,head_length=5,head_width=5", color="#F2F2F2", alpha=0.7
)
ax.add_patch(a1)
a1 = patches.FancyArrowPatch(
(29,0), (29,20),
arrowstyle="<|-|>,head_length=5,head_width=5", color="#F2F2F2", alpha=0.7
)
ax.add_patch(a1)
ax.text(
28.2, 11, "20 units", color="#F2F2F2", size=12, rotation=90,
va="top"
)
ax.text(
30, 21.5, "20 units", color="#F2F2F2", size=12, ha="center"
)
ax.text(
30, 25, "Same area as bar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
35, 10, "Same area\nas\nbar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
25, 10, "Same area\nas\nbar 2", size=15, ha="center", color="#F2F2F2"
)
ax.text(
15, 10, "bar 2", size=15, ha="center", color="#F2F2F2"
)
plt.show()
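The decomposition drawn in this figure can be checked with a little arithmetic. The sketch below computes the area of each bar when frequency is used as the bar height; the variable names are introduced here for illustration.

```python
import numpy as np

edges = np.array([0, 10, 20, 40])
frequencies = np.array([10, 20, 30])

# bar width = class width, bar height = frequency
widths = np.diff(edges)        # [10 10 20]
areas = widths * frequencies   # [100 200 600]

# area of bar 3 relative to bar 2, versus the ratio of their frequencies
area_ratio = areas[2] / areas[1]
frequency_ratio = frequencies[2] / frequencies[1]
print(area_ratio, frequency_ratio)  # 3.0 1.5
```

The area ratio (3.0) is twice the frequency ratio (1.5), so bar 3 looks twice as large as it should relative to bar 2.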
-
After breaking down bar 3, we can see that its area is three times the area of bar 2.
-
Visually, then, bar 3 suggests a number three times the number represented by bar 2.
-
Bar 2 represents 20, so bar 3 should represent a frequency of 60. But that is not the case: bar 3 represents a frequency of 30, which is only 1.5 times 20.
-
So this is not how a histogram with unequal class widths should be drawn.
-
To solve this issue, we use the frequency density, which is calculated with the following equation:
$\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}}$
-
Frequency density: the frequency per unit of class width, where the unit is the unit of measurement of the data.
-
Adding a frequency density column to our data gives the following table:
Class Interval | Frequency | Frequency Density |
---|---|---|
0 - under 10 | 10 | 10/10 = 1 |
10 - under 20 | 20 | 20/10 = 2 |
20 - under 40 | 30 | 30/20 = 1.5 |
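The same table can be built programmatically. This is a minimal sketch using pandas; the column names (`lower`, `upper`, `frequency`, `class_width`, `frequency_density`) are chosen here for illustration and do not come from the original notebook.

```python
import pandas as pd

# class intervals and frequencies from the table above
table = pd.DataFrame({
    "lower": [0, 10, 20],
    "upper": [10, 20, 40],
    "frequency": [10, 20, 30],
})

# class width = upper boundary - lower boundary
table["class_width"] = table["upper"] - table["lower"]

# frequency density = frequency / class width
table["frequency_density"] = table["frequency"] / table["class_width"]

print(table)
```

The `frequency_density` column holds 1.0, 2.0 and 1.5, matching the hand-computed values.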
- Now we can plot our histogram, with class intervals on the x-axis and frequency density on the y-axis.
# given data
data = [
8, 6, 0, 4, 5, 3, 2, 4, 3, 5,
10, 10, 17, 16, 13, 12, 18, 16, 10, 14, 18, 14, 14, 15, 15, 11, 16, 17, 10, 13,
32, 39, 39, 30, 30, 23, 27, 37, 25, 23, 34, 38, 26, 28, 23, 39, 28,
38, 20, 39, 20, 31, 29, 37, 38, 26, 20, 20, 21, 37
]
# bins
bins = np.array([0, 10, 20, 40])
# class-widths
class_widths = bins[1:] - bins[:-1]
# frequency
frequency = np.histogram(data, bins=bins)[0]
# frequency-density
freq_dens = frequency / class_widths
# create new plot
fig, ax = create_axis(
xticks=np.linspace(0, 40, 5), yticks=np.linspace(0, 3, 4), xlim=(0,41), ylim=(0,3),
ylabel="Frequency Density"
)
# plot bars
ax.fill_between(
    bins.repeat(2)[1:-1], freq_dens.repeat(2),
    fc="#121212", ec="#F2F2F2", hatch=1*'/', lw=1, zorder=1
)
# plot the vertical lines that separate adjacent bars
for i in range(len(freq_dens) - 1):
    ax.plot(
        [bins[i + 1], bins[i + 1]], [0, freq_dens[i]], color="#F2F2F2", zorder=2, lw=1
    )
plt.show()
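An alternative way to draw the same bars, shown here only as a sketch, is `Axes.bar` with one width per class interval. It uses default styling rather than the `create_axis` helper, and it also computes the area of each drawn bar.

```python
import numpy as np
import matplotlib.pyplot as plt

bins = np.array([0, 10, 20, 40])
frequency = np.array([10, 20, 30])

class_widths = np.diff(bins)            # [10 10 20]
freq_dens = frequency / class_widths    # [1.  2.  1.5]

fig, ax = plt.subplots()
# each bar starts at its lower class boundary and spans its class width
bars = ax.bar(
    bins[:-1], freq_dens,
    width=class_widths, align="edge", edgecolor="black"
)
ax.set_xticks(bins)
ax.set_xlabel("Class Interval")
ax.set_ylabel("Frequency Density")

# area of each bar = class width x frequency density
areas = [bar.get_width() * bar.get_height() for bar in bars]
print(areas)  # [10.0, 20.0, 30.0]
```

In a notebook, evaluating `fig` or calling `plt.show()` displays the plot. The computed areas equal the frequencies 10, 20 and 30.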
-
Here, the area of each bar is equal to the frequency of its class interval.
-
Area of bar1 = 10 x 1 = 10
-
Area of bar2 = 10 x 2 = 20
-
Area of bar3 = 20 x 1.5 = 30
-
So, this is how we construct a histogram with unequal class widths.
Questionnaire
Ques 01: Construct a histogram for the following data:
Class Interval | Frequency |
---|---|
30 - under 32 | 5 |
32 - under 34 | 7 |
34 - under 36 | 15 |
36 - under 38 | 21 |
38 - under 40 | 34 |
40 - under 42 | 24 |
42 - under 44 | 17 |
44 - under 46 | 8 |
Ques 02: Construct a histogram for the following data:
Class Interval | Frequency |
---|---|
0 - under 10 | 5 |
10 - under 20 | 7 |
20 - under 25 | 15 |
25 - under 30 | 21 |
30 - under 40 | 34 |
40 - under 60 | 24 |
60 - under 90 | 17 |
90 - under 100 | 8 |
1. Notes are compiled from TLMaths and Business Statistics by Ken Black.