## required packages/modules
import numpy as np
import pandas as pd

What is a Frequency Distribution

  • A frequency distribution is an overview of all distinct values in the data set and the number of times they occur.

  • It is a tool for grouping data.

  • In a frequency distribution, the data is presented in the form of class intervals and their frequencies.

  • We define frequency in statistics as the number of times the observation occurred/recorded in an experiment or study.

    • For example, if we record the test scores of a class of 50 students, the total frequency is 50, and the frequency of any single score is the number of students who obtained it.
  • The range of each group of data is called class interval.

    • For example, we record the test scores of a class. The marks are between 0 and 100. We put these marks into groups.

    • A class interval of 20 marks has these groups

      • 0 to just below 20

      • 20 to just below 40

      • 40 to just below 60

      • 60 to just below 80

      • 80 to 100

    • But a class interval of 50 marks has these groups

      • 0 to just below 50

      • 50 to 100
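
The two groupings above can be sketched with `np.histogram`. The marks below are made up purely for illustration and are not part of the original notes. Note that `np.histogram` counts the upper edge of the last class as part of that class, which matches the "80 to 100" and "50 to 100" groups.

```python
import numpy as np

# Hypothetical marks, invented only to illustrate the grouping
marks = np.array([5, 19, 20, 39, 40, 61, 79, 80, 100])

# Class interval of 20 marks: 0-<20, 20-<40, 40-<60, 60-<80, 80-100
counts_20, edges_20 = np.histogram(marks, bins=[0, 20, 40, 60, 80, 100])

# Class interval of 50 marks: 0-<50, 50-100
counts_50, edges_50 = np.histogram(marks, bins=[0, 50, 100])

print(counts_20)  # frequency of each 20-mark class
print(counts_50)  # frequency of each 50-mark class
```

The same data produces a finer or coarser summary depending on the class interval chosen.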

Construction

  • Let's say we record the test scores of a class and have the following result:
test_scores = [52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64, 
               74, 82, 94, 71, 79, 73, 94, 77, 53, 
               77, 87, 97, 57, 72, 89, 76, 91, 86, 
               99, 71, 73, 58, 76, 33, 78, 69]
  • When constructing a frequency distribution, we first determine the range of the data.

    • We define the range as the difference between the largest and smallest numbers in the data set.

    • So, in our data set, the range is $99-33=66$.

  • The second step is to determine how many classes the distribution will contain.

    • Too many or too few classes might not reveal the basic shape of the data set, and such a frequency distribution is difficult to interpret.

    • The ideal number of classes may be estimated by one of these formulas:

      • number of classes $C = 1 + 3.3 \log_{10} n$, or

      • number of classes $C = \sqrt{n}$, where $n$ is the total number of observations in the data.

      • However, these formulas are guidelines rather than hard rules, and the number of classes they suggest may not suit every data set.

      • As a rule of thumb, select between 5 and 15 classes.

      • For our data set ($n = 37$), the first formula gives $1 + 3.3 \log_{10} 37 \approx 6.2$ and the second gives $\sqrt{37} \approx 6.1$, so we use 6 classes.

  • After selecting the number of classes, we determine the width of the class interval.

    • We calculate the class width by dividing the range by the number of classes.

    • $h = \frac{\text{range}}{\text{number of classes}}$

    • For our data, the class width is $66/6 = 11$.

  • The frequency distribution must start at a value equal to or lower than the lowest number of the ungrouped data and end at a value equal to or higher than the highest number.

    • The lowest test score is 33, and the highest is 99. So, our frequency distribution will start at 33 and end at 99.
  • At the end, the frequency distribution will look like this:

| Class Interval | Frequency |
|---|---|
| 33 - under 44 | 1 |
| 44 - under 55 | 2 |
| 55 - under 66 | 6 |
| 66 - under 77 | 11 |
| 77 - under 88 | 9 |
| 88 - 99 (inclusive) | 8 |
| Total | 37 |

Note: The above table is only one way of making a frequency distribution for the given data. If you pick a different start value, end value, class width, or number of classes, the table will look different.
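
The construction steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the only way to do it; the variable names are chosen here for readability, and rounding the square-root estimate to the nearest integer is an assumption.

```python
import numpy as np

test_scores = np.array([
    52, 92, 84, 74, 65, 55, 78, 95, 62, 72, 64,
    74, 82, 94, 71, 79, 73, 94, 77, 53,
    77, 87, 97, 57, 72, 89, 76, 91, 86,
    99, 71, 73, 58, 76, 33, 78, 69,
])

n = len(test_scores)

# Step 1: range = largest value - smallest value
data_range = test_scores.max() - test_scores.min()

# Step 2: estimate the number of classes with both formulas
classes_log = 1 + 3.3 * np.log10(n)   # roughly 6.2
classes_sqrt = np.sqrt(n)             # roughly 6.1
num_classes = int(round(classes_sqrt))

# Step 3: class width = range / number of classes
width = data_range / num_classes

# Step 4: class endpoints from the lowest to the highest value
edges = np.linspace(test_scores.min(), test_scores.max(), num_classes + 1)

# Step 5: count the observations in each class
frequencies, _ = np.histogram(test_scores, bins=edges)

print(data_range, num_classes, width)
print(edges)
print(frequencies)
```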

Class Midpoint

  • The midpoint of each class interval is called the class midpoint and is sometimes referred to as the class mark.

  • It can be calculated as the average of the two class endpoints.

  • For example, in the above distribution, the midpoint of the class interval 33 - under 44 is $(33+44)/2 = 38.5$

  • If we include class midpoint in the above table, the final table will look like this:

| Class Interval | Frequency | Class Midpoint |
|---|---|---|
| 33 - under 44 | 1 | 38.5 |
| 44 - under 55 | 2 | 49.5 |
| 55 - under 66 | 6 | 60.5 |
| 66 - under 77 | 11 | 71.5 |
| 77 - under 88 | 9 | 82.5 |
| 88 - 99 (inclusive) | 8 | 93.5 |
| Total | 37 | - |

  • If we subtract any two consecutive class midpoints and take the absolute value, we get the class width of our distribution (not the range of the data, which is 66).

    • e.g., $|71.5 - 60.5| = 11$ or $|49.5 - 60.5| = 11$, which is our class width.
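
As a quick check, the midpoints and the gap of 11 between consecutive midpoints can be computed directly from the class endpoints. A minimal sketch, with `edges` typed in by hand from the table above:

```python
import numpy as np

# Class endpoints taken from the frequency distribution above
edges = np.array([33, 44, 55, 66, 77, 88, 99])

# Midpoint of each class = average of its two endpoints
midpoints = (edges[:-1] + edges[1:]) / 2

# Absolute difference between consecutive midpoints
gaps = np.abs(np.diff(midpoints))

print(midpoints)
print(gaps)
```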

Relative Frequency

  • Relative frequency is the proportion of the total frequency that is in any given class interval in a frequency distribution.

  • Relative frequency is the individual class frequency divided by the total frequency.

  • For example, the relative class frequency of class 66 - under 77 is $11 / 37 = 0.297$

    Important: If we select a value at random from the data summarized above, the probability that it falls in the class "66 - under 77" is 0.297, the relative frequency of that class interval.
  • If we include relative frequency in our frequency distribution, the table will look like this:

| Class Interval | Frequency | Class Midpoint | Relative Frequency |
|---|---|---|---|
| 33 - under 44 | 1 | 38.5 | 0.02703 |
| 44 - under 55 | 2 | 49.5 | 0.05405 |
| 55 - under 66 | 6 | 60.5 | 0.16216 |
| 66 - under 77 | 11 | 71.5 | 0.29730 |
| 77 - under 88 | 9 | 82.5 | 0.24324 |
| 88 - 99 (inclusive) | 8 | 93.5 | 0.21622 |
| Total | 37 | - | - |
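
The relative frequencies can be reproduced by dividing each class frequency by the total. A minimal sketch, with the frequencies typed in by hand from the table; the values are rounded to five decimals here, so the last digit can differ from a truncated value.

```python
import numpy as np

# Class frequencies from the frequency distribution above
frequencies = np.array([1, 2, 6, 11, 9, 8])

# Relative frequency = class frequency / total frequency
relative = frequencies / frequencies.sum()

print(np.round(relative, 5))
print(relative.sum())  # relative frequencies always add up to 1
```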

Cumulative Frequency

  • The cumulative frequency is a running total of frequencies through the classes of a frequency distribution.

  • The cumulative frequency for each class interval is the frequency of that class interval added to the preceding cumulative total.

  • The cumulative frequency for the first class is the same as the class frequency: 1. The cumulative frequency for the second class interval is the frequency of that interval (2) plus the frequency of the first interval (1), which yields a new cumulative frequency of 3.

  • This process continues through the last interval, at which point the cumulative total equals the sum of the frequencies (37).

  • After adding the cumulative frequency column to our frequency distribution, the table will look like this:

| Class Interval | Frequency | Class Midpoint | Relative Frequency | Cumulative Frequency |
|---|---|---|---|---|
| 33 - under 44 | 1 | 38.5 | 0.02703 | 1 |
| 44 - under 55 | 2 | 49.5 | 0.05405 | 3 |
| 55 - under 66 | 6 | 60.5 | 0.16216 | 9 |
| 66 - under 77 | 11 | 71.5 | 0.29730 | 20 |
| 77 - under 88 | 9 | 82.5 | 0.24324 | 29 |
| 88 - 99 (inclusive) | 8 | 93.5 | 0.21622 | 37 |
| Total | 37 | - | - | - |
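
The running total described above is exactly what `np.cumsum` computes. A minimal sketch, again with the frequencies typed in by hand:

```python
import numpy as np

# Class frequencies from the frequency distribution above
frequencies = np.array([1, 2, 6, 11, 9, 8])

# Each entry is the class frequency added to the preceding cumulative total
cumulative = np.cumsum(frequencies)

print(cumulative)
```

The last entry equals the total number of observations, which is a handy sanity check on any frequency table.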

## frequency distribution using Python

def make_frequency_distribution(data, user_input=None, extra=True):
    """
    Function to make frequency distribution.
    
    Args:
        data (numpy.array): data containing records.
        user_input (tuple, optional): 
                    user_input for start_value, end_value, total_classes.
                    Defaults to None.
        extra (bool, optional): to make extra columns like cumulative, relative frequency.
    
    Returns:
        pandas.DataFrame: required frequency distribution.
    """
    ## total number of observations
    length = len(data)

    ## lowest and highest number in the data
    lowest = min(data)
    highest = max(data)
    
    ## total number of class 
    if user_input is None:
        total_classes = int(np.sqrt(length))
    else:
        lowest, highest, total_classes = user_input

    ## range of the data
    range_ = highest - lowest
    
    print(f"Start value: {lowest}")
    print(f"End value: {highest}")
    print(f"Range: {range_}")
    print(f"Total Number of Classes: {total_classes}")
    
    ## calculate width
    width = range_ / total_classes
    
    ## list of all class intervals
    class_intervals = [
        np.round(start,3) for start in np.linspace(lowest, highest, total_classes+1)
    ]
    
    print(f"Class Width = {np.round(width, 3)}", end="\n\n")
    
    ## calculate frequency for each class
    hist, _ = np.histogram(data, bins=class_intervals)
    
    ## frequency table
    df = pd.DataFrame(
        {
            "Class Intervals": [
                f"{first} - under {second}" \
                for first, second in zip(class_intervals, class_intervals[1:])
            ],
            "Frequency": hist
        }
    )
    
    if extra:
        ## class midpoint: average of the two class endpoints,
        ## computed from the numeric endpoints rather than by parsing the labels
        df["Class Midpoint"] = [
            (first + second) / 2
            for first, second in zip(class_intervals, class_intervals[1:])
        ]

        ## relative frequency
        df["Relative Frequency"] = df["Frequency"] / df["Frequency"].sum()

        ## cumulative frequency
        df["Cumulative Frequency"] = df["Frequency"].cumsum()

    return df

## data
test_scores = np.array([
    52, 92, 84, 74, 65, 55, 78, 95, 62, 
    72, 64, 74, 82, 94, 71, 79, 73, 94, 
    77, 53, 77, 87, 97, 57, 72, 89, 76, 
    91, 86, 99, 71, 73, 58, 76, 33, 78, 69
])

## without specifying user input
make_frequency_distribution(test_scores)
Start value: 33
End value: 99
Range: 66
Total Number of Classes: 6
Class Width = 11.0

|   | Class Intervals | Frequency | Class Midpoint | Relative Frequency | Cumulative Frequency |
|---|---|---|---|---|---|
| 0 | 33.0 - under 44.0 | 1 | 38.5 | 0.027027 | 1 |
| 1 | 44.0 - under 55.0 | 2 | 49.5 | 0.054054 | 3 |
| 2 | 55.0 - under 66.0 | 6 | 60.5 | 0.162162 | 9 |
| 3 | 66.0 - under 77.0 | 11 | 71.5 | 0.297297 | 20 |
| 4 | 77.0 - under 88.0 | 9 | 82.5 | 0.243243 | 29 |
| 5 | 88.0 - under 99.0 | 8 | 93.5 | 0.216216 | 37 |

## with user input specified (a different frequency distribution for the same data)
start_value = 30
end_value = 100
total_classes = 7

## make a tuple
user_input = (
    start_value, end_value, total_classes
)

## make the distribution
make_frequency_distribution(test_scores, user_input)
Start value: 30
End value: 100
Range: 70
Total Number of Classes: 7
Class Width = 10.0

|   | Class Intervals | Frequency | Class Midpoint | Relative Frequency | Cumulative Frequency |
|---|---|---|---|---|---|
| 0 | 30.0 - under 40.0 | 1 | 35.0 | 0.027027 | 1 |
| 1 | 40.0 - under 50.0 | 0 | 45.0 | 0.000000 | 1 |
| 2 | 50.0 - under 60.0 | 5 | 55.0 | 0.135135 | 6 |
| 3 | 60.0 - under 70.0 | 4 | 65.0 | 0.108108 | 10 |
| 4 | 70.0 - under 80.0 | 15 | 75.0 | 0.405405 | 25 |
| 5 | 80.0 - under 90.0 | 5 | 85.0 | 0.135135 | 30 |
| 6 | 90.0 - under 100.0 | 7 | 95.0 | 0.189189 | 37 |
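
One detail about the outputs above is worth keeping in mind: `np.histogram` treats every bin as "lower endpoint to just below upper endpoint" except the last one, which also includes its upper endpoint. That is why the score 99 is counted in the class labelled "88.0 - under 99.0" in the first output. A small sketch of this behaviour, using two hand-picked values:

```python
import numpy as np

edges = [33, 44, 55, 66, 77, 88, 99]

# 88 is the lower endpoint of the last bin, 99 is its upper endpoint:
# both are counted in the last bin
counts_last, _ = np.histogram([88, 99], bins=edges)

# 44 is the upper endpoint of the first bin but is counted in the second bin
counts_inner, _ = np.histogram([44], bins=edges)

print(counts_last)
print(counts_inner)
```

If the data maximum should not share a class with smaller values, passing an end value slightly above the maximum through `user_input` is one way to avoid it.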

Questionnaire

Ques 01: The following data are the average weekly mortgage interest rates for a 40-week period.

   7.29    6.69    6.90    7.03    7.28    7.17    7.40    6.97

   7.23    6.77    7.16    6.90    7.31    6.78    6.35    6.96

   7.11    6.57    7.30    7.16    6.87    7.08    6.96    7.02

   6.78    6.80    7.24    7.40    7.68    7.12    7.29    7.13

   7.47    6.88    7.16    7.05    7.03    7.31    7.16    6.84

Construct a frequency distribution for these data. Calculate and display the class midpoints, relative frequencies, and cumulative frequencies for this frequency distribution.

Ques 02: The following data represent the afternoon high temperatures for 50 construction days during a year in St. Louis.

  42  55  16  38  31   70  84  40  79  38

  64  10  81  35  52   47  24  15  36  16

  66  45  35  23  81   69  31  17  64  12

  73  62  40  75  61   38  47  36  53  43

  48  63  44  31  30   25  84  17  60  33

  2.1. Construct a frequency distribution for the data using five class intervals.

  2.2. Construct a frequency distribution for the data using 10 class intervals.

  2.3. Examine the results of (2.1) and (2.2) and comment on the usefulness of each frequency distribution in terms of its temperature summarization capability.

Ques 03: A packaging process is supposed to fill small boxes of raisins with approximately 50 raisins so that each box will weigh the same. However, the number of raisins in each box will vary. Suppose 100 boxes of raisins are randomly sampled, the raisins counted, and the following data are obtained.

  57  44  49  49  51  54  55  46  59  47  51  53

  49  52  48  46  53  59  53  52  53  45  44  49

  55  51  50  57  45  48  52  57  54  54  53  48

  47  47  45  50  50  39  46  57  55  53  57  61

  56  45  60  53  52  52  47  56  49  60  40  56

  51  58  55  52  53  48  43  49  46  47  51  47

  54  53  43  47  58  53  49  47  52  51  47  49

  48  49  52  41  50  48  52  48  53  47  46  57

  44  48  57  46

Construct a frequency distribution for these data. What does the frequency distribution reveal about the box fills?

Ques 04: The owner of a fast-food restaurant ascertains the ages of a sample of customers. From these data, the owner constructs the frequency distribution shown.

| Class Interval | Frequency |
|---|---|
| 0 - under 5 | 6 |
| 5 - under 10 | 8 |
| 10 - under 15 | 17 |
| 15 - under 20 | 23 |
| 20 - under 25 | 18 |
| 25 - under 30 | 10 |
| 30 - under 35 | 4 |

For each class interval of the frequency distribution, determine the relative frequency. What do the relative frequencies tell the fast-food restaurant owner about customer ages?

2. Notes are compiled from Wikipedia and Business Statistics by Ken Black
