## required packages/modules

import random
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from IPython.display import display, HTML

## format output
CSS = """
.output {
  margin-left:100px;
}
"""

HTML('<style>{}</style>'.format(CSS))

Data, Information and Knowledge

Data

Data is a collection of raw text, numbers and symbols that carries no meaning on its own.
It therefore has to be processed or placed in a context to become meaningful.
Example:
- 161.2, 175.3, 166.4, 164.7, 169.3 (units in cm).
- Cat, dog, gerbil, rabbit, cockatoo.

Information

Information is the result of processing data. It enables the processed data to be used in a context and have a meaning.
In simpler words, information is data that has meaning.
If we put information into an equation with data, it will look like this:

Data + Meaning = Information
Example:
- 161.2, 175.3, 166.4, 164.7, 169.3 are the heights of the five tallest 15-year-old students in a class.
- Cat, dog, gerbil, rabbit, cockatoo is a list of household pets.

Knowledge

Knowledge is what we come to know or learn by applying the given information.
If we put knowledge into an equation with information, it will look like this:

Information + Application or Use = Knowledge
Example:
- The tallest student is 175.3 cm.
- A lion is not a household pet as it is not in the list, and it lives in the wild.

Conclusion

Data is a collection of facts. Information is how you understand those facts in context. Knowledge is learning something from the given information.
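As a rough illustration of this progression, the sketch below takes the height values from the Data example, attaches a context to them (information), and then uses them to answer a question (knowledge). The variable names and the dictionary layout are illustrative assumptions, not part of the original notes.

```python
# data: raw numbers with no context attached
data = [161.2, 175.3, 166.4, 164.7, 169.3]

# information: the same numbers plus a meaning (what they measure, and in which unit)
information = {
    "description": "heights of the five tallest 15-year-old students in a class",
    "unit": "cm",
    "values": data,
}

# knowledge: something learned by using the information
tallest = max(information["values"])
print(f"The tallest student is {tallest} {information['unit']}.")
```

The numbers never change between the three steps; only the context and the use we make of them do.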

Grouped and Ungrouped Data

We often describe data as grouped and ungrouped.
Raw data or data that have not been summarised in any way are called ungrouped data.
Example of ungrouped data: The list below contains the test scores of a class.
- Test scores: [85, 37, 56, 87, 75, 67, 58, 62, 77, 94, 95, 88]
Data that have been organized into a frequency distribution are called grouped data.
Example of grouped data: Frequency distribution of the test scores of 37 students.

pd.DataFrame(
    {
        "Test Scores": [
            "90—99", "80—89", "70—79", "60—69", "50—59", "40—49", "30—39", "Total"
        ],
        
        "Number of Students(Frequency)": [
            7, 5, 15, 4, 5, 0, 1, 37
        ]
    }
)
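One possible way to turn ungrouped data into grouped data is sketched below. It bins the 12 raw test scores listed earlier into classes of width 10 using `pd.cut`. The class boundaries are an assumption chosen to match the labels in the table, and because only 12 scores are used, the counts differ from the 37-student table.

```python
import pandas as pd

# ungrouped data: the raw test scores from the ungrouped example
scores = pd.Series([85, 37, 56, 87, 75, 67, 58, 62, 77, 94, 95, 88])

# class boundaries 30, 40, ..., 100; each class is closed on the left, e.g. [80, 90)
bins = list(range(30, 101, 10))
labels = [f"{low}-{low + 9}" for low in bins[:-1]]

# assign each score to its class, then count the scores per class
classes = pd.cut(scores, bins=bins, labels=labels, right=False)
frequency = classes.value_counts().sort_index()

print(frequency)
```

Because `classes` is categorical, a class with no scores (here 40-49) still appears in the result with a frequency of 0.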

Individuals and Variables

Defining The Terms

Individuals are people or objects included in a study.
- e.g. five individuals could be five people, five records or five reports.

A variable is a characteristic of the individual to be measured or observed.
- e.g. age, time etc.

Example

Example: Millions of Americans rely on caffeine to get them up in the morning. The data below shows nutritional content of some popular drinks at Ben's Beans coffee shop.

## all imports are present at the start of the notebook

df = pd.DataFrame(
    {
        "Drink": ["Brewed Coffee", "Caffe Latte", "Caffe Mocha", "Cappuccino", "Iced Brewed Coffee", "Chai Latte"],
        "Type": ["Hot", "Hot", "Hot", "Hot", "Cold", "Hot"],
        "Calories": [4, 100, 170, 60, 60, 120],
        "Sugar (g)": [0, 14, 27, 8, 15, 25],
        "Caffeine (mg)": [260, 75, 95, 75, 120, 60]
    }
)

Solution:
- Individuals in the data set: All Ben's Beans drinks.
- Variables in the data set: Type, Calories, Sugar (g), Caffeine (mg)
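In a table like this one, each row usually corresponds to an individual and each column to a variable. The sketch below reads both off a shortened copy of the drinks table; treating the "Drink" column as the label of each individual, rather than as a variable, is an assumption that mirrors the solution above.

```python
import pandas as pd

# a shortened copy of the Ben's Beans table (first three drinks only)
df = pd.DataFrame(
    {
        "Drink": ["Brewed Coffee", "Caffe Latte", "Caffe Mocha"],
        "Type": ["Hot", "Hot", "Hot"],
        "Calories": [4, 100, 170],
        "Sugar (g)": [0, 14, 27],
        "Caffeine (mg)": [260, 75, 95],
    }
)

# each row is one individual; the "Drink" column names it
individuals = df["Drink"].tolist()

# every other column is a variable recorded for each individual
variables = [col for col in df.columns if col != "Drink"]

print("Individuals:", individuals)
print("Variables:", variables)
```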

Population and Sample

Defining The Terms

A population is the entire group that we want to draw conclusions about.
A sample is the smaller portion of the population that we actually collect data from.
In research, a population doesn't always refer to people. It can mean a group containing an element of anything we want to study, such as objects, events, organisations, countries, species, organisms etc.

## all imports are present at the start of the notebook

def get_coord():
    """
    Function to get random coordinates.
    
    Returns:
        tuple: containing two lists (x-coordinates, y-coordinates).
    """
    # radius of the circle
    circle_r = 5

    # center of the circle (x, y)
    circle_x = 5
    circle_y = 5

    ## x and y coord list
    x_coord, y_coord = [], []
    
    ## iterate for 100 times
    for i in range(100):
        # random angle
        alpha = 2 * math.pi * random.random()

        # random radius
        r = circle_r * math.sqrt(random.random())

        # calculating coordinates
        x = r * math.cos(alpha) + circle_x
        y = r * math.sin(alpha) + circle_y
        
        ## append the results
        x_coord.append(x)
        y_coord.append(y)
    
    return x_coord, y_coord

## generate the random coordinates used by the scatter plot below
x_coord, y_coord = get_coord()

## create subplots
fig, ax = plt.subplots(figsize=(12,8), facecolor="#121212")
ax.set_facecolor("#121212")

## set axis limit
ax.set(xlim=(-0.5,10.5), ylim=(-3.5,14))

## scatter points using random generated coordinates
ax.scatter(x_coord, y_coord, s=100, facecolors='none', edgecolors="#CF6679", hatch=5*'/')

## add 1st ellipse --> for population
circle = plt.Circle((5, 5), radius=5.2, ec="#F2F2F2", fc="none")
ax.add_artist(circle)

## add 2nd ellipse --> for sample
circle_2 = plt.Circle((6.5, 4), radius=2.1, ec="#F2F2F2", fc="none")
ax.add_artist(circle_2)

## style for arrows
style = "Simple, tail_width=0.5, head_width=4, head_length=10"
kw = dict(arrowstyle=style, color="#D3D3D3")

## plot arrow for population
a1 = patches.FancyArrowPatch((2,11), (3,8), connectionstyle="arc3,rad=.7", **kw)
ax.add_patch(a1)

## plot arrow for sample
a1 = patches.FancyArrowPatch((6.5,-1), (6,3), connectionstyle="arc3,rad=-.7", **kw)
ax.add_patch(a1)

## text for population and sample
textstr_1 = "Population: all the dots inside the big ellipse."
textstr_2 = "Sample: all the dots inside\n              the small ellipse."

## props for text
props = dict(boxstyle='round', facecolor='none', edgecolor="#D3D3D3", alpha=0.8)

## place text for population
ax.text(
    2, 11.5, textstr_1, color="#F2F2F2", size=12,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1"), zorder=2
)

## place text for sample
ax.text(
    6.67, -1.5, textstr_2, color="#F2F2F2", size=12,
    bbox=dict(facecolor="none", edgecolor="#D3D3D3", boxstyle="round,pad=1"), zorder=2
)

## credits
ax.text(
    10, -3.45, "graphic: @slothfulwave612", fontsize=10, fontstyle="italic",
    ha="right", va="center", color="#D3D3D3"
)

## tidy axis
ax.axis("off")

## show the plot
plt.show()

Collecting data from a population

Population data are used when we have access to data from every member of the population.
Usually, it is only straightforward to collect data from a whole population when it is small, accessible and cooperative.

Example: Collecting data from a population
- A high school administrator wants to analyze the final exam scores of all graduating seniors to see if there is a trend. Since they are only interested in applying their findings to the graduating seniors in high school, they use the whole population dataset.

For larger and more dispersed populations, it is often difficult or impossible to collect data from every individual.
- For example, every ten years, the Indian government aims to count every person living in the country using the Indian Census. This data is used in many ways. But it is always difficult to collect data from the whole population. Because of non-responses, the population count is incomplete, and the data can be biased towards some groups.
In cases like this, sampling can be used to make more precise inferences about the population.

Collecting data from a sample

When our population is large, geographically dispersed, or difficult to contact, it is necessary to use a sample.
We can use sample data to make estimates or test hypotheses about population data.
- A hypothesis is a precise, testable statement of what the researcher(s) predict will be the outcome of the study.
- We will be learning more about hypothesis testing in future tutorials.

Example: Collecting data from a sample
- We want to study political attitudes in young people. Our population is the 300,000 undergraduate students in India. Because it’s not practical to collect data from all of them, we use a sample of 1000 undergraduate volunteers from three Indian universities – this is the group who will complete our online survey.
Ideally, a sample should be randomly selected and representative of the population.
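A minimal sketch of random selection is shown below, assuming we had a complete list of all 300,000 students (a sampling frame), which the volunteer survey in the example does not have. The integer IDs are hypothetical stand-ins for real student records.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# hypothetical sampling frame: one ID per undergraduate student
population_ids = range(300_000)

# simple random sample of 1000 students, drawn without replacement
sample_ids = random.sample(population_ids, k=1000)

print(len(sample_ids), "students selected out of", len(population_ids))
```

With `random.sample`, every student has the same chance of being selected and nobody is picked twice, which is what makes the sample more likely to be representative than a group of volunteers.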

Reasons for sampling

Necessity: Sometimes it’s simply not possible to study the whole population due to its size or inaccessibility.
Practicality: It’s easier and more efficient to collect data from a sample.
Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs involved.
Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.

Population Parameter and Sample Statistics

Defining The Terms

When we collect data from a population or a sample, there are various measurements and numbers we can calculate from the data.
A parameter is a measure that describes the whole population.
A statistic is a measure that describes the sample.

Example

In our study of students’ political attitudes, we ask our survey participants to rate themselves on a scale from 1, very liberal, to 10, very conservative.
We find that most of our sample identifies as liberal – the mean rating on the political attitudes scale is 3.2.
We can use this statistic, the sample mean of 3.2, to make a scientific guess about the population parameter – that is, to infer the mean political attitude rating of all undergraduate students in India.
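The relationship between a parameter and a statistic can be sketched with simulated data. The ratings below are randomly generated, not real survey results, and the uniform distribution is an arbitrary assumption. The point is only that the sample mean is computed from 1000 values yet lands close to the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# simulated (not real) ratings on the 1-10 scale for a population of 300,000 students
population = rng.integers(1, 11, size=300_000)

# ratings of 1000 students drawn at random, without replacement
sample = rng.choice(population, size=1000, replace=False)

mu = population.mean()   # parameter: describes the whole population
x_bar = sample.mean()    # statistic: describes only the sample

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

In a real study the parameter is unknown, so only the second number would be available; the simulation lets us see both side by side.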

Notations

Population parameters are usually denoted by Greek letters.
- Examples of population parameters are population mean ($\mu$), population variance ($\sigma^{2}$), and population standard deviation ($\sigma$).
Sample statistics are usually denoted by Roman letters.
- Examples of sample statistics are sample mean ($\overline{x}$), sample variance ($s^{2}$), and sample standard deviation ($s$).
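The notation also signals a difference in how some of these measures are usually computed. The sketch below uses the five heights from the Data example and assumes the common convention that the population variance divides by $N$ while the sample variance divides by $n - 1$; in NumPy this is controlled by the `ddof` argument.

```python
import numpy as np

heights = np.array([161.2, 175.3, 166.4, 164.7, 169.3])

# treating the five values as a whole population: divide by N (ddof=0)
mu = heights.mean()
sigma2 = heights.var(ddof=0)
sigma = heights.std(ddof=0)

# treating the same values as a sample: divide by n - 1 (ddof=1)
x_bar = heights.mean()
s2 = heights.var(ddof=1)
s = heights.std(ddof=1)

print(f"population: mu={mu:.2f}, sigma^2={sigma2:.2f}, sigma={sigma:.2f}")
print(f"sample:     x_bar={x_bar:.2f}, s^2={s2:.2f}, s={s:.2f}")
```

The two means are identical; only the variance and the standard deviation change with the divisor.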

Questionnaire

Ques 01. Write an example showing the difference between data, information and knowledge.

Ques 02. 5, 10, 15, 20 are items of data. Explain how these could become information and what knowledge could be gained from them.

Ques 03. A report analysed a dataset of 77 breakfast cereals. Here is a part of the dataset.

pd.DataFrame(
    {
        "Name": [
            "100% Bran", "100% Natural Bran", "All-Bran", "All-Bran Extra Fiber",
            "Almond Delight", "Apple Cinnamon Cheerios", "Apple Jacks"
        ],
        "Manufacturer": [
            "Nabisco", "Quaker Oats", "Kelloggs", "Kelloggs", "Ralston Purina", 
            "General Mills", "Kelloggs"
        ],
        "Calories": [
            70, 120, 70, 50, 110, 110, 110
        ],
        "Sodium":[
            130, 15, 260, 140, 200, 180, 125
        ],
        "Fat": [
            1, 5, 1, 0, 2, 2, 0
        ]
    }
)

3.1. Who are the individuals described in this data?

3.2. List the variables in the data.

Ques 04. A market researcher surveys 85 people on their coffee-drinking habits. She wants to know whether people in the local region are willing to switch their regular drink to something new. What is the sample?

Ques 05. The market researcher analyzes the data and finds that 61% of survey respondents are willing to switch their regular drink to something new. What is the 61% referred to as?

Ques 06. Administrators at Riverview High School surveyed a random sample of 100 of their seniors to see how they felt about the lunch offerings at the school's cafeteria. Identify the population and sample in this setting.

Ques 07. A safety inspector conducts air quality tests on a randomly selected group of 7 classrooms at an elementary school. Identify the population and sample in this setting.

Ques 08. The state Department of Transportation wants to know about out-of-state vehicles that pass over a toll bridge with several lanes. A camera installed over one lane of the bridge photographs the license plate of every tenth vehicle that passes through that lane. Identify the population and sample in this setting.

Ques 09. A pediatrician randomly selected 10 parents of his patients. Then he surveyed the parents about their opinions of different kinds of diapers. Identify the population and sample in this setting.

Ques 10. A factory overseer selects 40 threaded rods at random from those produced that week at the factory, then she tests their tensile strength. Identify the population and sample in this setting.

Ques 11. Why do we need sampling?

1. Solutions

2. Notes are compiled from freeCodeCamp.org, Khan Academy and results from Google.

3. If you face any problem or have any feedback/suggestions, feel free to comment.
