Spring 2020

WHAT IS STATISTICS?

  • “The practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.” - Oxford Dictionary
  • “Statistics is the discipline that concerns the collection, organization, displaying, analysis, interpretation and presentation of data.” - Wikipedia
  • “Statistics is an important tool in the data analysis/science textbook. Statistics provides a coherent framework for thinking about random variation, and tools to partition data into signal and noise.” - Hadley Wickham

ANDERSON’S IRIS DATA

American botanist Edgar Anderson collected length and width measurements of sepals and petals for three related Iris species to quantify their morphologic (structural) variation. Can we use these measurements to distinguish between related flower species? What variables do we need? What do the observations of these variables look like? How can we organize this information in a meaningful way?

https://www.britannica.com/science/sepal/media/1/534976/111702

UK LUNG DISEASE DEATHS, 1974-1979

From 1974 to 1979, monthly deaths from bronchitis, emphysema, and asthma in the UK were recorded. Total deaths were recorded, as well as male and female deaths separately. What variables are present in this data set? What do the observations of these variables look like? How can we organize this information in a meaningful way?

DATA BASICS

TYPES OF VARIABLES

Note:
  • Regular categorical variables are sometimes called nominal.
  • We will not focus on ordinal categorical variables in this class.

TYPES OF VARIABLES

Numerical: can take on a wide range of values; average, sum, and difference of observations of numerical variables should have a clear meaning

  • Continuous: take on numerical values without jumps; measurable but not restricted to taking on certain specified values

  • Discrete: can take on only specified values that differ by fixed amounts; cannot take on any intermediate values, often integers or counts

Categorical: observations fall into categories

TYPES OF VARIABLES

  • Petal length:
  • Sepal length:
  • Species:

TYPES OF VARIABLES

  • Petal length: numerical, continuous
  • Sepal length: numerical, continuous
  • Species: categorical (nominal)
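
The classification above can be checked with a short Python sketch. The values are made up for illustration and are not Anderson's measurements. The point is only that averaging makes sense for the numerical variables, while a categorical variable is summarized by counting observations per category.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations, made up for illustration only.
petal_lengths_cm = [1.4, 4.7, 6.0, 1.3]   # numerical, continuous
sepal_lengths_cm = [5.1, 7.0, 6.3, 4.9]   # numerical, continuous
species = ["setosa", "versicolor", "virginica", "setosa"]  # categorical (nominal)

# Averages of numerical variables have a clear meaning.
mean_petal_length = mean(petal_lengths_cm)
mean_sepal_length = mean(sepal_lengths_cm)

# A categorical variable is summarized by counts per category instead.
species_counts = Counter(species)

print(mean_petal_length, mean_sepal_length, species_counts)
```

There is no meaningful "average species", which is one quick way to tell a categorical variable from a numerical one.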

TYPES OF VARIABLES

  • Time:
  • Total Deaths:

TYPES OF VARIABLES

  • Time: numerical, discrete (measured at specific intervals)
  • Total Deaths: numerical, discrete

PRACTICE

What type of variable is a telephone area code?

  1. numerical, continuous
  2. numerical, discrete
  3. categorical
  4. categorical, ordinal

RELATIONSHIPS BETWEEN VARIABLES

  • Two variables that are connected to each other are associated. Associated variables are also referred to as dependent.
    • If observations of one variable increase as observations of the second variable increase, the two variables have a positive association (Plot 2).
    • If observations of one variable increase as observations of the second variable decrease, the two variables have a negative association (Plot 3).
  • Two variables are independent if they are not associated (Plot 1).
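
One way to put a number on the direction of an association is the correlation coefficient. These slides do not introduce it, so treat the Python sketch below as an illustration under that assumption, using made-up data. The function name `correlation` and all values are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical observations, made up for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]     # tends to increase as x increases
y_down = [6, 4, 5, 3, 1]   # tends to decrease as x increases

r_up = correlation(x, y_up)      # positive sign: positive association
r_down = correlation(x, y_down)  # negative sign: negative association
print(f"{r_up:.2f}", f"{r_down:.2f}")
```

The sign of the coefficient matches the direction of the association. It only captures linear trends, so a value near zero does not by itself show that two variables are independent.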

EXPLANATORY & RESPONSE VARIABLES

In a pair of variables, we want to identify the explanatory variable and the response variable. The explanatory variable is suspected of affecting the response variable.

explanatory variable \(\xrightarrow{\text{might affect}}\) response variable


Note: These labels are only appropriate when there is a hypothesized relationship between two variables (i.e. one variable affects the other). Labeling variables as explanatory and response does not guarantee a causal relationship.

ASSOCIATION \(\neq\) CAUSATION

Even when there is a hypothesized relationship between two variables, this does not mean that the explanatory variable causes the response.

Association does not imply causation. Causation can only be inferred from a randomized experiment.

SAMPLING PRINCIPLES AND STRATEGIES

POPULATIONS & SAMPLES

Finding Your Ideal Running Form

Research question: Can people become better, more efficient runners on their own, simply by running?

Population of interest: All people

Sample: Group of adult women who recently joined a running group

Population to which results can be generalized: Adult women, if the data are randomly sampled

ANECDOTAL EVIDENCE

Consider the following news article: According to 104-year-old woman, Dr. Pepper is the secret to longevity.

Question: Should we drink Dr. Pepper every day? Are there problems with drawing conclusions based on this kind of information?

This is an example of anecdotal evidence, which is collected in a haphazard fashion. It may be true and verifiable, but it may only represent extraordinary cases. In fact, we often remember anecdotal evidence precisely because it is unusual.

EXPLORATORY ANALYSIS TO INFERENCE

  • Sampling is natural. Think about sampling soup you are cooking. You taste (sample) a small part of the soup to get an idea about the dish as a whole.
  • When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that is exploratory analysis.
  • If you generalize and conclude that your entire soup needs salt, that’s an inference.
  • For your inference to be valid, the spoonful (sample) you tasted needs to be representative of the entire pot (population).

EXPLORATORY ANALYSIS TO INFERENCE

Suppose you taste a spoonful of soup from the top of the pot and you think it isn’t salty enough. Should you infer that the soup as a whole isn’t salty enough? In other words, is the sample (spoonful) representative of the population (pot of soup)?

  • No!

Now suppose that you stir the soup thoroughly before you taste a spoonful of it. Now is the sample representative of the population?

  • Yes!
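
The soup analogy can be sketched as a small Python simulation. The pot below is a hypothetical model invented for illustration: saltiness increases with depth, as if the salt had settled. Tasting from the top corresponds to taking the first few portions. Stirring is modeled as giving every portion an equal chance of ending up in the spoon.

```python
import random
from statistics import mean

random.seed(140)  # fixed seed so the sketch is reproducible

# Hypothetical unstirred pot: 1000 portions listed from top to bottom,
# with saltiness rising from 0.0 at the top to 0.999 at the bottom.
pot = [depth / 1000 for depth in range(1000)]
true_mean = mean(pot)

# Tasting only from the top of the unstirred pot.
top_spoonful = pot[:50]

# Stirring first: every portion is equally likely to be tasted.
stirred_spoonful = random.sample(pot, 50)

top_error = abs(mean(top_spoonful) - true_mean)
stirred_error = abs(mean(stirred_spoonful) - true_mean)
print(true_mean, mean(top_spoonful), mean(stirred_spoonful))
```

Under these assumptions the spoonful from the top badly underestimates the saltiness of the pot, while the stirred spoonful lands much closer to the true average.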

SAMPLING BIAS

Suppose 1,000 restaurant patrons are randomly selected (from the population of patrons) for a survey about their experiences, but only 358 people respond. Do you think these respondents are representative of the population?

  • People with really positive or negative experiences are more likely to respond.

Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the responses may not be representative of the population.
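
A minimal Python sketch of how non-response can distort a random sample. The satisfaction scale and the response probabilities are assumptions invented for illustration. They are not taken from the restaurant example, and the simulated response count will not match 358.

```python
import random

random.seed(140)  # fixed seed so the sketch is reproducible

# Hypothetical model: satisfaction on a 1-5 scale, where patrons with
# extreme experiences (1 or 5) are assumed to respond more often.
RESPONSE_PROB = {1: 0.7, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.7}

# 1,000 randomly selected patrons, each with a satisfaction score.
sampled = [random.randint(1, 5) for _ in range(1000)]

# Each sampled patron decides whether to respond.
respondents = [s for s in sampled if random.random() < RESPONSE_PROB[s]]

def share_extreme(scores):
    """Fraction of scores that are 1 or 5."""
    return sum(s in (1, 5) for s in scores) / len(scores)

print(len(respondents), share_extreme(sampled), share_extreme(respondents))
```

Even though the 1,000 patrons were chosen at random, extreme experiences are over-represented among those who actually respond.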

SAMPLING BIAS

Suppose I want to know how Americans feel about gun control. Rather than randomly select individuals from my population of interest to participate in my investigation, I ask the host of a local radio talk show to raise the question on her program. Who do you think will respond? Do you think these respondents are representative of the population?

Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will not be representative of the population.

SAMPLING BIAS

Suppose a homework assignment for STAT 140 was to conduct a survey to determine current Mount Holyoke students’ voting preferences in the upcoming presidential primary races. Student A had a very busy week and forgot to do this assignment until one hour before class. Wanting to complete the assignment, however, they decide to stand outside of Blanchard and ask anyone who passes what their candidate preference is. What is the flaw with this sampling strategy?

Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

CENSUS

Why do we sample from a population? Wouldn’t it be preferable to include everyone from the entire population, i.e. “sample” everyone? This is called a census.

There are problems with taking a census:

  • It can be difficult to complete a census - there are always some individuals that are hard to locate or measure, and these individuals may have characteristics that distinguish them from the rest of the population.
  • Populations change constantly, so it’s never possible to get a perfect measure, even with a census.
  • Taking a census can be harder than sampling.

OBSERVATIONAL STUDIES AND EXPERIMENTS

When researchers collect data in a way that does not directly interfere with how the data arise, this is called an observational study. Researchers observe the sample of interest rather than applying a treatment to it, so they can only establish an association between the explanatory and response variables.

An experiment arises when researchers randomly assign subjects to various treatments in order to establish causal connections between explanatory and response variables.

OBSERVATIONAL STUDIES

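Observational studies come in two forms:

  • Prospective study: identifies individuals and collects information as events unfold
  • Retrospective study: collects data after events have taken place
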
Note: some data sets may have both prospectively- and retrospectively-collected variables.

SAMPLING METHODS

  • Simple random sampling: sample n individuals from a population, where each individual has an equal chance of being included and knowing that one individual is included doesn’t provide any information about what other individuals are included
  • Stratified random sampling: divide population into strata, where each stratum is composed of similar individuals, then sample from within each stratum (usually using simple random sampling)
  • Cluster sampling: break population into groups, called clusters, then sample a fixed number of clusters and include all individuals in those clusters
  • Multistage random sampling: like cluster sampling, but we collect a random sample within each cluster rather than keeping all individuals in the cluster
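
The four methods can be sketched with Python's standard `random` module. The population below (students grouped by residence hall) is hypothetical and made up for illustration. For simplicity the same grouping serves as both strata and clusters, although in practice strata are chosen so that individuals within a group are similar.

```python
import random

random.seed(140)  # fixed seed so the sketch is reproducible

# Hypothetical population: 12 students in 3 residence halls (labels made up).
population = {
    "Hall A": ["a1", "a2", "a3", "a4"],
    "Hall B": ["b1", "b2", "b3", "b4"],
    "Hall C": ["c1", "c2", "c3", "c4"],
}
everyone = [s for hall in population.values() for s in hall]

# Simple random sample: 6 students, every subset of size 6 equally likely.
srs = random.sample(everyone, 6)

# Stratified: treat each hall as a stratum, take a simple random sample of 2 from each.
stratified = [s for hall in population.values() for s in random.sample(hall, 2)]

# Cluster: treat each hall as a cluster, pick 2 halls, keep everyone in them.
chosen_halls = random.sample(list(population), 2)
cluster = [s for h in chosen_halls for s in population[h]]

# Multistage: pick 2 halls, then take a simple random sample of 3 within each.
stage_one = random.sample(list(population), 2)
multistage = [s for h in stage_one for s in random.sample(population[h], 3)]

print(srs, stratified, cluster, multistage, sep="\n")
```

Note that stratified sampling always includes every group, while cluster and multistage sampling leave some groups out entirely.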

EXPERIMENTS

TERMINOLOGY

  • Experiment: study where researchers assign treatments to experimental units
  • Randomized experiment: when the assignment includes randomization; fundamentally important when trying to demonstrate causal connection between two variables

PRINCIPLES OF EXPERIMENTAL DESIGN

  1. Control: compare treatment of interest to a control group
  2. Randomize: randomly assign subjects to treatments, and randomly sample from the population whenever possible
  3. Replicate: within a study, replicate by collecting a sufficiently large sample, or replicate the entire study.
  4. Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.

BLOCKING

We would like to design an experiment to see if eating breakfast improves student performance in STAT 140 classes at Mount Holyoke College during the Fall 2019 semester.

  • Treatment: breakfast
  • Control: no breakfast

We expect that eating breakfast may have a different effect if the class is in the morning (before 12 PM) or in the afternoon (after 12 PM):

  • Divide sample into morning and afternoon students
  • Randomly assign morning students to treatment and control groups
  • Randomly assign afternoon students to treatment and control groups
  • Morning and afternoon students are equally represented in the treatment and control groups
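
The blocking steps above can be sketched in Python. The roster, the student labels, and the helper name `block_randomize` are hypothetical and introduced only for illustration. This is one simple way to randomize within blocks, not a prescribed procedure.

```python
import random

random.seed(140)  # fixed seed so the sketch is reproducible

# Hypothetical roster of (student id, class time), made up for illustration.
students = [(f"s{i}", "morning" if i < 8 else "afternoon") for i in range(16)]

def block_randomize(units, block_of, treatments=("breakfast", "no breakfast")):
    """Randomly assign units to treatments separately within each block."""
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for members in blocks.values():
        # Put the block's members in a random order.
        shuffled = random.sample(members, len(members))
        for position, unit in enumerate(shuffled):
            # Alternate treatments down the shuffled list to balance group sizes.
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Block on class time (the second element of each roster entry).
assignment = block_randomize(students, block_of=lambda s: s[1])

for (student_id, time), treatment in sorted(assignment.items()):
    print(student_id, time, treatment)
```

Because randomization happens inside each block, the treatment and control groups each end up with the same number of morning students and the same number of afternoon students.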

BLOCKING

STAT 140 Breakfast Example:

Are there other blocking variables besides time of day that we should have considered?

  • Instructor
  • Class

PRACTICE

A study is designed to test the effect of light level and noise level on exam performance of students. The researcher also believes that light and noise levels may have different effects on males and females, so both sexes need to be equally represented in each group. Which of the following is correct?

  • There are 3 explanatory variables (light, noise, sex) and 1 response variable (exam performance).
  • There are 2 explanatory variables (light, noise), 1 blocking variable (sex), and 1 response variable (exam performance).
  • There is 1 explanatory variable (sex) and 3 response variables (light, noise, exam performance).
  • There are 2 blocking variables (light and noise), 1 explanatory variable (sex), and 1 response variable (exam performance).

PRACTICE

What is the main difference between experiments and observational studies?

  • Experiments take place in a lab while observational studies do not need to.
  • In an observational study, we only look at what happened in the past.
  • Most experiments use random assignment while observational studies do not.
  • Observational studies are completely useless since no causal inference can be made based on their findings.

REDUCING BIAS IN HUMAN EXPERIMENTS

  • Placebo: fake treatment, often used as the control group in medical studies
  • Placebo effect: experimental units show improvement simply because they believe they are receiving a special treatment
  • Blind: when experimental units do not know whether they are in the control or treatment group
  • Double blind: when neither the experimental units nor the researchers who interact with them know who is in the control group and who is in the treatment group(s)
