SPRING 2020

Fitting a line, residuals, and correlation

MODELING NUMERICAL VARIABLES

In these notes, we will focus on quantifying the relationship between a

  • numerical response variable and

  • a numerical explanatory variable

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

The scatterplot below shows the relationship between HS graduate rate in all 3083 US counties and the percent of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

  • Response variable: percent in poverty
  • Explanatory variable: percent high school grad
  • Relationship: linear, negative, moderately strong

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

The linear model for predicting poverty from high school graduation rate in the US is \[\widehat{poverty}=64.594-0.591\times hs\_grad\] The “hat” is used to signify this is an estimate.

The high school graduate rate in Hampshire County, MA is 92.4%. What poverty level does the model predict for this county?

\[\widehat{poverty}=64.594-0.591\times 92.4\approx 9.99\]

The model predicts a poverty rate of roughly 10% for Hampshire County. (Carrying unrounded coefficient estimates gives the value 10.009 used below.)

WHICH LINE IS IT ANYWAY?

RESIDUALS

Residuals are the leftovers from the model fit:

\[Data = Fit + Residual\]

A residual is the difference between the observed value (\(y_i\)) and the predicted value (\(\hat{y}_i\)).

\[\underbrace{\hat{e}_i}_{residual}=\underbrace{y_i}_{data}-\underbrace{\hat{y}_i}_{fit}\]

  • Hampshire County, MA: residual \(= y - \hat{y} = 11.7 - 10.009 = 1.691\)
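
A minimal Python sketch of the prediction and residual calculations above. The observed poverty rate of 11.7% is implied by observed = prediction + residual; everything else comes from the fitted model:

    # Coefficients of the fitted line from these notes
    b0, b1 = 64.594, 0.591

    def predict_poverty(hs_grad):
        """Predicted percent of residents in poverty for a given HS grad rate."""
        return b0 - b1 * hs_grad

    hs_grad = 92.4    # Hampshire County, MA
    observed = 11.7   # implied by prediction + residual (10.009 + 1.691)

    predicted = predict_poverty(hs_grad)
    residual = observed - predicted          # e_i = y_i - y_hat_i
    print(f"predicted: {predicted:.3f}")     # 9.986 with the rounded coefficients
    print(f"residual:  {residual:.3f}")      # 1.714 (1.691 with unrounded estimates)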

QUANTIFYING THE RELATIONSHIP

  • Correlation describes the strength of the linear association between two variables.

  • It takes values between -1 (perfect negative) and +1 (perfect positive).

  • A value of 0 indicates no linear association.
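
Correlation is straightforward to compute with software; a minimal sketch on made-up data (all numbers hypothetical):

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(60, 100, 200)                 # hypothetical HS grad rates
    y = 64.6 - 0.59 * x + rng.normal(0, 3, 200)   # hypothetical poverty rates

    R = np.corrcoef(x, y)[0, 1]   # Pearson correlation; always in [-1, 1]
    print(round(R, 2))            # strongly negative for this synthetic data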

GUESSING THE CORRELATION

  1. 0.70
  2. -1.40
  3. -0.68
  4. 0.02
  5. -0.30

ASSESSING THE CORRELATION

Which of the following has the strongest correlation, i.e., a correlation coefficient closest to +1 or -1?

Answer: (b), because correlation measures the strength of linear association.

Least squares regression

WHICH LINE IS THE BEST LINE?

We want a line that has small residuals. Two natural options (compared numerically in the sketch below):

  1. Minimize the sum of the magnitudes (absolute values) of the residuals: \[|e_1| + |e_2| + \dots + |e_n|\]
  2. Minimize the sum of squared residuals (least squares): \[e_1^2 + e_2^2 + \dots + e_n^2\]
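
Both criteria can be minimized numerically; a sketch on synthetic data (all numbers made up) that fits a line under each criterion:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # synthetic, roughly linear

    def sum_abs_resid(b):   # Option 1: sum of |residuals|
        return np.sum(np.abs(y - (b[0] + b[1] * x)))

    def sum_sq_resid(b):    # Option 2: sum of squared residuals
        return np.sum((y - (b[0] + b[1] * x)) ** 2)

    fit1 = minimize(sum_abs_resid, x0=[60.0, -0.5], method="Nelder-Mead")
    fit2 = minimize(sum_sq_resid, x0=[60.0, -0.5], method="Nelder-Mead")
    print("least absolute deviations:", fit1.x.round(3))
    print("least squares:            ", fit2.x.round(3))

For least squares, the minimizer also has a closed form (the slope and intercept formulas later in these notes), so in practice no numerical search is needed.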

WHY LEAST SQUARES?

  1. Most commonly used
  2. Easier to compute, both by hand and with software
  3. In many applications, a residual twice as large as another is usually more than twice as bad

THE LEAST SQUARES LINE

\[\underbrace{\hat{y}}_{predicted \ y}=\underbrace{\hat{\beta}_0}_{est. \ intercept}+\underbrace{\hat{\beta}_1}_{est. \ slope}\underbrace{x}_{explanatory \ variable}\]

Intercept:

  • Parameter: \(\beta_0\)
  • Point estimate: \(\hat{\beta}_0\)

Slope:

  • Parameter: \(\beta_1\)
  • Point estimate: \(\hat{\beta}_1\)

CONDITIONS FOR THE LEAST SQUARES LINE

  1. Linearity
  2. Nearly normal residuals
  3. Constant variability

CONDITIONS: (1) LINEARITY

  • The relationship between the response and explanatory variables should be linear.

  • Methods for fitting nonlinear relationships exist; see advanced statistics classes.

  • To assess whether this condition is met, check a residual plot (residuals vs. predicted values): it should show no apparent pattern.

CONDITIONS: (2) NEARLY NORMAL RESIDUALS

  • The residuals should be nearly normal.
  • This condition may not be satisfied when there are unusual observations that don’t follow the trend of the rest of the data.
  • To assess whether this condition is met, check a histogram or normal probability (Q-Q) plot of the residuals.

CONDITIONS: (3) CONSTANT VARIABILITY

  • The variability of points around the least squares line should be roughly constant.
  • This implies that the variability of residuals around the 0 line should be roughly constant as well.
  • Also called homoscedasticity.
  • To assess whether this condition is met, check using a residual plot.
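
All three conditions can be checked with a couple of residual plots; a minimal matplotlib/SciPy sketch on synthetic data (the data and fitted line are made up):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # synthetic data

    b1, b0 = np.polyfit(x, y, deg=1)   # least squares slope and intercept
    fitted = b0 + b1 * x
    residuals = y - fitted

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

    # (1) linearity and (3) constant variability: residuals vs. fitted values
    # should show no pattern and a band of roughly constant width around 0
    axes[0].scatter(fitted, residuals, s=10)
    axes[0].axhline(0, color="gray")
    axes[0].set(xlabel="fitted values", ylabel="residuals")

    # (2) nearly normal residuals: histogram and normal probability (Q-Q) plot
    axes[1].hist(residuals, bins=20)
    axes[1].set(xlabel="residuals")
    stats.probplot(residuals, dist="norm", plot=axes[2])

    plt.tight_layout()
    plt.show()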

SLOPE

The slope of the regression line can be calculated as

\[\hat{\beta}_1=\frac{s_y}{s_x}R\] where \(s_y\) and \(s_x\) are the sample standard deviations of \(y\) and \(x\), and \(R\) is the correlation between the two variables.

INTERCEPT

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that the least squares line always passes through (\(\bar{x}\), \(\bar{y}\)).

\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
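
These two formulas translate directly into code; a sketch (synthetic data again) that checks them against np.polyfit:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # made-up data

    R = np.corrcoef(x, y)[0, 1]
    b1 = np.std(y, ddof=1) / np.std(x, ddof=1) * R   # slope = (s_y / s_x) * R
    b0 = np.mean(y) - b1 * np.mean(x)                # line passes through (x_bar, y_bar)

    slope, intercept = np.polyfit(x, y, deg=1)       # least squares, for comparison
    print(b1, slope)         # agree up to floating point error
    print(b0, intercept)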

INTERPRETATION OF SLOPE AND INTERCEPT

  • Intercept: When x = 0, y is expected to equal the intercept. For the poverty model, a county with a 0% HS graduation rate would be expected to have a poverty rate of 64.594%.

  • Slope: For each unit increase in x, y is expected to change, on average, by the value of the slope. For the poverty model, each additional percentage point in the HS graduation rate is associated with an expected poverty rate 0.591 points lower.

These statements are not causal unless the study is a randomized controlled experiment.

PREDICTION

  • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction; it is done simply by plugging the value of x into the model equation.
  • There will be some uncertainty associated with the predicted value.

EXTRAPOLATION

  • Applying a model estimate to values outside the range of the original data is called extrapolation.
  • Sometimes the intercept is an extrapolation: here, x = 0 would be a county where no one graduates from high school, far below any observed HS graduation rate.
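
A quick sketch of the difference, using the coefficients fitted earlier in these notes:

    # Fitted line from these notes
    b0, b1 = 64.594, 0.591

    def predict_poverty(hs_grad):
        return b0 - b1 * hs_grad

    print(predict_poverty(92.4))   # ~10%: inside the observed range of the data
    print(predict_poverty(0.0))    # 64.594: an extrapolation; no county has a
                                   # 0% HS grad rate, so not meaningful on its own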

\(R^2\)

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\).
  • \(R^2\) is calculated as the square of the correlation coefficient.
  • It tells us what percent of variability in the response variable is explained by the model.
  • The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.
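
Both definitions give the same number for a simple linear model; a sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # made-up data

    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x

    r_squared = np.corrcoef(x, y)[0, 1] ** 2     # squared correlation
    ss_resid = np.sum((y - y_hat) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    print(r_squared, 1 - ss_resid / ss_total)    # equal: both are R^2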

REFERENCES

  • Diez et al. (2019) OpenIntro Statistics, Fourth Edition