SPRING 2020

Fitting a line, residuals, and correlation

MODELING NUMERICAL VARIABLES

In these notes, we will focus on quantifying the relationship between a

  • numerical response variable and

  • a numerical explanatory variable

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

The scatterplot below shows the relationship between HS graduate rate in all 3083 US counties and the percent of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

  • Response variable: percent in poverty
  • Explanatory variable: percent high school grad
  • Relationship: linear, negative, moderately strong

POVERTY VERSUS HIGH SCHOOL GRADUATE RATE

The linear model for predicting poverty from high school graduation rate in the US is \[\widehat{poverty}=64.594-0.591\times hs\_grad\] The “hat” is used to signify this is an estimate.

The high school graduate rate in Hampshire County, MA is 92.4%. What poverty level does the model predict for this county?

\[\widehat{poverty}=64.594-0.591\times 92.4\approx 9.99\]

The model predicts a poverty rate of roughly 10% for Hampshire County. (Carrying unrounded coefficient estimates gives the value 10.009 used below.)

WHICH LINE IS IT ANYWAY?

RESIDUALS

Residuals are the leftovers from the model fit:

\[Data = Fit + Residual\]

A residual is the difference between the observed value (\(y_i\)) and the predicted value (\(\hat{y}_i\)).

\[\underbrace{\hat{e}_i}_{residual}=\underbrace{y_i}_{data}-\underbrace{\hat{y}_i}_{fit}\]

  • Hampshire County, MA: residual \(= y - \hat{y} = 11.7 - 10.009 = 1.691\)
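
A minimal Python sketch of the prediction and residual calculations above. The observed poverty rate of 11.7% is implied by observed = prediction + residual; everything else comes from the fitted model:

    # Coefficients of the fitted line from these notes
    b0, b1 = 64.594, 0.591

    def predict_poverty(hs_grad):
        """Predicted percent of residents in poverty for a given HS grad rate."""
        return b0 - b1 * hs_grad

    hs_grad = 92.4    # Hampshire County, MA
    observed = 11.7   # implied by prediction + residual (10.009 + 1.691)

    predicted = predict_poverty(hs_grad)
    residual = observed - predicted          # e_i = y_i - y_hat_i
    print(f"predicted: {predicted:.3f}")     # 9.986 with the rounded coefficients
    print(f"residual:  {residual:.3f}")      # 1.714 (1.691 with unrounded estimates)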

QUANTIFYING THE RELATIONSHIP

  • Correlation describes the strength of the linear association between two variables.

  • It takes values between -1 (perfect negative) and +1 (perfect positive).

  • A value of 0 indicates no linear association.
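
Correlation is straightforward to compute with software; a minimal sketch on made-up data (all numbers hypothetical):

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.uniform(60, 100, 200)                 # hypothetical HS grad rates
    y = 64.6 - 0.59 * x + rng.normal(0, 3, 200)   # hypothetical poverty rates

    R = np.corrcoef(x, y)[0, 1]   # Pearson correlation; always in [-1, 1]
    print(round(R, 2))            # strongly negative for this synthetic data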

GUESSING THE CORRELATION

  1. 0.70
  2. -1.40
  3. -0.68
  4. 0.02
  5. -0.30

ASSESSING THE CORRELATION

Which of the following has the strongest correlation, i.e., a correlation coefficient closest to +1 or -1?

Answer: (b), because correlation measures the strength of linear association.

Least squares regression

WHICH LINE IS THE BEST LINE?

We want a line that has small residuals. Two natural options (compared numerically in the sketch below):

  1. Minimize the sum of the magnitudes (absolute values) of the residuals: \[|e_1| + |e_2| + \dots + |e_n|\]
  2. Minimize the sum of squared residuals (least squares): \[e_1^2 + e_2^2 + \dots + e_n^2\]
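
Both criteria can be minimized numerically; a sketch on synthetic data (all numbers made up) that fits a line under each criterion:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # synthetic, roughly linear

    def sum_abs_resid(b):   # Option 1: sum of |residuals|
        return np.sum(np.abs(y - (b[0] + b[1] * x)))

    def sum_sq_resid(b):    # Option 2: sum of squared residuals
        return np.sum((y - (b[0] + b[1] * x)) ** 2)

    fit1 = minimize(sum_abs_resid, x0=[60.0, -0.5], method="Nelder-Mead")
    fit2 = minimize(sum_sq_resid, x0=[60.0, -0.5], method="Nelder-Mead")
    print("least absolute deviations:", fit1.x.round(3))
    print("least squares:            ", fit2.x.round(3))

For least squares, the minimizer also has a closed form (the slope and intercept formulas later in these notes), so in practice no numerical search is needed.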

WHY LEAST SQUARES?

  1. Most commonly used
  2. Easier to compute, both by hand and with software
  3. In many applications, a residual twice as large as another is usually more than twice as bad

THE LEAST SQUARES LINE

\[\underbrace{\hat{y}}_{predicted \ y}=\underbrace{\hat{\beta}_0}_{est. \ intercept}+\underbrace{\hat{\beta}_1}_{est. \ slope}\underbrace{x}_{explanatory \ variable}\]

Intercept:

  • Parameter: \(\beta_0\)
  • Point estimate: \(\hat{\beta}_0\)

Slope:

  • Parameter: \(\beta_1\)
  • Point estimate: \(\hat{\beta}_1\)

CONDITIONS FOR THE LEAST SQUARES LINE

  1. Linearity
  2. Nearly normal residuals
  3. Constant variability

CONDITIONS: (1) LINEARITY

  • The relationship between the response and explanatory variables should be linear.

  • Methods for fitting nonlinear relationships exist; see advanced statistics classes.

  • To assess whether this condition is met, check a residual plot (residuals vs. predicted values): it should show no apparent pattern.

CONDITIONS: (2) NEARLY NORMAL RESIDUALS

  • The residuals should be nearly normal.
  • This condition may not be satisfied when there are unusual observations that don’t follow the trend of the rest of the data.
  • To assess whether this condition is met, check a histogram or normal probability (Q-Q) plot of the residuals.

CONDITIONS: (3) CONSTANT VARIABILITY

  • The variability of points around the least squares line should be roughly constant.
  • This implies that the variability of residuals around the 0 line should be roughly constant as well.
  • Also called homoscedasticity.
  • To assess whether this condition is met, check using a residual plot.
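
All three conditions can be checked with a couple of residual plots; a minimal matplotlib/SciPy sketch on synthetic data (the data and fitted line are made up):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # synthetic data

    b1, b0 = np.polyfit(x, y, deg=1)   # least squares slope and intercept
    fitted = b0 + b1 * x
    residuals = y - fitted

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

    # (1) linearity and (3) constant variability: residuals vs. fitted values
    # should show no pattern and a band of roughly constant width around 0
    axes[0].scatter(fitted, residuals, s=10)
    axes[0].axhline(0, color="gray")
    axes[0].set(xlabel="fitted values", ylabel="residuals")

    # (2) nearly normal residuals: histogram and normal probability (Q-Q) plot
    axes[1].hist(residuals, bins=20)
    axes[1].set(xlabel="residuals")
    stats.probplot(residuals, dist="norm", plot=axes[2])

    plt.tight_layout()
    plt.show()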

SLOPE

The slope of the regression line can be calculated as

\[\hat{\beta}_1=\frac{s_y}{s_x}R\] where \(s_y\) and \(s_x\) are the sample standard deviations of \(y\) and \(x\), and \(R\) is the correlation between the two variables.

INTERCEPT

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that the least squares line always passes through (\(\bar{x}\), \(\bar{y}\)).

\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
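
These two formulas translate directly into code; a sketch (synthetic data again) that checks them against np.polyfit:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # made-up data

    R = np.corrcoef(x, y)[0, 1]
    b1 = np.std(y, ddof=1) / np.std(x, ddof=1) * R   # slope = (s_y / s_x) * R
    b0 = np.mean(y) - b1 * np.mean(x)                # line passes through (x_bar, y_bar)

    slope, intercept = np.polyfit(x, y, deg=1)       # least squares, for comparison
    print(b1, slope)         # agree up to floating point error
    print(b0, intercept)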

INTERPRETATION OF SLOPE AND INTERCEPT

  • Intercept: When x = 0, y is expected to equal the intercept. For the poverty model, a county with a 0% HS graduation rate would be expected to have a poverty rate of 64.594%.

  • Slope: For each unit increase in x, y is expected to change, on average, by the value of the slope. For the poverty model, each additional percentage point in the HS graduation rate is associated with an expected poverty rate 0.591 points lower.

These statements are not causal unless the study is a randomized controlled experiment.

PREDICTION

  • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction; it is done simply by plugging the value of x into the model equation.
  • There will be some uncertainty associated with the predicted value.

EXTRAPOLATION

  • Applying a model estimate to values outside the range of the original data is called extrapolation.
  • Sometimes the intercept is an extrapolation: here, x = 0 would be a county where no one graduates from high school, far below any observed HS graduation rate.
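
A quick sketch of the difference, using the coefficients fitted earlier in these notes:

    # Fitted line from these notes
    b0, b1 = 64.594, 0.591

    def predict_poverty(hs_grad):
        return b0 - b1 * hs_grad

    print(predict_poverty(92.4))   # ~10%: inside the observed range of the data
    print(predict_poverty(0.0))    # 64.594: an extrapolation; no county has a
                                   # 0% HS grad rate, so not meaningful on its own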

\(R^2\)

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\).
  • \(R^2\) is calculated as the square of the correlation coefficient.
  • It tells us what percent of variability in the response variable is explained by the model.
  • The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.
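
Both definitions give the same number for a simple linear model; a sketch on synthetic data:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(60, 100, 100)
    y = 64.6 - 0.59 * x + rng.normal(0, 2, 100)   # made-up data

    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x

    r_squared = np.corrcoef(x, y)[0, 1] ** 2     # squared correlation
    ss_resid = np.sum((y - y_hat) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    print(r_squared, 1 - ss_resid / ss_total)    # equal: both are R^2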

REFERENCES

  • Diez et al. (2019) OpenIntro Statistics, Fourth Edition