SPRING 2020
In these notes, we will focus on quantifying the relationship between a numerical response variable and a numerical explanatory variable.
The scatterplot below shows the relationship between HS graduate rate in all 3083 US counties and the percent of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).
The linear model for predicting poverty from high school graduation rate in the US is \[\widehat{poverty}=64.594-0.591\times hs\_grad\] The “hat” is used to signify this is an estimate.
The high school graduate rate in Hampshire County, MA is 92.4%. What poverty level does the model predict for this county?
64.594 - 0.591*92.4 = 9.986, i.e., a predicted poverty rate of about 10%.
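This prediction is just the fitted line evaluated at the county's graduation rate. A minimal sketch in plain Python (the function name `predict_poverty` is ours, not from any library):

```python
# Fitted model from the notes: poverty-hat = 64.594 - 0.591 * hs_grad
intercept, slope = 64.594, -0.591

def predict_poverty(hs_grad):
    """Return the model's predicted % in poverty for a given HS grad rate."""
    return intercept + slope * hs_grad

# Hampshire County, MA: hs_grad = 92.4%
print(round(predict_poverty(92.4), 3))  # 9.986
```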
Residuals are the leftovers from the model fit:
\[Data = Fit + Residual\]
A residual is the difference between the observed value (\(y_i\)) and the predicted value (\(\hat{y}_i\)).
\[\underbrace{\hat{e}_i}_{residual}=\underbrace{y_i}_{data}-\underbrace{\hat{y}_i}_{fit}\]
For Hampshire County, the residual is \(\hat{e} = y - \hat{y} = 1.691\): the observed poverty rate is about 1.7 percentage points higher than the model predicts.
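The residual calculation can be sketched the same way. Note that the observed poverty rate used below is a hypothetical stand-in for illustration, not the actual Hampshire County figure:

```python
# Residual = observed - predicted. Positive means the model under-predicted.
predicted = 64.594 - 0.591 * 92.4   # model prediction for Hampshire County

def residual(observed, predicted):
    """Leftover after the fit: data = fit + residual."""
    return observed - predicted

# Hypothetical observed poverty rate, for illustration only:
print(round(residual(11.7, predicted), 3))
```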
Correlation describes the strength of the linear association between two variables.
It takes values between -1 (perfect negative) and +1 (perfect positive).
A value of 0 indicates no linear association.
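The correlation coefficient can be computed directly from this definition. A minimal sketch in plain Python (the data are made up for illustration; `pearson_r` is our own helper, not a library function):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: strength of the *linear* association, in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# A perfectly linear decreasing relationship gives r = -1.
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0
```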
Which of the following has the strongest correlation, i.e., a correlation coefficient closest to +1 or -1? Answer: (b), the scatterplot with the tightest linear association - correlation measures only linear association.
We want a line that has small residuals. The most common criterion is least squares: choose the line that minimizes the sum of the squared residuals, \(\sum_{i=1}^{n}\hat{e}_i^2\).
\[\underbrace{\hat{y}}_{predicted \ y}=\underbrace{\hat{\beta}_0}_{est. \ intercept}+\underbrace{\hat{\beta}_1}_{est. \ slope}\underbrace{x}_{explanatory \ variable}\]
Intercept: \(\hat{\beta}_0\), the value of \(\hat{y}\) when \(x = 0\).
Slope: \(\hat{\beta}_1\), the change in \(\hat{y}\) for a one-unit increase in \(x\).
The relationship between the response and explanatory variables should be linear.
Methods to fit nonlinear relationships exist - see advanced statistics classes for this.
To assess whether this condition is met, check the residual plot: the residuals should scatter randomly around zero, with no curved pattern.
The slope of the regression line can be calculated as
\[\hat{\beta}_1=\frac{s_y}{s_x}R\] where \(s_y\) and \(s_x\) are the standard deviations associated with \(y\) and \(x\) and \(R\) is the correlation.
The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (\(\bar{x}\), \(\bar{y}\)).
\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
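These two formulas are enough to fit the line from the summary statistics alone. A sketch in plain Python (the toy data are made up for illustration, and `fit_line` is our own helper, not a library function):

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares fit via b1 = R * sy/sx and b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    r = sum((xi - xbar) * (yi - ybar)
            for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx           # slope
    b0 = ybar - b1 * xbar      # intercept
    return b0, b1

# Toy data (made up):
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = fit_line(x, y)
# The fitted line always passes through (xbar, ybar):
print(abs((b0 + b1 * 3.0) - sum(y) / 5) < 1e-9)  # True
```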
Intercept: When x = 0, y is expected to equal the intercept.
Slope: For each one-unit increase in x, y is expected to change on average by the slope.
These statements are not causal unless the study is a randomized controlled experiment.