Reasonably large \(n\) or normality assumption: \(\epsilon_i \sim N(0, \sigma^2)\)
Computation: Least Squares Estimate
The goal is to estimate the parameters \(\beta_0\) and \(\beta_1\).
The most common method is the least squares estimate.
The least squares estimate minimizes the sum of squared residuals, where a residual is the difference between an observed value and its predicted value.
In other words, among all possible lines we can pass through the data, we choose the one with the smallest Residual Sum of Squares. This line is called the least-squares regression line.
Computation: Least Squares Estimate
Least Squares Estimate: we look for \(\hat\beta_0\) and \(\hat\beta_1\) that minimize \(\sum_{i=1}^n (y_i - \beta_0 -\beta_1 x_i)^2\).
Other notations: \[b_0=\hat \beta_0, b_1=\hat\beta_1\]
The process of finding \((b_0, b_1)\) involves solving two linear equations (the normal equations) jointly; the solution is \[b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad b_0 = \bar{y} - b_1 \bar{x}\]
Residual sum of squares (RSS): \(RSS=\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i -\hat y_i)^2\)
An unbiased estimate of \(\sigma^2\) is \(s^2\): \[s^2= \frac{RSS}{n-2}\]
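To make the computation concrete, here is a minimal sketch with simulated data (all names are illustrative) that computes \(b_0\), \(b_1\), RSS, and \(s^2\) from the formulas above and checks them against lm():
Code
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)                                    # simulated data
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
rss <- sum((y - b0 - b1 * x)^2)                                  # residual sum of squares
s2 <- rss / (length(x) - 2)                                      # unbiased estimate of sigma^2
c(b0, b1)
coef(lm(y ~ x))                                                  # should match (b0, b1)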
Standard Error and Confidence Interval
The standard error of \(\hat\beta_1\) is \[se(\hat \beta_1)=\frac{\sqrt{s^2}}{\sqrt{\sum (x_i - \bar{x})^2}}\]
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is \[\hat\beta_1 \pm t_{\alpha / 2, n - 2} se(\hat\beta_1)\]
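As a sketch (same simulated data as above), the standard error and 95% confidence interval can be computed from these formulas and checked against confint():
Code
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
s2 <- sum(resid(fit)^2) / (length(x) - 2)           # s^2 = RSS / (n - 2)
se_b1 <- sqrt(s2) / sqrt(sum((x - mean(x))^2))      # se(beta1-hat)
coef(fit)[2] + c(-1, 1) * qt(0.975, df = length(x) - 2) * se_b1
confint(fit)["x", ]                                 # should match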
Ex 1: t and p-value
The estimated slope (\(b_1\)) is -0.029, its standard error is 0.001818, and the p-value is \(<2e-16\); the linear relationship is significant at level 0.05.
Code
# fit the simple linear regression of hippocampal volume on age
obj.age.hippo <- lm(hippo ~ age, data = alzheimer_healthy_old)
summary(obj.age.hippo)
Call:
lm(formula = hippo ~ age, data = alzheimer_healthy_old)
Residuals:
Min 1Q Median 3Q Max
-2.6858 -0.4791 -0.0159 0.4793 2.7005
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.420392 0.126407 66.61 <2e-16 ***
age -0.029090 0.001818 -16.00 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7131 on 1477 degrees of freedom
Multiple R-squared: 0.1478, Adjusted R-squared: 0.1472
F-statistic: 256.2 on 1 and 1477 DF, p-value: < 2.2e-16
Ex 1: Confidence Interval
The estimated slope (\(b_1\)) is -0.029.
A 95% confidence interval is (-0.033, -0.026).
Code
confint(obj.age.hippo)
2.5 % 97.5 %
(Intercept) 8.17243458 8.66834865
age -0.03265501 -0.02552456
Estimation vs Prediction
Question 1: What is the estimated mean hippocampal volume for 46-year-old women?
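A sketch of how predict() distinguishes the two tasks: interval = "confidence" gives an interval for the estimated mean at age 46, while interval = "prediction" gives a (wider) interval for a new individual. This uses the age-only model obj.age.hippo from Ex 1; answering the question specifically for women would require a model that also includes female.
Code
new <- data.frame(age = 46)
predict(obj.age.hippo, newdata = new, interval = "confidence")   # estimated mean
predict(obj.age.hippo, newdata = new, interval = "prediction")   # new observation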
Two Sample t-test
data: hippo by female
t = 11.938, df = 1477, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.4029661 0.5614304
sample estimates:
mean in group 0 mean in group 1
6.735692 6.253494
Code
# Equivalently, you can use:
# t.test(alzheimer_healthy_old$hippo[alzheimer_healthy_old$female == 1],
#        alzheimer_healthy_old$hippo[alzheimer_healthy_old$female == 0],
#        var.equal = TRUE)
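The same comparison can also be run as a simple linear regression: with a binary X, the t-test on the slope from lm() is equivalent to the equal-variance two-sample t-test above (a sketch using the same data frame):
Code
# The slope is the difference in group means; its t-test matches the pooled t-test
summary(lm(hippo ~ female, data = alzheimer_healthy_old))$coefficients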
Categorical Variables in Linear Regression
Categorical variables can be included in linear regression models as factors.
Consider the diagnosis variable in the alzheimer data, which has three levels: 0 (healthy), 1 (MCI), and 2 (AD).
What happens if we use it as a numeric variable? (This is often not appropriate.)
What happens if we use it as a categorical variable? (This is the most appropriate way in this case.)
Ex3: When mistakenly treating diagnosis as a numerical variable
When using the numerical coding, the model assumes a linear relationship between diagnosis and hippocampal volume:
It assumes that the mean difference between MCI and healthy is the same as the mean difference between AD and MCI (both equal to \(\beta_1\)).
This might not be the most appropriate way.
Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>%
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>%
  # mutate(diagnosis = as.factor(diagnosis), female = as.factor(female), hippo = lhippo + rhippo)
  mutate(hippo = lhippo + rhippo)
# Note that diagnosis is not treated as a categorical variable (factor) here;
# it is treated as a numerical variable
summary(lm(hippo ~ diagnosis, data = alzheimer_data))
Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)
Residuals:
Min 1Q Median 3Q Max
-4.5296 -0.5940 -0.0046 0.5760 3.4734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.41399 0.02154 297.7 <2e-16 ***
diagnosis -0.46342 0.02106 -22.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8761 on 2698 degrees of freedom
Multiple R-squared: 0.1522, Adjusted R-squared: 0.1518
F-statistic: 484.2 on 1 and 2698 DF, p-value: < 2.2e-16
Ex3: When mistakenly treating diagnosis as a numerical variable
Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>%
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>%
  mutate(female = as.factor(female), hippo = lhippo + rhippo)

# Fit model treating diagnosis as numerical (it is still a numeric code here)
fit <- lm(hippo ~ diagnosis, data = alzheimer_data)
coefs <- coef(fit)

# Model-based means for each group
model_means <- data.frame(
  diagnosis = c("Healthy", "MCI", "AD"),
  hippo = c(coefs[1], coefs[1] + coefs[2], coefs[1] + 2 * coefs[2]),
  label = c("b0", "b0 + b1", "b0 + 2*b1")
)

alzheimer_data <- mutate(alzheimer_data,
                         diagnosis = factor(diagnosis, levels = c("0", "1", "2"),
                                            labels = c("Healthy", "MCI", "AD")))

# Plot
ggplot(alzheimer_data, aes(x = diagnosis, y = hippo)) +
  geom_boxplot(fill = "#dceeff", outlier.shape = NA) +
  # Add model-predicted means and connect them
  geom_point(data = model_means, aes(y = hippo), color = "red", size = 3) +
  geom_line(data = model_means, aes(y = hippo, group = 1), color = "black", linewidth = 1) +
  # Label model means
  geom_text(data = model_means, aes(x = 1:3 + 0.2, y = hippo + 0.5, label = label),
            color = "red", size = 4) +
  labs(
    x = "Diagnosis",
    y = "Total Hippocampal Volume",
    title = "Linear Model Estimates with Diagnosis as Numerical"
  ) +
  theme_minimal()
Use diagnosis as a Categorical Variable in Linear Regression
Instead, we can treat diagnosis as a categorical variable (factor).
R automatically does dummy coding, creating an indicator variable for all but one level; the remaining level is the reference group. In the output below the indicators are diagnosisHealthy and diagnosisMCI, so AD is the reference group (the levels appear to be ordered alphabetically: AD, Healthy, MCI).
For hippocampal volume,
the estimated coefficients are the mean differences between each diagnosis group and the reference group (AD).
the estimated mean for AD (the reference) is 5.537, for healthy is 5.537 + 0.895 = 6.432, and for MCI is 5.537 + 0.323 = 5.860.
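The output below is consistent with a fit along these lines (a sketch only; it assumes diagnosis still holds the numeric codes 0/1/2, e.g., from a fresh read of the CSV, and the exact preprocessing is assumed):
Code
# Recode diagnosis as a labeled factor; as.factor() orders levels alphabetically
# (AD, Healthy, MCI), so AD becomes the reference group, matching the output below
alzheimer_data <- alzheimer_data %>%
  mutate(diagnosis = as.factor(ifelse(diagnosis == 0, "Healthy",
                               ifelse(diagnosis == 1, "MCI", "AD"))))
summary(lm(hippo ~ diagnosis, data = alzheimer_data))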
Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)
Residuals:
Min 1Q Median 3Q Max
-4.5797 -0.5862 -0.0074 0.5770 3.4233
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.53731 0.03720 148.849 < 2e-16 ***
diagnosisHealthy 0.89476 0.04339 20.621 < 2e-16 ***
diagnosisMCI 0.32278 0.05131 6.291 3.66e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8748 on 2697 degrees of freedom
Multiple R-squared: 0.155, Adjusted R-squared: 0.1544
F-statistic: 247.3 on 2 and 2697 DF, p-value: < 2.2e-16
Ex3: Use diagnosis as a Categorical Variable in Linear Regression
Linear Regression and ANOVA
For a binary X (Ex2), we have seen how to use linear regression to perform a two-sample t-test.
For a categorical X (Ex3), linear regression provides information about the group means. If we would like to test whether the categorical variable diagnosis is associated with hippo, we can use either linear regression or ANOVA.
ANOVA using the aov function
Conduct ANOVA using linear regression
Two ways to perform ANOVA using linear regression:
Using the lm function to fit the model and then calling anova() on the fitted object
Reading the overall F-statistic from summary(lm(...)); when the factor is the only predictor, this is the same F-test (see the sketch below)
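A sketch of both routes (assuming diagnosis is a factor in alzheimer_data); they report the same F-statistic:
Code
summary(aov(hippo ~ diagnosis, data = alzheimer_data))   # classic ANOVA table
anova(lm(hippo ~ diagnosis, data = alzheimer_data))      # same F-test via lm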
Interpretation of the coefficients in a multiple regression
In this multiple regression, the fitted model is \[\hat{y} = 1.805 + 0.071 \times \text{height} - 0.087 \times \text{female}\]
The intercept 1.805 is the estimated mean hippocampal volume for a man whose height is 0 inches (not a meaningful quantity in itself, but it is the intercept of the fitted model).
The slope 0.071 is the estimated increase in hippocampal volume for each one-inch increase in height, holding gender constant.
The slope -0.087 is the estimated difference in mean hippocampal volume between women and men of the same height.
Fitted Lines
There are two fitted lines:
men: \(y = 1.805 + 0.071 \times \text{height}\)
women: \(y = (1.805 - 0.087) + 0.071 \times \text{height} = 1.718 + 0.071 \times \text{height}\)
Code
# Plot the additive model's fitted (parallel) lines. Note: geom_smooth(method = "lm")
# within each color group would fit a separate slope per gender, which is not the
# additive model, so we plot predictions from the fitted model instead.
fit.add <- lm(hippo ~ height + female, data = alzheimer_healthy_old)
alzheimer_healthy_old$pred <- predict(fit.add)
ggplot(alzheimer_healthy_old, aes(x = height, y = hippo, color = female)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(y = pred), linewidth = 1) +
  labs(
    x = "Height (in)",
    y = "Hippocampal Volume",
    color = "Gender",
    title = "Fitted Lines from hippo ~ height + female"
  ) +
  theme_minimal()
Model Diagnostics and Outliers
The typical assumptions of linear regression models are
Linearity
Independent observations
Constant variance and normality of the error term (residuals)
The first two assumptions are usually justified by our domain knowledge, our study design, and simple visualization of data.
To investigate the validity of the third assumption, we can use diagnostic plots that visualize the residuals (see the sketch below).
Outliers are points that do not follow the general pattern of the majority of the data; visualization methods are often used to detect them.
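A minimal sketch of the standard residual diagnostics in R, using the additive model from Ex 4 as an example (object name illustrative):
Code
fit.add <- lm(hippo ~ height + female, data = alzheimer_healthy_old)
par(mfrow = c(2, 2))
plot(fit.add)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage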
The two fitted lines in Ex4 are parallel to each other.
This is because we assumed an additive effect, i.e., gender and height contribute to hippocampal volume in an additive manner. In other words, the effect of height does not depend on gender.
In other examples, the effect of one explanatory variable on the response depends on one or more other explanatory variables. In this case, we need to include an interaction term in the model.
Ex 5: A Model with Interactions
Let \(y\) be hippocampal volume.
Example: \[y=\beta_0 + \beta_{age} age + \beta_f f + \beta_{age\times f} age\times f + \epsilon\] where \(f\) means female.
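The coefficient table below comes from a fit of this form (a sketch; the object name is illustrative, the data frame follows earlier examples, and female is assumed to be a factor):
Code
obj.int <- lm(hippo ~ age * female, data = alzheimer_healthy_old)
summary(obj.int)$coefficients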
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.182369560 0.210806589 43.558266 3.789072e-267
age -0.035269930 0.003008468 -11.723553 2.059224e-30
female1 -1.041026159 0.255296507 -4.077714 4.790840e-05
age:female1 0.007708627 0.003656786 2.108034 3.519639e-02
The estimated brain shrinking speed for men is 0.035 cc per year.
The estimated brain shrinking speed for women is \(|-0.035 + 0.0077| \approx 0.0273\) cc per year.
The interaction term indicates that the difference in brain shrinking speed between men and women is 0.0077 cc per year, and this difference is statistically significant at level 0.05.
Transform \(X\) and \(Y\)
Interpret transformed data
Weight is a little bit skewed. Regress hippocampal volume on log(weight). How to interpret the results?
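The coefficient table below comes from a fit of this form (a sketch; the data-frame name is assumed):
Code
summary(lm(hippo ~ log(weight), data = alzheimer_data))$coefficients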
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.214054 0.46579899 2.60639 9.242282e-03
log(weight) 1.019022 0.09111367 11.18407 6.237857e-28
\(y = \beta_0 + \beta_1 \log(x) + \epsilon\)
The estimated change in \(y\) is \(b_1 = 1.019\) when \(\log(x)\) increases by 1, i.e., from \(\log(x)\) to \(\log(x) + 1 = \log(x) + \log(e) = \log(ex)\).
Converting this to changes in \(x\): \(b_1 = 1.019\) is the estimated change in \(y\) (hippo volume) when \(x\) (weight) increases to \(ex\), i.e., when \(x\) increases by \(\frac{ex - x}{x} = e - 1 \approx 1.72 = 172\%\).
Depending on the context, we might be interested in the estimated change in \(y\) for a \(1\%\) increase in \(x\): \(b_1 \times (\log(1.01) - \log(1)) \approx 0.01\, b_1\).
Similarly, the estimated change in \(y\) for a \(10\%\) increase in \(x\) is \(b_1 \times (\log(1.10) - \log(1)) \approx 0.095\, b_1\).
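A quick numerical check of these approximations, with \(b_1\) taken from the output above:
Code
b1 <- 1.019022
b1 * log(1.01)   # change in y for a 1% increase in x:  ~0.01 * b1
b1 * log(1.10)   # change in y for a 10% increase in x: ~0.095 * b1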
\(\log(y) = \beta_0 + \beta_1 x + \epsilon\)
Regress log(weight) on height. How to interpret the results?
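In this model a one-unit increase in \(x\) changes \(\log(y)\) by \(\beta_1\), so \(y\) is multiplied by \(e^{\beta_1}\) (roughly a \(100 \times \beta_1\%\) change when \(\beta_1\) is small). A sketch, with the data-frame name assumed:
Code
fit.logw <- lm(log(weight) ~ height, data = alzheimer_data)
b1 <- coef(fit.logw)["height"]
exp(b1)   # multiplicative change in weight per one-unit increase in height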