COSMOS logo

Linear Regression

Zhaoxia Yu

Load packages, read data

Code
library(ggplot2)
library(tidyverse)


alzheimer_data <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% 
  mutate(diagnosis = as.factor(diagnosis), 
         female = as.factor(female),
         hippo=lhippo+rhippo)


alzheimer_healthy <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% 
  mutate(hippo=lhippo+rhippo, 
         diagnosis = as.factor(diagnosis), 
         female = as.factor(female)) %>% filter(diagnosis==0)


#define the subset of older (>45) healthy subjects
alzheimer_healthy_old <- alzheimer_healthy %>% filter(age>45)

Outline

  • Introduction
    • Linear Relationship
    • Correlation and Linear Regression
  • Linear Regression
    • The Response Variable
    • One Explanatory Variable
      • Ex1. A Numerical X: hippo ~ age
      • Ex2. A Binary X: hippo ~ female
      • Ex3. A Categorical X: hippo ~ diagnosis

Outline

  • Multiple Linear Regression
    • Motivating Example
    • Can the difference be explained by height?
  • Outliers and Model Diagnostics
  • Advanced Topics
    • Interactions
    • Transform \(X\) and \(Y\)

Introduction

Review: Example 6 of Day 4

  • Research Question: Is hippocampal volume correlated with age in healthy adults?

  • Null Hypothesis: hippocampal volume and age are not correlated.

  • Alternative Hypothesis: hippocampal volume and age are correlated.

\[H_0: \rho=0 \mbox{ vs } H_1: \rho\not=0\]

  • Method: correlation test or, more generally, linear regression.

Linear Relationship

Code
#plot(alzheimer_healthy$age, alzheimer_healthy$hippo, xlab="age", ylab="hippo")

ggplot(alzheimer_healthy, aes(x=age, y=hippo)) +
  geom_point() + 
  labs(x = "age", y="hippo volume", title="hippo volume vs age")
  • The relationship is roughly (negatively) linear except for age <45.

Linear Relationship

Code
ggplot(alzheimer_healthy, aes(x=lhippo, y=rhippo)) +
  geom_point() + 
  labs(x = "left hippo", y="right hippo", title="hippo volume")

Strong vs Weak Linear Relationship

Code
par(mfrow=c(1,2))
plot(alzheimer_healthy$lhippo, alzheimer_healthy$rhippo,
     xlab = "left hippo", ylab="right hippo", main="hippo volume")
plot(alzheimer_healthy$weight, alzheimer_healthy$hippo,
     xlab = "weight", ylab="hippo", main="hippo vs weight")

Correlation vs Linear Regression

  • Correlation is a measure of the strength of linear relationship between two continuous variables.
Code
print("correlation between left and right hippo volume")
[1] "correlation between left and right hippo volume"
Code
cor(alzheimer_healthy$lhippo, alzheimer_healthy$rhippo)
[1] 0.8626165
Code
print("correlation between weight and hippo volume")
[1] "correlation between weight and hippo volume"
Code
cor(alzheimer_healthy$weight, alzheimer_healthy$hippo)
[1] 0.270462

Why do we need linear regression

  • Interpretation: What is the expected decrease in hippocampal volume with a one-year increase in age?

  • Account for covariates using multiple linear regression. Adjusting for covariates is particularly important in observational studies.

    • E.g., the association between obesity and gender might depend on other factors, such as ethnicity, country, income, etc.
  • Prediction: What is the predicted hippocampal volume for a 46-year-old healthy woman?

Linear Regression

The Response Variable

  • In a linear model, the response variable is the variable that we are trying to predict or explain.

  • A response variable is also known as a dependent variable.

  • It is the outcome or the variable that is being studied and modeled as a function of one or more explanatory (independent) variables.

  • We often use \(y\) to represent the response variable.

Simple Linear Regression (SLR)

Simple Linear Regression

  • Simple linear regression is the regression model with only one explanatory variable.

  • It can be used to assess the linear relationship between two variables, e.g., age and hippocampal volume in healthy older (45+) subjects.

Code
ggplot(alzheimer_healthy_old, aes(x=age, y=hippo)) +
  geom_point() + 
  labs(x = "age", y="hippo", title="hippo volume vs age in healthy older subjects (45+)")

The Model

  • The linear model is represented as:

\[y=\beta_0 + \beta_1 x + \epsilon\]

  • \(y\): Response variable (dependent variable)

  • \(x\): Explanatory variable (independent variable)

  • \(\beta_0\): Intercept (constant term)

  • \(\beta_1\): Slope (rate of change)

  • \(\epsilon\): Error term (residuals)

  • It is also common to add subject indices \[y_i=\beta_0 + \beta_1 x_i + \epsilon_i, i=1, \cdots, n\]

Model Assumptions

  • \(y_i=\beta_0 + \beta_1 x_i + \epsilon_i, i=1, \cdots, n\)
  • The \(n\) observations are independent
  • Reasonably large \(n\) or normality assumption: \(\epsilon_i \sim N(0, \sigma^2)\)
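As a quick sanity check of the model assumptions, we can simulate data that satisfy them and verify that `lm` recovers the true coefficients. This is a hypothetical simulation sketch (toy values, not the alzheimer data):

```r
# Generate y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2), then fit lm()
set.seed(1)
n <- 500
x <- runif(n, 45, 90)                     # ages, for illustration
y <- 8.4 - 0.03 * x + rnorm(n, sd = 0.7)  # true beta0 = 8.4, beta1 = -0.03
coef(lm(y ~ x))                           # estimates should be close to (8.4, -0.03)
```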

Computation: Least Squares Estimate

  • The goal is to estimate the parameters \(\beta_0\) and \(\beta_1\).
  • The most common method is the least squares estimate.
  • The least squares estimate minimizes the sum of squared residuals, which are the differences between the observed values and the predicted values.
  • In other words, among all possible lines we can pass through the data, we choose the one with the smallest Residual Sum of Squares. This line is called the least-squares regression line.

Computation: Least Squares Estimate

  • Least Squares Estimate: looking for \(\hat\beta_0\) and \(\hat\beta_1\) that minimizes \(\sum_{i=1}^n (y_i - \beta_0 -\beta_1 x_i)^2\).

  • Other notations: \[b_0=\hat \beta_0, b_1=\hat\beta_1\]

  • The process of finding \((b_0, b_1)\) involves solving two linear equations jointly.

    • \(b_0=\hat\beta_0=\bar{y}-b_1\bar{x}\)
    • \(b_1=\hat\beta_1=\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i- \bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\)
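The two formulas above can be checked directly in R. A small sketch with made-up numbers (not the lecture data), comparing the hand-computed estimates against `lm`:

```r
# Toy data to verify the least-squares formulas against lm()
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)   # hand-computed estimates
coef(lm(y ~ x))       # lm() returns the same values
```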

Interpretation of Simple Linear Regression

Ex 1: hippo vs age

Code
#define an lm object
obj.age.hippo=lm(hippo~age, data=alzheimer_healthy_old)
summary(obj.age.hippo) 

Call:
lm(formula = hippo ~ age, data = alzheimer_healthy_old)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6858 -0.4791 -0.0159  0.4793  2.7005 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.420392   0.126407   66.61   <2e-16 ***
age         -0.029090   0.001818  -16.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7131 on 1477 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1472 
F-statistic: 256.2 on 1 and 1477 DF,  p-value: < 2.2e-16

Ex 1: The Fitted Line

  • The fitted line is \(hippo = 8.42 - 0.029 \times age\)
Code
ggplot(alzheimer_healthy_old, aes(x=age, y=hippo)) +
  geom_point() + 
  labs(x = "age", y="hippo", title="hippo vs age")+
   geom_smooth(method='lm', se=FALSE)

Ex 1: Interpret the Coefficients

  • Interpretation of \(b_0=8.42\)
    • mathematically, it is the estimated hippocampal volume for a person at age 0.
    • statistically, this interpretation involves extrapolation, which should be avoided. Rationale:
      • The linear trend we observed is for healthy subjects who are 45 or older.
      • It is dangerous to assume that the same linear trend holds near age 0.
  • Interpretation of \(b_1= - 0.029\): for each additional year of age, hippocampal volume decreases by 0.029 (cm\(^3\)) on average.

Residuals

  • \(e_i = y_i - \hat y_i\) where \(\hat y_i = \hat\beta_0 + \hat \beta_1 x_i\)
  • Residual sum of squares (RSS): \(RSS=\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i -\hat y_i)^2\)
  • An unbiased estimate of \(\sigma^2\) is \(s^2\): \[s^2= \frac{RSS}{n-2}\]
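To see these definitions in action, here is a sketch using R's built-in cars data (chosen only because it ships with R); \(\sqrt{s^2}\) matches the "residual standard error" reported by summary():

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data, for illustration
e   <- residuals(fit)                 # e_i = y_i - yhat_i
RSS <- sum(e^2)                       # residual sum of squares
s2  <- RSS / (nrow(cars) - 2)         # unbiased estimate of sigma^2
sqrt(s2)                              # the residual standard error
sigma(fit)                            # same value, extracted directly
```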

Standard Error and Confidence Interval

  • The standard error of \(\hat\beta_1\) is \[se(\hat \beta_1)=\frac{\sqrt{s^2}}{\sqrt{\sum (x_i - \bar{x})^2}}\]
  • A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is \[\hat\beta_1 \pm t_{\alpha / 2, n - 2} se(\hat\beta_1)\]
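A sketch of the confidence-interval formula, again on the built-in cars data, compared against confint():

```r
fit   <- lm(dist ~ speed, data = cars)        # illustrative built-in data
b1    <- coef(fit)["speed"]
se1   <- summary(fit)$coefficients["speed", "Std. Error"]
tcrit <- qt(0.975, df = nrow(cars) - 2)       # t_{alpha/2, n-2} for 95%
b1 + c(-1, 1) * tcrit * se1                   # hand-computed 95% CI
confint(fit)["speed", ]                       # matches confint()
```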

Ex 1: t and p-value

  • The estimated slope (\(b_1\)) is -0.029, its standard error is 0.001818, and the p-value is \(<2e-16\); the linear relationship is significant at level 0.05.
Code
summary(obj.age.hippo)

Call:
lm(formula = hippo ~ age, data = alzheimer_healthy_old)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6858 -0.4791 -0.0159  0.4793  2.7005 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.420392   0.126407   66.61   <2e-16 ***
age         -0.029090   0.001818  -16.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7131 on 1477 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1472 
F-statistic: 256.2 on 1 and 1477 DF,  p-value: < 2.2e-16

Ex 1: Confidence Interval

  • The estimated slope (\(b_1\)) is -0.029.
  • A 95% confidence interval is (-0.033, -0.026).
Code
confint(obj.age.hippo)
                  2.5 %      97.5 %
(Intercept)  8.17243458  8.66834865
age         -0.03265501 -0.02552456

Estimation vs Prediction

  • Question 1: What is the estimated mean hippocampal volume for 46-year-old women?
Code
predict(obj.age.hippo, newdata = data.frame(age=46), interval = 'confidence')
       fit     lwr      upr
1 7.082262 6.99322 7.171303
  • Question 2: What is the predicted hippocampal volume for a 46-year-old woman?
Code
predict(obj.age.hippo, newdata = data.frame(age=46), interval = 'predict')
       fit      lwr      upr
1 7.082262 5.680708 8.483815
  • Why is the prediction interval wider than the estimation interval?
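One way to see the answer: both intervals are centered at the same fitted value, but the prediction interval must also cover the noise \(\epsilon\) of a new individual observation, so its squared standard error contains an extra \(s^2\) term:

\[se_{mean}^2 = s^2\left(\frac{1}{n} + \frac{(x_0-\bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right), \qquad se_{pred}^2 = s^2\left(1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right)\]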

Ex 2: A Binary X

  • Now, consider the relationship between hippocampal volume and gender (the `female` variable in the data).
Code
ggplot(alzheimer_healthy_old, aes(x=female, y=hippo)) +
  geom_point() + 
  labs(x = "female", y="hippo", title="hippo vs female")+
   geom_smooth(method='lm', se=FALSE)

Ex2: Boxplots by gender are better

Code
ggplot(alzheimer_healthy_old, aes(x=female, y=hippo, fill=female)) +
  geom_boxplot() + 
  labs(x = "female", y="hippo", title="hippo vs female")

Ex2: Intercept and Slope

Ex2: Results and Interpretation

  • The intercept \(b_0=6.736\) is the estimated mean hippocampal volume for men (female=0)

  • The slope \(b_1=-0.482\) is the estimated difference between women and men in hippocampal volume

Code
summary(lm(hippo ~ female, data=alzheimer_healthy_old))$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  6.7356917 0.03272826 205.80660 0.000000e+00
female1     -0.4821982 0.04039219 -11.93791 1.990249e-31
  • Let’s verify this using a two-sample t-test
Code
t.test(hippo~female, data=alzheimer_healthy_old, var.equal =T)

    Two Sample t-test

data:  hippo by female
t = 11.938, df = 1477, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.4029661 0.5614304
sample estimates:
mean in group 0 mean in group 1 
       6.735692        6.253494 
Code
# equivalently, you can use
#t.test(alzheimer_healthy_old$hippo[alzheimer_healthy_old$female==1], alzheimer_healthy_old$hippo[alzheimer_healthy_old$female==0], var.equal =T)

Categorical Variables in Linear Regression

  • Categorical variables can be included in linear regression models as factors.
  • Consider the diagnosis variable in the alzheimer data, which has three levels: 0 (healthy), 1 (MCI), and 2 (AD).
  • What happens if we use it as a numeric variable? (This might not be appropriate in many situations.)
  • What happens if we use it as a categorical variable? (This is the most appropriate way in this case.)
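To preview what treating a variable "as a factor" means, here is a small sketch of R's dummy coding via model.matrix() on toy values (not the full data):

```r
# A factor with levels 0/1/2 is expanded into two 0/1 indicator columns;
# the first level (0 = healthy) becomes the reference group
d <- data.frame(diagnosis = factor(c(0, 1, 2, 1, 0)))
model.matrix(~ diagnosis, data = d)
```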

Ex3: When mistakenly treating diagnosis as a numerical variable

  • When using the numerical coding, the model assumes a linear relationship between the diagnosis code and hippocampal volume:
    • It assumes that the mean difference between MCI and healthy is the same as the mean difference between AD and MCI.
    • This might not be the most appropriate way.
Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% 
#  mutate(diagnosis = as.factor(diagnosis), female = as.factor(female), hippo=lhippo+rhippo)
  mutate(hippo=lhippo+rhippo)

#Note that diagnosis here was not converted to a factor; it is treated as a numerical variable
summary(lm(hippo ~ diagnosis, data=alzheimer_data))

Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5296 -0.5940 -0.0046  0.5760  3.4734 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.41399    0.02154   297.7   <2e-16 ***
diagnosis   -0.46342    0.02106   -22.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8761 on 2698 degrees of freedom
Multiple R-squared:  0.1522,    Adjusted R-squared:  0.1518 
F-statistic: 484.2 on 1 and 2698 DF,  p-value: < 2.2e-16

Ex3: When mistakenly treating diagnosis as a numerical variable

Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% mutate(female = as.factor(female), hippo = lhippo + rhippo)

# Fit model treating diagnosis as factor
fit <- lm(hippo ~ diagnosis, data = alzheimer_data)
coefs <- coef(fit)

# Model-based means for each group
model_means <- data.frame(
  diagnosis = c("Healthy", "MCI", "AD"),
  hippo = c(coefs[1], coefs[1] + coefs[2], coefs[1] + 2*coefs[2]),
  label = c("b0", "b0 + b1", "b0 + 2*b1")
)

alzheimer_data <- mutate(alzheimer_data, diagnosis = factor(diagnosis, levels=c("0","1","2"), labels = c("Healthy", "MCI", "AD")))
# Plot
ggplot(alzheimer_data, aes(x = diagnosis, y = hippo)) +
  geom_boxplot(fill = "#dceeff", outlier.shape = NA) +
  
  # Add model-predicted means and connect them
  geom_point(data = model_means, aes(y = hippo), color = "red", size = 3) +
  geom_line(data = model_means, aes( y = hippo, group = 1), color = "black", linewidth = 1) +
  
  # Label model means
  geom_text(data = model_means, aes(x=1:3+0.2,y = hippo + 0.5, label = label), color = "red", size = 4) +
  
  labs(
    x = "Diagnosis",
    y = "Total Hippocampal Volume",
    title = "Linear Model Estimates with Diagnosis as Numerical"
  ) +
  theme_minimal()

Use diagnosis as a Categorical Variable in Linear Regression

  • Instead, we can treat diagnosis as a categorical variable (factor).

  • R automatically does dummy coding to create two binary variables: one for MCI and one for AD, with healthy as the reference group.

  • For hippocampal volume,

    • the estimated coefficients are the mean differences between each diagnosis group and the reference group (healthy).
    • the estimated mean for healthy is 6.43, for MCI is \(6.43-0.57=5.86\), and for AD is \(6.43-0.89=5.54\).
Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% 
  mutate(diagnosis = as.factor(diagnosis), female = as.factor(female), hippo=lhippo+rhippo)

summary(lm(hippo ~ diagnosis, data=alzheimer_data))

Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5797 -0.5862 -0.0074  0.5770  3.4233 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.43207    0.02234  287.97   <2e-16 ***
diagnosis1  -0.57198    0.04180  -13.68   <2e-16 ***
diagnosis2  -0.89476    0.04339  -20.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8748 on 2697 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  0.1544 
F-statistic: 247.3 on 2 and 2697 DF,  p-value: < 2.2e-16

Ex3: Use diagnosis as a Categorical Variable in Linear Regression

  • We can also recode the values of diagnosis to make it more readable.
Code
alzheimer_data <- alzheimer_data %>% 
  mutate(diagnosis = recode(diagnosis, `0` = "Healthy", `1` = "MCI", `2` = "AD"), hippo=lhippo+rhippo)
summary(lm(hippo ~ diagnosis, data=alzheimer_data))

Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5797 -0.5862 -0.0074  0.5770  3.4233 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.43207    0.02234  287.97   <2e-16 ***
diagnosisMCI -0.57198    0.04180  -13.68   <2e-16 ***
diagnosisAD  -0.89476    0.04339  -20.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8748 on 2697 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  0.1544 
F-statistic: 247.3 on 2 and 2697 DF,  p-value: < 2.2e-16
  • If we want to use the AD group as the reference group, we can use the relevel function to change the reference level of the factor variable.
Code
alzheimer_data <- alzheimer_data %>% 
  mutate(diagnosis = relevel(diagnosis, ref = "AD"), hippo=lhippo+rhippo)
summary(lm(hippo ~ diagnosis, data=alzheimer_data))

Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5797 -0.5862 -0.0074  0.5770  3.4233 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.53731    0.03720 148.849  < 2e-16 ***
diagnosisHealthy  0.89476    0.04339  20.621  < 2e-16 ***
diagnosisMCI      0.32278    0.05131   6.291 3.66e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8748 on 2697 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  0.1544 
F-statistic: 247.3 on 2 and 2697 DF,  p-value: < 2.2e-16

Ex3: Use diagnosis as a Categorical Variable in Linear Regression

Linear Regression and ANOVA

  • For a binary X (Ex2), we have seen how to use linear regression to perform a two-sample t-test.
  • For a categorical X (Ex3), linear regression basically provides information about the group means. If we would like to test whether the categorical variable diagnosis is associated with hippo, we can use either linear regression or ANOVA.
  • ANOVA using the aov function

Conduct ANOVA Using linear regression

  • Two ways to perform ANOVA using linear regression:
    • Using the lm function to fit the model and then using anova to perform ANOVA
    • Using the aov function
Code
alzheimer_data <- read.csv('data/alzheimer_data.csv') %>% 
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>% mutate(
    diagnosis = factor(diagnosis, levels=c("0","1","2"), labels = c("Healthy", "MCI", "AD")), female = as.factor(female), hippo = lhippo + rhippo)
summary(aov(hippo~diagnosis, data=alzheimer_data))
              Df Sum Sq Mean Sq F value Pr(>F)    
diagnosis      2  378.6  189.28   247.3 <2e-16 ***
Residuals   2697 2064.0    0.77                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
summary(lm(hippo~diagnosis, data=alzheimer_data))

Call:
lm(formula = hippo ~ diagnosis, data = alzheimer_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5797 -0.5862 -0.0074  0.5770  3.4233 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.43207    0.02234  287.97   <2e-16 ***
diagnosisMCI -0.57198    0.04180  -13.68   <2e-16 ***
diagnosisAD  -0.89476    0.04339  -20.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8748 on 2697 degrees of freedom
Multiple R-squared:  0.155, Adjusted R-squared:  0.1544 
F-statistic: 247.3 on 2 and 2697 DF,  p-value: < 2.2e-16
Code
anova(lm(hippo~diagnosis, data=alzheimer_data))
Analysis of Variance Table

Response: hippo
            Df  Sum Sq Mean Sq F value    Pr(>F)    
diagnosis    2  378.56 189.281  247.33 < 2.2e-16 ***
Residuals 2697 2064.01   0.765                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Linear Regression

Ex4: Motivating Example

  • Do healthy men and women differ in hippocampal volume?
  • In Ex 2 we found that the difference is significant at level 0.05.
  • Men and women are also different in height.
  • We can use multiple linear regression to adjust for height.

Can the difference be explained by height?

Code
ggplot(alzheimer_healthy_old, aes(x=height, y=hippo, color=female))+geom_point()

The adjusted difference

Code
summary(lm(hippo ~ height + female, data=alzheimer_healthy_old))$coefficients
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)  1.80472670 0.469111040  3.847121 1.246078e-04
height       0.07133566 0.006771182 10.535185 4.498807e-25
female1     -0.08743904 0.054060094 -1.617442 1.059967e-01
  • The adjusted difference between men and women

    • is much smaller in magnitude (-0.087 vs -0.482)
    • is no longer significant at level 0.05 (p = 0.106)

Interpretation of the coefficients in a multiple regression

  • In this multiple regression, the fitted model is \[y=1.805 + 0.071 *height - 0.087 *female\]
  • The intercept 1.805 is the estimated mean hippocampal volume for a man whose height is 0 in (this is not a meaningful interpretation, but it is the intercept of the fitted model).
  • The slope 0.071 is the estimated increase in hippocampal volume for each one-inch increase in height, holding gender constant.
  • The slope -0.087 is the estimated difference between women and men with the same height.

Fitted Lines

  • There are two fitted lines
    • men: y=1.805 + 0.071 *height
    • women: y=(1.805- 0.087) + 0.071 *height
Code
ggplot(alzheimer_healthy_old, aes(x = height, y = hippo, color = female)) +
  geom_point(alpha = 0.6) +
  # note: geom_smooth fits a separate regression within each group, so the
  # slopes may differ slightly from the parallel lines of the additive model
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Height (in)",
    y = "Hippocampal Volume",
    color = "Gender",
    title = "Fitted Lines from hippo ~ height + female"
  ) +
  theme_minimal()

Model Diagnostics and Outliers

  • The typical assumptions of linear regression models are

    • Linearity
    • Independent observations
    • Constant variance and normality of the error term (residuals)
  • The first two assumptions are usually justified by our domain knowledge, our study design, and simple visualization of the data.

  • To investigate the validity of the third assumption, we can use diagnostic plots that visualize the residuals.

  • Outliers are points that do not follow the general pattern of the majority of the data. Visualization methods are often used to detect them.
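In R, the standard residual diagnostics come from plotting the fitted lm object. A sketch on the built-in cars data (used here only for availability):

```r
fit <- lm(dist ~ speed, data = cars)   # illustrative built-in data
par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
```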

Advanced Topics

Advanced Topics

  • Interactions
  • Transform X and Y

Interactions: Introduction

  • In Ex 4 we fit a multiple linear regression
Code
summary(lm(hippo ~ height + female, data=alzheimer_healthy_old))$coefficients
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)  1.80472670 0.469111040  3.847121 1.246078e-04
height       0.07133566 0.006771182 10.535185 4.498807e-25
female1     -0.08743904 0.054060094 -1.617442 1.059967e-01
  • There are two fitted lines
    • men: y=1.805 + 0.071 *height
    • women: y=(1.805- 0.087) + 0.071 *height

Interactions: Introduction

  • The two fitted lines in Ex4 are parallel to each other.
  • This is because we assumed an additive effect, i.e., gender and height contribute to hippocampal volume in an additive manner. In other words, the effect of height does not depend on gender.
  • In other examples, the effect of one explanatory variable on the response depends on another explanatory variable (or variables). In this case, we need to include an interaction term in the model.

Ex 5: A Model with Interactions

  • Let y be hippocampal volume
  • Example: \[y=\beta_0 + \beta_{age} age + \beta_f f + \beta_{age\times f} age\times f + \epsilon\] where \(f\) means female.
Code
summary(lm(hippo~ age*female, data=alzheimer_healthy_old))$coefficients
                Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  9.182369560 0.210806589  43.558266 3.789072e-267
age         -0.035269930 0.003008468 -11.723553  2.059224e-30
female1     -1.041026159 0.255296507  -4.077714  4.790840e-05
age:female1  0.007708627 0.003656786   2.108034  3.519639e-02

Data Visualization

Code
ggplot(alzheimer_healthy_old, aes(x=age, y=hippo, color=female))+
  geom_point()+
  geom_smooth(method=lm, se=F)

The fitted lines

  • What are the two fitted lines?

  • Female: \(\hat \beta_0 + \hat \beta_{age} age + \hat \beta_f \times 1 + \hat\beta_{age\times f} (age\times 1)\), which can be rewritten to:

\[(\hat \beta_0 + \hat \beta_f) + (\hat \beta_{age} + \hat\beta_{age\times f}) age.\]

  • Male, similarly, the fitted line is \[\hat \beta_0 + \hat \beta_{age} age \]

The interaction coefficient

  • The two groups have different slopes if and only if \(\hat\beta_{age\times f} \not = 0\).

  • In other words, \(\hat\beta_{age\times f}\) is the estimated difference in the slope of age for females vs for males.

  • Back to the problem: do men's and women's brains shrink at the same rate?

Interpret Interactions

Code
summary(lm(hippo~ age*female, data=alzheimer_healthy_old))$coefficients
                Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  9.182369560 0.210806589  43.558266 3.789072e-267
age         -0.035269930 0.003008468 -11.723553  2.059224e-30
female1     -1.041026159 0.255296507  -4.077714  4.790840e-05
age:female1  0.007708627 0.003656786   2.108034  3.519639e-02
  • The estimated rate of brain shrinkage for men is 0.035cc per year.
  • The estimated rate for women is \(|-0.035+0.0077|\approx 0.028\)cc per year.
  • The result for the interaction term indicates that the difference in shrinkage rate between men and women is 0.0077cc per year, which is statistically significant at significance level 0.05.

Transform \(X\) and \(Y\)

Interpret transformed data

  • Weight is a little bit skewed. Regress hippocampal volume on log(weight). How do we interpret the results?
Code
#plot(log(alzheimer_healthy_old$weight), alzheimer_healthy_old$hippo, xlab="log(weight)", ylab="hippo")
summary(lm(hippo~log(weight), data=alzheimer_healthy_old))$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 1.214054 0.46579899  2.60639 9.242282e-03
log(weight) 1.019022 0.09111367 11.18407 6.237857e-28

\(y=\beta_0 + \beta_1 log(x) + \epsilon\)

  • The estimated change in \(y\) is \(b_1=1.019\) when \(log(x)\) increases by 1, i.e., from \(log(x)\) to \(log(x)+1=log(x)+log(e)=log(ex)\).

  • Converting this to changes in \(x\): \(b_1=1.019\) is the estimated change in \(y\) (hippo volume) when \(x\) (weight) increases to \(ex\), i.e., when \(x\) increases by \(\frac{ex-x}{x}=e-1 \approx 1.72 = 172\%\).

  • Depending on the context, we might be interested in the estimated change in \(y\) for a \(1\%\) increase in \(x\): \(b_1 \times (log(1.01)-log(1)) \approx 0.01 \times b_1\).

  • Similarly, the estimated change in \(y\) for a \(10\%\) increase in \(x\) is \(b_1 \times (log(1.10)- log(1)) \approx 0.095 \times b_1\).
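These percent-change conversions are easy to verify numerically (the slope value is taken from the model output above):

```r
b1 <- 1.019                  # slope of log(weight) from the summary above
b1 * (log(1.01) - log(1))    # change in hippo volume for a 1% increase in weight
b1 * (log(1.10) - log(1))    # change for a 10% increase in weight
log(1.01); log(1.10)         # approximately 0.01 and 0.095
```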

\(log(y)=\beta_0 + \beta_1 x + \epsilon\)

  • Regress log(weight) on height. How do we interpret the results?
Code
#plot(alzheimer_healthy_old$height, log(alzheimer_healthy_old$weight), xlab="height", ylab="log(weight)")
summary(lm(log(weight)~height, data=alzheimer_healthy_old))$coefficients
             Estimate  Std. Error  t value      Pr(>|t|)
(Intercept) 3.1652744 0.080729831 39.20824 4.708633e-231
height      0.0296628 0.001230638 24.10360 1.591322e-108
  • \(b_1=0.0297\): when height increases by 1 unit (in), weight is estimated to increase by \((e^{0.0297}-1)\times 100\% \approx 3.0\%\).
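The percent-change conversion for a log(y) model can also be checked directly (the slope value is taken from the summary above):

```r
b1 <- 0.0296628              # slope of height from the model output above
(exp(b1) - 1) * 100          # percent increase in weight per extra inch of height
```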