Zhaoxia Yu
library(fabricerin)
library(tidyverse)
library(ggplot2)
# read the data and select variables;
# remove impaired subjects (diagnosis == 1) and define new variables
alzheimer_data <- read.csv('../data/alzheimer_data.csv') %>%
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>%
  filter(diagnosis != 1) %>%
  mutate(alzh = (diagnosis > 0) * 1, female = as.factor(female),
         hippo = lhippo + rhippo)
table(alzheimer_data$alzh)
0 1
1534 553
# fit a linear regression of alzh on total hippocampal volume
alzh.lm <- lm(alzh ~ hippo, data = alzheimer_data)
# add the fitted values from the linear model
alzheimer_data$lm.pred <- predict(alzh.lm)
# flag out-of-range predictions: blank out values inside (0, 1) so that
# only predictions below 0 or above 1 remain, to be drawn in red
alzheimer_data$lm.pred.red <- alzheimer_data$lm.pred
alzheimer_data$lm.pred.red[alzheimer_data$lm.pred < 1 &
                           alzheimer_data$lm.pred > 0] <- NA
ggplot(alzheimer_data) +
  geom_point(aes(x = alzh, y = lm.pred)) +
  geom_point(aes(x = alzh, y = lm.pred.red), col = "red") +
  geom_hline(yintercept = c(0, 1), col = "red") +
  labs(x = "true alzh", y = "predicted by lm")
Is alzh.lm a good model? Recall that alzh is a binary variable. We tried the linear regression alzh \(\sim\) hippo, but its predicted values of alzh can be greater than 1 or less than 0. Now consider situations where the response variable \(Y\) is a binary random variable (e.g., disease status).
Recall that a Bernoulli distribution can be used to characterize the behavior of a binary random variable.
\[Y \sim Bernoulli(p),\]
where \(p\) is the probability of the outcome of interest (denoted as 1) given the explanatory variables.
For an event \(A\), the odds are defined as \[odds=\frac{P(A)}{P(A^c)}=\frac{P(A)}{1-P(A)}.\] For a binary response, the odds of \(Y=1\) are \[odds = \frac{P(Y=1)}{P(Y=0)}=\frac{p}{1-p},\] and the log of the odds defines the logit function: \[logit(p)=log \left( \frac{p}{1-p}\right).\]
Linear regression is not a good choice for a binary response variable. For linear regression models, the response variable, \(Y\), is assumed to be a real-valued continuous random variable. When the response variable is binary, such as alzh, how should we model it using one or more covariates? For such problems, it is common to use logistic regression instead:
\[\begin{align*} \log \left(\frac{\hat{p}}{1- \hat{p}} \right) & = b_{0} + b_{1}x_1 + \ldots + b_{q}x_{q} \end{align*}\]
The term \(\Big(\frac{\hat{p}}{1- \hat{p}} \Big)\) is called the odds of \(Y=1\).
The term \(\log \Big(\frac{\hat{p}}{1- \hat{p}} \Big)\), i.e., log of odds, is called the logit function.
Although \(p\) is a real number between 0 and 1, its logit transformation can be any real number from \(-\infty\) to \(+\infty\).
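A quick numerical check of this range property, using base R's qlogis (the logit) and plogis (its inverse):
p <- c(0.001, 0.25, 0.5, 0.75, 0.999)
qlogis(p)            # logit: maps (0, 1) onto the whole real line
plogis(qlogis(p))    # the inverse logit recovers p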
As an example, we use the birthwt data set to model the relationship between having low birthweight babies (a binary variable), \(Y\), and smoking during pregnancy, \(X\). The binary variable low identifies low birthweight babies (low = 1 for low birthweight babies, and 0 otherwise). The binary variable smoke identifies mothers who smoked during pregnancy (smoke = 1 for smoking during pregnancy, and 0 otherwise).
The chisq.test function can be used for testing the association between them.
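A minimal sketch of that test, assuming the birthwt data from the MASS package:
# chi-squared test of the association between low and smoke
with(MASS::birthwt, chisq.test(table(smoke, low)))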
A logistic regression provides more information. We use the glm() function in R to fit the regression model.
Rows: 189
Columns: 10
$ low <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
$ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
$ race <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
$ smoke <fct> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
$ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ ht <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ui <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
$ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
$ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
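The output above and the model printout below can be reproduced along the following lines. This is a plausible reconstruction, not the original chunk: it assumes birthwt is MASS::birthwt with the categorical columns converted to factors (matching the <fct> types above), and model1 is an illustrative name.
# prepare the data: convert the categorical columns to factors
birthwt <- MASS::birthwt %>%
  mutate(across(c(low, race, smoke, ht, ui), as.factor))
glimpse(birthwt)
# fit the logistic regression of low on smoke
model1 <- glm(low ~ smoke, family = "binomial", data = birthwt)
model1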
Call: glm(formula = low ~ smoke, family = "binomial", data = birthwt)
Coefficients:
(Intercept) smoke1
-1.0871 0.7041
Degrees of Freedom: 188 Total (i.e. Null); 187 Residual
Null Deviance: 234.7
Residual Deviance: 229.8 AIC: 233.8
Calling summary() on the fitted model gives more detail:
Call:
glm(formula = low ~ smoke, family = "binomial", data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0197 -0.7623 -0.7623 1.3438 1.6599
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smoke1 0.7041 0.3196 2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 229.80 on 187 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 4
For the above example, the estimated values of the intercept and the regression coefficient are \(b_0=-1.09\) and \(b_1=0.70\), respectively.
Therefore,
\[\begin{align*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(-1.09 + 0.70x) \end{align*}\]
Here, \(\hat{p}\) is the estimated probability of having a low birthweight baby for a given \(x\).
The left-hand side of the above equation is the estimated odds of having a low birthweight baby.
For non-smoking mothers, \(x=0\), the odds of having a low birthweight baby is \[\begin{align*} \frac{\hat{p}_{0}}{1- \hat{p}_{0}} & = & \exp(-1.09) \\ & = & 0.34 \end{align*}\]
That is, the exponential of the intercept is the odds when \(x=0\), which is sometimes referred to as the baseline odds.
For mothers who smoke during pregnancy, \(x=1\), and \[\begin{align*} \frac{\hat{p}_{1}}{1- \hat{p}_{1}} & = & \exp(-1.09 + 0.7)\\ & = & \exp(-1.09) \exp(0.7)\\ & = & 0.68 \end{align*}\]
As we can see, corresponding to a one-unit increase in \(x\) from \(x=0\) (non-smoking) to \(x=1\) (smoking), the odds increase multiplicatively by the exponential of the regression coefficient.
\[\begin{align*} \frac{\frac{\hat{p}_{1}}{1- \hat{p}_{1}}}{\frac{\hat{p}_{0}}{1- \hat{p}_{0}}} & = & \frac{\exp(-1.09) \exp(0.7)}{\exp(-1.09)} = \exp(0.7) = 2.01 \end{align*}\]
We can interpret the exponential of the regression coefficient as the odds ratio of having low birthweight babies for smoking mothers compared to non-smoking mothers.
Here, the estimated odds ratio is \(\exp(0.7) = 2.01\), so the odds of having a low birthweight baby roughly double for smoking mothers compared to non-smoking mothers.
In general,
if \(b_1>0\), then \(\exp(b_1) > 1\), so the odds increase as \(X\) increases;
if \(b_1<0\), then \(0 < \exp(b_1) < 1\), so the odds decrease as \(X\) increases;
if \(b_1=0\), the odds ratio is 1, so the odds do not change with \(X\) according to the assumed model.
Solving the logit equation for \(\hat{p}\) gives \[\begin{align*} \hat{p} & = & \frac{\exp(b_0 + b_1x)}{1 + \exp(b_0 + b_1x)} \end{align*}\]
Therefore, the estimated probability of having a low birthweight baby for non-smoking mothers, \(x=0\), is \[\begin{align*} \hat{p} & = & \frac{\exp(-1.09)}{1 + \exp(-1.09)} = 0.25 \end{align*}\]
This probability increases for mothers who smoke during pregnancy, \(x=1\), \[\begin{align*} \hat{p} & = & \frac{\exp(-1.09 + 0.7)}{1 + \exp(-1.09 + 0.7)} = 0.40 \end{align*}\]
That is, the estimated risk of having a low birthweight baby increases by 60% (from 0.25 to 0.40) if a mother smokes during her pregnancy.
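These fitted probabilities can be obtained directly with predict; a minimal sketch, reusing the illustrative model1 object from above:
# fitted probabilities for a non-smoking (x = 0) and a smoking (x = 1) mother
newdat <- data.frame(smoke = factor(c(0, 1)))
predict(model1, newdata = newdat, type = "response")   # about 0.25 and 0.40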
For the most part, we follow similar steps to fit the model, estimate regression parameters, perform hypothesis testing, and predict unknown values of the response variable.
As an example, we want to investigate the relationship between having a low birthweight baby, \(Y\), and mother’s age at the time of pregnancy, \(X\).
Finding confidence intervals and performing hypothesis testing remain as before, so we focus on prediction and interpreting the point estimates.
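A sketch of this fit (the object name model_age is illustrative):
# logistic regression of low birthweight on mother's age
model_age <- glm(low ~ age, family = "binomial", data = birthwt)
coef(model_age)   # approximately b0 = 0.38 and b1 = -0.05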
For the above example, the point estimates for the regression parameters are \(b_0=0.38\) and \(b_1=-0.05\).
While the intercept is the log odds when \(x=0\), it is not reasonable to interpret its exponential as the baseline odds since mother’s age cannot be zero.
The odds ratio for comparing 21 year old mothers to 20 year old mothers is \[\begin{align*} \frac{\frac{\hat{p}_{21}}{1- \hat{p}_{21}}}{\frac{\hat{p}_{20}}{1- \hat{p}_{20}}} & = & \frac{ \exp(0.38) \exp(- 0.05 \times 21)}{ \exp(0.38) \exp(- 0.05 \times 20)}\\ & = & \exp(- 0.05 \times 21 + 0.05 \times 20)\\ & = & \exp(- 0.05) \end{align*}\]
Therefore, \(\exp(b_1)\) is the estimated odds ratio comparing 21 year old mothers to 20 year old mothers.
\[\begin{align*} \frac{\frac{\hat{p}_{x+1}}{1- \hat{p}_{x+1}}}{\frac{\hat{p}_{x}}{1- \hat{p}_{x}}} & = & \exp(b_1) \end{align*}\]
As before, we can use the estimated regression parameters to find \(\hat{p}\) and predict the unknown value of the response variable. \[\begin{align*} \hat{p} & = & \frac{\exp(b_0 + b_1x)}{1 + \exp(b_0 + b_1x)} = \frac{\exp(0.38 -0.05x)}{1 + \exp(0.38 -0.05x)}. \end{align*}\]
For example, for mothers who are 20 years old at the time of pregnancy, the estimated probability of having a low birthweight baby is \[\begin{align*} \hat{p} & = & \frac{\exp(0.38 -0.05 \times 20)}{1 + \exp(0.38 -0.05 \times 20)} = 0.35. \end{align*}\]
Including multiple explanatory variables (predictors) in a logistic regression model is easy.
Similar to linear regression models, we specify the model formula by entering the response variable on the left side of the "~" symbol and the explanatory variables (separated by "+" signs) on the right side, as in the sketch below.
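A sketch of the fit whose output follows (model2 is an illustrative name):
# logistic regression of low on both age and smoke
model2 <- glm(low ~ age + smoke, family = "binomial", data = birthwt)
model2            # brief printout
summary(model2)   # full summary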
Call: glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
Coefficients:
(Intercept) age smoke1
0.06091 -0.04978 0.69185
Degrees of Freedom: 188 Total (i.e. Null); 186 Residual
Null Deviance: 234.7
Residual Deviance: 227.3 AIC: 233.3
Call:
glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1589 -0.8668 -0.7470 1.2821 1.7925
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.06091 0.75732 0.080 0.9359
age -0.04978 0.03197 -1.557 0.1195
smoke1 0.69185 0.32181 2.150 0.0316 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 227.28 on 186 degrees of freedom
AIC: 233.28
Number of Fisher Scoring iterations: 4
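Returning to the Alzheimer's data, the coefficient table below is consistent with a logistic regression of alzh on total hippocampal volume. A minimal sketch (alzh.glm is an illustrative name; alzheimer_data is the data prepared earlier):
# logistic regression of AD status on total hippocampal volume
alzh.glm <- glm(alzh ~ hippo, family = "binomial", data = alzheimer_data)
summary(alzh.glm)$coefficients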
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.093409 0.41017858 14.85550 6.408616e-50
hippo -1.184699 0.06916233 -17.12926 8.979352e-66
Note that glm stands for generalized linear model, a family that includes linear regression and logistic regression as special cases.
Here, \(b_0=6.093409\) and \(b_1=-1.184699\).
How should we interpret these values?
Consider two values of hippocampal volume that differ by one unit, \(x_0\) and \(x_0+1\): \[logit(\hat p_0) = b_0 + b_1 x_0\] \[logit(\hat p_1) = b_0 + b_1 (x_0+1)\]
Their difference, \(b_1\), is the change in log odds associated with a one-unit increase in \(x\). How can we convert it to a change in odds, which is easier to interpret?
By the definition of logit, we have
\[logit(\hat p_1) - logit(\hat p_0)=log\frac{\hat p_1}{1-\hat p_1} -log \frac{\hat p_0}{1-\hat p_0}=log \frac{\frac{\hat p_1}{1-\hat p_1}}{\frac{\hat p_0}{1-\hat p_0}}=-1.184699\]
\[\frac{\frac{\hat p_1}{1-\hat p_1}}{\frac{\hat p_0}{1-\hat p_0}}=exp(-1.184699)=0.3058\]
The confint function in R can be used to find confidence intervals for the log odds. Recall that the estimated decrease in the odds of AD associated with a one-unit increase in hippocampal volume is \(1-0.3058=69.42\%\). A 95% confidence interval is [64.98%, 73.29%].
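A sketch of that computation, reusing the illustrative alzh.glm fit from above (confint profiles the likelihood, so the interval may differ slightly from a Wald interval):
# 95% CI for the hippo coefficient on the log-odds scale,
# converted to a percent decrease in odds
ci.hippo <- confint(alzh.glm, "hippo", level = 0.95)
sort(1 - exp(ci.hippo))   # approximately [64.98%, 73.29%]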
When using predict, it is necessary to specify the correct type of prediction: the "response" type gives predicted probabilities, whereas the default, "link", gives predicted log odds.
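A minimal illustration with the alzh.glm sketch from above:
# default type = "link" returns predicted log odds;
# type = "response" returns predicted probabilities
head(predict(alzh.glm, type = "link"))
head(predict(alzh.glm, type = "response"))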
Similar to linear regression, we can include multiple explanatory variables in logistic regression, connecting \(p\) (i.e., \(P(Y=1)\)) to a linear function of the predictors: \[log\frac{\hat p}{1-\hat p} = b_0 + b_1 x_{1} + \ldots + b_p x_{p}\]
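The coefficient table below is consistent with a logistic regression of alzh on age, gender, and education. A plausible sketch (alzh.glm2 is an illustrative name):
# logistic regression of AD status on age, gender, and education
alzh.glm2 <- glm(alzh ~ age + female + educ, family = "binomial",
                 data = alzheimer_data)
summary(alzh.glm2)$coefficients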
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02463880 0.455844285 -4.441514 8.932811e-06
age 0.04015417 0.005001781 8.027974 9.909571e-16
female1 -0.87341587 0.105761683 -8.258339 1.477131e-16
educ -0.08759757 0.015192451 -5.765862 8.124175e-09
Holding gender and education fixed, the odds of AD one year later are \(exp(0.04015417)=1.041\) times the current odds; that is, the estimated increase in the odds of AD per year of age is \(e^{0.04015417}-1=4.097\%\).
A 95% confidence interval is [3.08%, 5.12%].
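A sketch of that interval, reusing the illustrative alzh.glm2 fit (a profile-likelihood CI, so the values are approximate):
# 95% CI for the age coefficient, expressed as a percent change in odds
ci.age <- confint(alzh.glm2, "age", level = 0.95)
exp(ci.age) - 1   # approximately [3.08%, 5.12%]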