Zhaoxia Yu
library(fabricerin)
library(tidyverse)
library(ggplot2)
# read the data and select variables;
# remove impaired subjects (diagnosis == 1) and define new variables
alzheimer_data <- read.csv('../data/alzheimer_data.csv') %>%
  select(id, diagnosis, age, educ, female, height, weight, lhippo, rhippo) %>%
  filter(diagnosis != 1) %>%
  mutate(alzh = (diagnosis > 0) * 1, female = as.factor(female),
         hippo = lhippo + rhippo)
table(alzheimer_data$alzh)
0 1
1534 553
# fit a linear regression of alzh on total hippocampal volume
alzh.lm <- lm(alzh ~ hippo, data = alzheimer_data)
# add the fitted values from the linear model
alzheimer_data$lm.pred <- predict(alzh.lm)
# flag out-of-range predictions: blank out values inside (0, 1) so that
# only predictions below 0 or above 1 remain, to be drawn in red
alzheimer_data$lm.pred.red <- alzheimer_data$lm.pred
alzheimer_data$lm.pred.red[alzheimer_data$lm.pred < 1 &
                           alzheimer_data$lm.pred > 0] <- NA
ggplot(alzheimer_data) +
  geom_point(aes(x = alzh, y = lm.pred)) +
  geom_point(aes(x = alzh, y = lm.pred.red), col = "red") +
  geom_hline(yintercept = c(0, 1), col = "red") +
  labs(x = "true alzh", y = "predicted by lm")
Is alzh.lm a good model? Recall that alzh is a binary variable. We tried the linear regression alzh \(\sim\) hippo, but its predicted values of alzh can be greater than 1 or less than 0. Now consider situations where the response variable \(Y\) is a binary random variable (e.g., disease status).
Recall that a Bernoulli distribution can be used to characterize the behavior of a binary random variable.
\[Y \sim Bernoulli(p),\]
where \(p\) is the probability of the outcome of interest (denoted as 1) given the explanatory variables.
For an event \(A\), the odds are defined as \[odds=\frac{P(A)}{P(A^c)}=\frac{P(A)}{1-P(A)}.\] For a binary response, the odds of \(Y=1\) are \[odds = \frac{P(Y=1)}{P(Y=0)}=\frac{p}{1-p},\] and the log of the odds defines the logit function: \[logit(p)=log \left( \frac{p}{1-p}\right).\]
Linear regression is not a good choice for a binary response variable. For linear regression models, the response variable, \(Y\), is assumed to be a real-valued continuous random variable. When the response variable is binary, such as alzh, how should we model it using one or more covariates? For such problems, it is common to use logistic regression instead:
\[\begin{align*} \log \left(\frac{\hat{p}}{1- \hat{p}} \right) & = b_{0} + b_{1}x_1 + \ldots + b_{q}x_{q} \end{align*}\]
The term \(\Big(\frac{\hat{p}}{1- \hat{p}} \Big)\) is called the odds of \(Y=1\).
The term \(\log \Big(\frac{\hat{p}}{1- \hat{p}} \Big)\), i.e., log of odds, is called the logit function.
Although \(p\) is a real number between 0 and 1, its logit transformation can be any real number from \(-\infty\) to \(+\infty\).
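A quick numerical check of this range property, using base R's qlogis (the logit) and plogis (its inverse):
p <- c(0.001, 0.25, 0.5, 0.75, 0.999)
qlogis(p)            # logit: maps (0, 1) onto the whole real line
plogis(qlogis(p))    # the inverse logit recovers p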
As an example, we use the birthwt data set to model the relationship between having low birthweight babies (a binary variable), \(Y\), and smoking during pregnancy, \(X\). The binary variable low identifies low birthweight babies (low = 1 for low birthweight babies, and 0 otherwise). The binary variable smoke identifies mothers who smoked during pregnancy (smoke = 1 for smoking during pregnancy, and 0 otherwise).
The chisq.test function can be used for testing the association between them.
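A minimal sketch of that test, assuming the birthwt data from the MASS package:
# chi-squared test of the association between low and smoke
with(MASS::birthwt, chisq.test(table(smoke, low)))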
A logistic regression provides more information. We use the glm() function in R to fit the regression model.
Rows: 189
Columns: 10
$ low <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
$ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
$ race <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
$ smoke <fct> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
$ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ ht <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ui <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
$ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
$ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
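The output above and the model printout below can be reproduced along the following lines. This is a plausible reconstruction, not the original chunk: it assumes birthwt is MASS::birthwt with the categorical columns converted to factors (matching the <fct> types above), and model1 is an illustrative name.
# prepare the data: convert the categorical columns to factors
birthwt <- MASS::birthwt %>%
  mutate(across(c(low, race, smoke, ht, ui), as.factor))
glimpse(birthwt)
# fit the logistic regression of low on smoke
model1 <- glm(low ~ smoke, family = "binomial", data = birthwt)
model1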
Call: glm(formula = low ~ smoke, family = "binomial", data = birthwt)
Coefficients:
(Intercept) smoke1
-1.0871 0.7041
Degrees of Freedom: 188 Total (i.e. Null); 187 Residual
Null Deviance: 234.7
Residual Deviance: 229.8 AIC: 233.8
Calling summary() on the fitted model gives more detail:
Call:
glm(formula = low ~ smoke, family = "binomial", data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0197 -0.7623 -0.7623 1.3438 1.6599
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smoke1 0.7041 0.3196 2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 229.80 on 187 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 4
For the above example, the estimated values of the intercept and the regression coefficient are \(b_0=-1.09\) and \(b_1=0.70\), respectively.
Therefore,
\[\begin{align*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(-1.09 + 0.70x) \end{align*}\]
Here, \(\hat{p}\) is the estimated probability of having a low birthweight baby for a given \(x\).
The left-hand side of the above equation is the estimated odds of having a low birthweight baby.
For non-smoking mothers, \(x=0\), the odds of having a low birthweight baby is \[\begin{align*} \frac{\hat{p}_{0}}{1- \hat{p}_{0}} & = & \exp(-1.09) \\ & = & 0.34 \end{align*}\]
That is, the exponential of the intercept is the odds when \(x=0\), which is sometimes referred to as the baseline odds.
For mothers who smoke during pregnancy, \(x=1\), and \[\begin{align*} \frac{\hat{p}_{1}}{1- \hat{p}_{1}} & = & \exp(-1.09 + 0.7)\\ & = & \exp(-1.09) \exp(0.7)\\ & = & 0.68 \end{align*}\]
As we can see, corresponding to a one-unit increase in \(x\) from \(x=0\) (non-smoking) to \(x=1\) (smoking), the odds increase multiplicatively by the exponential of the regression coefficient.
\[\begin{align*} \frac{\frac{\hat{p}_{1}}{1- \hat{p}_{1}}}{\frac{\hat{p}_{0}}{1- \hat{p}_{0}}} & = & \frac{\exp(-1.09) \exp(0.7)}{\exp(-1.09)} = \exp(0.7) = 2.01 \end{align*}\]
We can interpret the exponential of the regression coefficient as the odds ratio of having low birthweight babies for smoking mothers compared to non-smoking mothers.
Here, the estimated odds ratio is \(\exp(0.7) = 2.01\), so the odds of having a low birthweight baby roughly double for smoking mothers compared to non-smoking mothers.
In general,
if \(b_1>0\), then \(\exp(b_1) > 1\), so the odds increase as \(X\) increases;
if \(b_1<0\), then \(0 < \exp(b_1) < 1\), so the odds decrease as \(X\) increases;
if \(b_1=0\), the odds ratio is 1, so the odds do not change with \(X\) according to the assumed model.
Solving the logit equation for \(\hat{p}\) gives \[\begin{align*} \hat{p} & = & \frac{\exp(b_0 + b_1x)}{1 + \exp(b_0 + b_1x)} \end{align*}\]
Therefore, the estimated probability of having a low birthweight baby for non-smoking mothers, \(x=0\), is \[\begin{align*} \hat{p} & = & \frac{\exp(-1.09)}{1 + \exp(-1.09)} = 0.25 \end{align*}\]
This probability increases for mothers who smoke during pregnancy, \(x=1\), \[\begin{align*} \hat{p} & = & \frac{\exp(-1.09 + 0.7)}{1 + \exp(-1.09 + 0.7)} = 0.40 \end{align*}\]
That is, the estimated risk of having a low birthweight baby increases by 60% (from 0.25 to 0.40) if a mother smokes during her pregnancy.
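These fitted probabilities can be obtained directly with predict; a minimal sketch, reusing the illustrative model1 object from above:
# fitted probabilities for a non-smoking (x = 0) and a smoking (x = 1) mother
newdat <- data.frame(smoke = factor(c(0, 1)))
predict(model1, newdata = newdat, type = "response")   # about 0.25 and 0.40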
For the most part, we follow similar steps to fit the model, estimate regression parameters, perform hypothesis testing, and predict unknown values of the response variable.
As an example, we want to investigate the relationship between having a low birthweight baby, \(Y\), and mother’s age at the time of pregnancy, \(X\).
Finding confidence intervals and performing hypothesis testing remain as before, so we focus on prediction and interpreting the point estimates.
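A sketch of this fit (the object name model_age is illustrative):
# logistic regression of low birthweight on mother's age
model_age <- glm(low ~ age, family = "binomial", data = birthwt)
coef(model_age)   # approximately b0 = 0.38 and b1 = -0.05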
For the above example, the point estimates for the regression parameters are \(b_0=0.38\) and \(b_1=-0.05\).
While the intercept is the log odds when \(x=0\), it is not reasonable to interpret its exponential as the baseline odds since mother’s age cannot be zero.
The odds ratio for comparing 21 year old mothers to 20 year old mothers is \[\begin{align*} \frac{\frac{\hat{p}_{21}}{1- \hat{p}_{21}}}{\frac{\hat{p}_{20}}{1- \hat{p}_{20}}} & = & \frac{ \exp(0.38) \exp(- 0.05 \times 21)}{ \exp(0.38) \exp(- 0.05 \times 20)}\\ & = & \exp(- 0.05 \times 21 + 0.05 \times 20)\\ & = & \exp(- 0.05) \end{align*}\]
Therefore, \(\exp(b_1)\) is the estimated odds ratio comparing 21 year old mothers to 20 year old mothers.
\[\begin{align*} \frac{\frac{\hat{p}_{x+1}}{1- \hat{p}_{x+1}}}{\frac{\hat{p}_{x}}{1- \hat{p}_{x}}} & = & \exp(b_1) \end{align*}\]
As before, we can use the estimated regression parameters to find \(\hat{p}\) and predict the unknown value of the response variable. \[\begin{align*} \hat{p} & = & \frac{\exp(b_0 + b_1x)}{1 + \exp(b_0 + b_1x)} = \frac{\exp(0.38 -0.05x)}{1 + \exp(0.38 -0.05x)}. \end{align*}\]
For example, for mothers who are 20 years old at the time of pregnancy, the estimated probability of having a low birthweight baby is \[\begin{align*} \hat{p} & = & \frac{\exp(0.38 -0.05 \times 20)}{1 + \exp(0.38 -0.05 \times 20)} = 0.35. \end{align*}\]
Including multiple explanatory variables (predictors) in a logistic regression model is easy.
Similar to linear regression models, we specify the model formula by entering the response variable on the left side of the "~" symbol and the explanatory variables (separated by "+" signs) on the right side, as in the sketch below.
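A sketch of the fit whose output follows (model2 is an illustrative name):
# logistic regression of low on both age and smoke
model2 <- glm(low ~ age + smoke, family = "binomial", data = birthwt)
model2            # brief printout
summary(model2)   # full summary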
Call: glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
Coefficients:
(Intercept) age smoke1
0.06091 -0.04978 0.69185
Degrees of Freedom: 188 Total (i.e. Null); 186 Residual
Null Deviance: 234.7
Residual Deviance: 227.3 AIC: 233.3
Call:
glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1589 -0.8668 -0.7470 1.2821 1.7925
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.06091 0.75732 0.080 0.9359
age -0.04978 0.03197 -1.557 0.1195
smoke1 0.69185 0.32181 2.150 0.0316 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 227.28 on 186 degrees of freedom
AIC: 233.28
Number of Fisher Scoring iterations: 4
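Returning to the Alzheimer's data, the coefficient table below is consistent with a logistic regression of alzh on total hippocampal volume. A minimal sketch (alzh.glm is an illustrative name; alzheimer_data is the data prepared earlier):
# logistic regression of AD status on total hippocampal volume
alzh.glm <- glm(alzh ~ hippo, family = "binomial", data = alzheimer_data)
summary(alzh.glm)$coefficients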
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.093409 0.41017858 14.85550 6.408616e-50
hippo -1.184699 0.06916233 -17.12926 8.979352e-66
Note that glm stands for generalized linear model, a family that includes linear regression and logistic regression as special cases.
Here, \(b_0=6.093409\) and \(b_1=-1.184699\).
How should we interpret these values?
Consider two values of hippocampal volume that differ by one unit, \(x_0\) and \(x_0+1\): \[logit(\hat p_0) = b_0 + b_1 x_0\] \[logit(\hat p_1) = b_0 + b_1 (x_0+1)\]
Their difference, \(b_1\), is the change in log odds associated with a one-unit increase in \(x\). How can we convert it to a change in odds, which is easier to interpret?
By the definition of logit, we have
\[logit(\hat p_1) - logit(\hat p_0)=log\frac{\hat p_1}{1-\hat p_1} -log \frac{\hat p_0}{1-\hat p_0}=log \frac{\frac{\hat p_1}{1-\hat p_1}}{\frac{\hat p_0}{1-\hat p_0}}=-1.184699\]
\[\frac{\frac{\hat p_1}{1-\hat p_1}}{\frac{\hat p_0}{1-\hat p_0}}=exp(-1.184699)=0.3058\]
The confint function in R can be used to find confidence intervals for the log odds. Recall that the estimated decrease in the odds of AD associated with a one-unit increase in hippocampal volume is \(1-0.3058=69.42\%\). A 95% confidence interval is [64.98%, 73.29%].
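A sketch of that computation, reusing the illustrative alzh.glm fit from above (confint profiles the likelihood, so the interval may differ slightly from a Wald interval):
# 95% CI for the hippo coefficient on the log-odds scale,
# converted to a percent decrease in odds
ci.hippo <- confint(alzh.glm, "hippo", level = 0.95)
sort(1 - exp(ci.hippo))   # approximately [64.98%, 73.29%]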
When using predict, it is necessary to specify the correct type of prediction: the "response" type gives predicted probabilities, whereas the default, "link", gives predicted log odds.
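A minimal illustration with the alzh.glm sketch from above:
# default type = "link" returns predicted log odds;
# type = "response" returns predicted probabilities
head(predict(alzh.glm, type = "link"))
head(predict(alzh.glm, type = "response"))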
Similar to linear regression, we can include multiple explanatory variables in logistic regression, connecting \(p\) (i.e., \(P(Y=1)\)) to a linear function of the predictors: \[log\frac{\hat p}{1-\hat p} = b_0 + b_1 x_{1} + \ldots + b_p x_{p}\]
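The coefficient table below is consistent with a logistic regression of alzh on age, gender, and education. A plausible sketch (alzh.glm2 is an illustrative name):
# logistic regression of AD status on age, gender, and education
alzh.glm2 <- glm(alzh ~ age + female + educ, family = "binomial",
                 data = alzheimer_data)
summary(alzh.glm2)$coefficients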
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02463880 0.455844285 -4.441514 8.932811e-06
age 0.04015417 0.005001781 8.027974 9.909571e-16
female1 -0.87341587 0.105761683 -8.258339 1.477131e-16
educ -0.08759757 0.015192451 -5.765862 8.124175e-09
Holding gender and education fixed, the odds of AD one year later are \(exp(0.04015417)=1.041\) times the current odds; that is, the estimated increase in the odds of AD per year of age is \(e^{0.04015417}-1=4.097\%\).
A 95% confidence interval is [3.08%, 5.12%].
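A sketch of that interval, reusing the illustrative alzh.glm2 fit (a profile-likelihood CI, so the values are approximate):
# 95% CI for the age coefficient, expressed as a percent change in odds
ci.age <- confint(alzh.glm2, "age", level = 0.95)
exp(ci.age) - 1   # approximately [3.08%, 5.12%]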