class: title-slide <br> <br>

.right-panel[
# Logistic Regression
### Jessica Jaynes
]

<style type="text/css">
body, td { font-size: 14px; }
code.r { font-size: 20px; }
pre { font-size: 20px; }
</style>

---
### Introduction

- For linear regression models, the response variable, `\(Y\)`, is assumed to be a real-valued continuous random variable.
- Now consider situations where the response variable is a binary random variable (e.g., disease status).
- For such problems, it is common to use **logistic regression** instead:
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}}{1- \hat{p}} \Big) & = & a + b_{1}x_1 + \ldots + b_{q}x_{q} \end{eqnarray*}$$`

---
### Logistic regression

- Note that for binary random variables, we have `\(p = P(Y=1|X)\)`; that is, `\(p\)` is the probability of the outcome of interest (denoted as 1) given the explanatory variables.
- The term `\(\frac{\hat{p}}{1- \hat{p}}\)` is called the **odds** of `\(Y=1\)`.
- The term `\(\log \Big(\frac{\hat{p}}{1- \hat{p}} \Big)\)`, i.e., the log of the odds, is called the **logit** function.
- Although `\(p\)` is a real number between 0 and 1, its logit transformation can be any real number from `\(-\infty\)` to `\(+\infty\)`.

---
### Logistic regression

- We can exponentiate both sides:
`$$\begin{eqnarray*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(a + b_{1}x_1 + \ldots + b_{q}x_{q}) \end{eqnarray*}$$`
- Then, we can solve for `\(\hat{p}\)` using the **logistic function**:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})}{1 + \exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})} \end{eqnarray*}$$`
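---
### The logit and logistic functions in R

- Before fitting any models, note that base R already implements this pair of transformations: `qlogis()` is the logit function and `plogis()` is the logistic function. The sketch below (illustrative values only, not part of the analysis that follows) checks that each inverts the other.

```r
# logit: maps a probability in (0, 1) to the whole real line
qlogis(0.25)       # log(0.25 / 0.75) = -1.0986...

# logistic: maps any real number back into (0, 1)
plogis(-1.0986)    # approximately 0.25

# applying the logistic function to a linear predictor a + b*x
a <- -1; b <- 0.7; x <- 1   # hypothetical coefficient values
plogis(a + b * x)  # equals exp(a + b*x) / (1 + exp(a + b*x))
```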
---
### Logistic Regression with One Binary Predictor

- As an example, we use the `birthwt` data set to model the relationship between having low birthweight babies (a binary variable), `\(Y\)`, and smoking during pregnancy, `\(X\)`.
- The binary variable `low` identifies low birthweight babies (`low = 1` for low birthweight babies, and 0 otherwise).
- The binary variable `smoke` identifies mothers who smoked during pregnancy (`smoke = 1` for smoking during pregnancy, and 0 otherwise).

---
### Generalized linear model (glm) in R

- We can use the `glm()` function in R to fit a logistic regression model. First, we load the data and convert the categorical variables to factors. Note that `%>%`, `mutate()`, `across()`, and `glimpse()` come from **dplyr**, so it must be loaded along with **MASS**.

```r
library(MASS)
library(dplyr)

data(birthwt)
birthwt <- birthwt %>% mutate(across(c(low, smoke, race, ht, ui), factor))
glimpse(birthwt)
```

```
## Rows: 189
## Columns: 10
## $ low   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age   <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
## $ lwt   <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
## $ race  <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
## $ smoke <fct> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
## $ ptl   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ht    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui    <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
## $ ftv   <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
## $ bwt   <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
```

---
### Generalized linear model (glm) in R

```r
fit <- glm(low ~ smoke, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)       smoke1
##     -1.0871       0.7041
## 
## Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
## Null Deviance:       234.7
## Residual Deviance: 229.8    AIC: 233.8
```

---
### Generalized linear model (glm) in R

- We can use the `confint()` function to obtain confidence intervals for the regression parameters.

```r
confint(fit)
```

```
##                  2.5 %    97.5 %
## (Intercept) -1.5243118 -0.679205
## smoke1       0.0786932  1.335154
```

---
### Generalized linear model (glm) in R

- The `summary()` function provides standard errors and hypothesis tests for the regression parameters.

```r
summary(fit)
```

```
## 
## Call:
## glm(formula = low ~ smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
## smoke1        0.7041     0.3196   2.203   0.0276 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 229.80  on 187  degrees of freedom
## AIC: 233.8
## 
## Number of Fisher Scoring iterations: 4
```

---
### Estimation

- For the above example, the estimated values of the intercept `\(\alpha\)` and the regression coefficient `\(\beta\)` are `\(a=-1.09\)` and `\(b=0.70\)`, respectively.
- Therefore,
`$$\begin{eqnarray*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(-1.09 + 0.70x) \end{eqnarray*}$$`
- Here, `\(\hat{p}\)` is the estimated probability of having a low birthweight baby for a given `\(x\)`.
- The left-hand side of the above equation is the estimated odds of having a low birthweight baby.

---
### Estimation

- For non-smoking mothers, `\(x=0\)`, the odds of having a low birthweight baby are
`$$\begin{eqnarray*} \frac{\hat{p}_{0}}{1- \hat{p}_{0}} & = & \exp(-1.09) \\ & = & 0.34 \end{eqnarray*}$$`
- That is, the exponential of the intercept is the odds when `\(x=0\)`, which is sometimes referred to as the **baseline odds**.

---
### Estimation

- For mothers who smoke during pregnancy, `\(x=1\)`, the odds are
`$$\begin{eqnarray*} \frac{\hat{p}_{1}}{1- \hat{p}_{1}} & = & \exp(-1.09 + 0.7)\\ & = & \exp(-1.09) \exp(0.7)\\ & = & 0.68 \end{eqnarray*}$$`
- As we can see, corresponding to a one-unit increase in `\(x\)` from `\(x=0\)` (non-smoking) to `\(x=1\)` (smoking), the odds increase multiplicatively by the exponential of the regression coefficient.

---
### Interpretation

- Note that
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{1}}{1- \hat{p}_{1}}}{\frac{\hat{p}_{0}}{1- \hat{p}_{0}}} & = & \frac{\exp(-1.09) \exp(0.7)}{\exp(-1.09)} = \exp(0.7) = 2.01 \end{eqnarray*}$$`
- We can interpret the exponential of the regression coefficient as the odds ratio of having low birthweight babies for smoking mothers compared to non-smoking mothers.
- Here, the estimated odds ratio is `\(\exp(0.7) = 2.01\)`, so the odds of having a low birthweight baby almost double for smoking mothers compared to non-smoking mothers.
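---
### Interpretation

- As a quick check (a sketch; the values in the comments are implied by the estimates above, not a captured output), exponentiating the fitted coefficients returns the baseline odds and the odds ratio in one step:

```r
exp(coef(fit))
# (Intercept)    smoke1
#  approx 0.34   approx 2.02  -- the baseline odds and the odds ratio
```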
---
### Interpretation

- In general,
  - if `\(b>0\)`, then `\(\exp(b) > 1\)`, so the odds increase as `\(X\)` increases;
  - if `\(b<0\)`, then `\(0 < \exp(b) < 1\)`, so the odds decrease as `\(X\)` increases;
  - if `\(b=0\)`, the odds ratio is 1, so the odds do not change with `\(X\)` according to the assumed model.

---
### Prediction

- We can use logistic regression models to predict the unknown value of the response variable `\(Y\)` given the value of the predictor variable `\(X\)`:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + bx)}{1 + \exp(a + bx)} \end{eqnarray*}$$`
- For the above example,
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09 + 0.70x)}{1 + \exp(-1.09 + 0.70x)} \end{eqnarray*}$$`

---
### Prediction

- Therefore, the estimated probability of having a low birthweight baby for non-smoking mothers, `\(x=0\)`, is
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09)}{1 + \exp(-1.09)} = 0.25 \end{eqnarray*}$$`
- This probability increases for mothers who smoke during pregnancy, `\(x=1\)`:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09 + 0.7)}{1 + \exp(-1.09 + 0.7)} = 0.40 \end{eqnarray*}$$`
- That is, the estimated risk of having a low birthweight baby increases by 60% (from 0.25 to 0.40) if a mother smokes during her pregnancy.
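---
### Prediction

- The same probabilities can be obtained with `predict()` using `type = "response"`, which applies the logistic function for us. A minimal sketch, assuming `fit` is still the `low ~ smoke` model above (the data frame name is illustrative):

```r
new_mothers <- data.frame(smoke = factor(c(0, 1), levels = c(0, 1)))
predict(fit, newdata = new_mothers, type = "response")
# approximately 0.25 for non-smoking mothers and 0.40 for smoking mothers
```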
---
### Logistic Regression with One Numerical Predictor

- For the most part, we follow similar steps to fit the model, estimate regression parameters, perform hypothesis testing, and predict unknown values of the response variable.
- As an example, we want to investigate the relationship between having a low birthweight baby, `\(Y\)`, and mother's age at the time of pregnancy, `\(X\)`.

---
### Logistic Regression with One Numerical Predictor

```r
fit <- glm(low ~ age, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ age, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)          age
##     0.38458     -0.05115
## 
## Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
## Null Deviance:       234.7
## Residual Deviance: 231.9    AIC: 235.9
```

---
### Logistic Regression with One Numerical Predictor

- Finding confidence intervals and performing hypothesis tests remain as before, so we focus on prediction and interpreting the point estimates.
- For the above example, the point estimates for the regression parameters are `\(a=0.38\)` and `\(b=-0.05\)`.
- While the intercept is the log odds when `\(x=0\)`, it is not reasonable to interpret its exponential as the baseline odds, since a mother's age cannot be zero.

---
### Logistic Regression with One Numerical Predictor

- To interpret `\(b\)`, consider mothers who are 20 years old at the time of pregnancy:
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}_{20}}{1- \hat{p}_{20}} \Big) & = & 0.38 - 0.05 \times 20\\ \frac{\hat{p}_{20}}{1- \hat{p}_{20}} & = & \exp(0.38 - 0.05 \times 20)\\ & = & \exp(0.38) \exp(- 0.05 \times 20) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- For mothers who are one year older (i.e., a one-unit increase in age), we have
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}_{21}}{1- \hat{p}_{21}} \Big) & = & 0.38 - 0.05 \times 21\\ \frac{\hat{p}_{21}}{1- \hat{p}_{21}} & = & \exp(0.38 - 0.05 \times 21)\\ & = & \exp(0.38) \exp(- 0.05 \times 21) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- The odds ratio comparing 21-year-old mothers to 20-year-old mothers is
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{21}}{1- \hat{p}_{21}}}{\frac{\hat{p}_{20}}{1- \hat{p}_{20}}} & = & \frac{ \exp(0.38) \exp(- 0.05 \times 21)}{ \exp(0.38) \exp(- 0.05 \times 20)}\\ & = & \exp(- 0.05 \times 21 + 0.05 \times 20)\\ & = & \exp(- 0.05) \end{eqnarray*}$$`
- Therefore, `\(\exp(b)\)` is the estimated odds ratio comparing 21-year-old mothers to 20-year-old mothers.

---
### Logistic Regression with One Numerical Predictor

- In general, `\(\exp(b)\)` is the estimated odds ratio for comparing two subpopulations whose predictor values are `\(x+1\)` and `\(x\)`:
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{x+1}}{1- \hat{p}_{x+1}}}{\frac{\hat{p}_{x}}{1- \hat{p}_{x}}} & = & \exp(b) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- As before, we can use the estimated regression parameters to find `\(\hat{p}\)` and predict the unknown value of the response variable:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + bx)}{1 + \exp(a + bx)} = \frac{\exp(0.38 - 0.05x)}{1 + \exp(0.38 - 0.05x)}. \end{eqnarray*}$$`
- For example, for mothers who are 20 years old at the time of pregnancy, the estimated probability of having a low birthweight baby is
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(0.38 - 0.05 \times 20)}{1 + \exp(0.38 - 0.05 \times 20)} = 0.35. \end{eqnarray*}$$`

---
### Logistic Regression with Multiple Variables

- Including multiple explanatory variables (predictors) in a logistic regression model is straightforward.
- Similar to linear regression models, we specify the model formula by entering the response variable on the left side of the "~" symbol and the explanatory variables (separated by "+" signs) on the right side.

---
### Logistic Regression with Multiple Variables

```r
fit <- glm(low ~ age + smoke, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)          age       smoke1
##     0.06091     -0.04978      0.69185
## 
## Degrees of Freedom: 188 Total (i.e. Null);  186 Residual
## Null Deviance:       234.7
## Residual Deviance: 227.3    AIC: 233.3
```

---
### Logistic Regression with Multiple Variables

```r
confint(fit)
```

```
##                   2.5 %     97.5 %
## (Intercept) -1.41611481 1.56446444
## age         -0.11450032 0.01132921
## smoke1       0.06214062 1.32718026
```

---
### Logistic Regression with Multiple Variables

```r
summary(fit)
```

```
## 
## Call:
## glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.06091    0.75732   0.080   0.9359  
## age         -0.04978    0.03197  -1.557   0.1195  
## smoke1       0.69185    0.32181   2.150   0.0316 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 227.28  on 186  degrees of freedom
## AIC: 233.28
## 
## Number of Fisher Scoring iterations: 4
```
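---
### Logistic Regression with Multiple Variables

- The prediction machinery carries over unchanged. A minimal sketch, assuming `fit` is the `low ~ age + smoke` model above (the data frame name is illustrative), comparing 20-year-old non-smoking and smoking mothers:

```r
new_mothers <- data.frame(age   = c(20, 20),
                          smoke = factor(c(0, 1), levels = c(0, 1)))
predict(fit, newdata = new_mothers, type = "response")
# approximately 0.28 for a 20-year-old non-smoker
# and 0.44 for a 20-year-old smoker, based on the estimates above
```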