class: title-slide <br> <br>

.right-panel[
# Logistic Regression
### Jessica Jaynes
]

<style type="text/css">
body, td { font-size: 14px; }
code.r { font-size: 20px; }
pre { font-size: 20px; }
</style>

---
### Introduction

- For linear regression models, the response variable, `\(Y\)`, is assumed to be a real-valued continuous random variable.
- Now consider situations where the response variable is a binary random variable (e.g., disease status).
- For such problems, it is common to use **logistic regression** instead:
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}}{1- \hat{p}} \Big) & = & a + b_{1}x_1 + \ldots + b_{q}x_{q} \end{eqnarray*}$$`

---
### Logistic regression

- Note that for binary random variables, we have `\(p = P(Y=1|X)\)`; that is, `\(p\)` is the probability of the outcome of interest (denoted as 1) given the explanatory variables.
- The term `\(\frac{\hat{p}}{1- \hat{p}}\)` is called the **odds** of `\(Y=1\)`.
- The term `\(\log \Big(\frac{\hat{p}}{1- \hat{p}} \Big)\)`, i.e., the log of the odds, is called the **logit** function.
- Although `\(p\)` is a real number between 0 and 1, its logit transformation can be any real number from `\(-\infty\)` to `\(+\infty\)`.

---
### Logistic regression

- We can exponentiate both sides:
`$$\begin{eqnarray*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(a + b_{1}x_1 + \ldots + b_{q}x_{q}) \end{eqnarray*}$$`
- Then, we can solve for `\(\hat{p}\)` using the **logistic function**:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})}{1 + \exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})} \end{eqnarray*}$$`
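---
### The logit and logistic functions in R

- Before fitting any models, note that base R already implements this pair of transformations: `qlogis()` is the logit function and `plogis()` is the logistic function. The sketch below (illustrative values only, not part of the analysis that follows) checks that each inverts the other.

```r
# logit: maps a probability in (0, 1) to the whole real line
qlogis(0.25)       # log(0.25 / 0.75) = -1.0986...

# logistic: maps any real number back into (0, 1)
plogis(-1.0986)    # approximately 0.25

# applying the logistic function to a linear predictor a + b*x
a <- -1; b <- 0.7; x <- 1   # hypothetical coefficient values
plogis(a + b * x)  # equals exp(a + b*x) / (1 + exp(a + b*x))
```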
---
### Logistic Regression with One Binary Predictor

- As an example, we use the `birthwt` data set to model the relationship between having low birthweight babies (a binary variable), `\(Y\)`, and smoking during pregnancy, `\(X\)`.
- The binary variable `low` identifies low birthweight babies (`low = 1` for low birthweight babies, and 0 otherwise).
- The binary variable `smoke` identifies mothers who smoked during pregnancy (`smoke = 1` for smoking during pregnancy, and 0 otherwise).

---
### Generalized linear model (glm) in R

- We can use the `glm()` function in R to fit a logistic regression model. First, we load the data and convert the categorical variables to factors. Note that `%>%`, `mutate()`, `across()`, and `glimpse()` come from **dplyr**, so it must be loaded along with **MASS**.

```r
library(MASS)
library(dplyr)

data(birthwt)
birthwt <- birthwt %>% mutate(across(c(low, smoke, race, ht, ui), factor))
glimpse(birthwt)
```

```
## Rows: 189
## Columns: 10
## $ low   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age   <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
## $ lwt   <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
## $ race  <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
## $ smoke <fct> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
## $ ptl   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ht    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui    <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
## $ ftv   <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
## $ bwt   <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
```

---
### Generalized linear model (glm) in R

```r
fit <- glm(low ~ smoke, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)       smoke1
##     -1.0871       0.7041
## 
## Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
## Null Deviance:       234.7
## Residual Deviance: 229.8    AIC: 233.8
```

---
### Generalized linear model (glm) in R

- We can use the `confint()` function to obtain confidence intervals for the regression parameters.

```r
confint(fit)
```

```
##                  2.5 %    97.5 %
## (Intercept) -1.5243118 -0.679205
## smoke1       0.0786932  1.335154
```

---
### Generalized linear model (glm) in R

- The `summary()` function provides standard errors and hypothesis tests for the regression parameters.

```r
summary(fit)
```

```
## 
## Call:
## glm(formula = low ~ smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
## smoke1        0.7041     0.3196   2.203   0.0276 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 229.80  on 187  degrees of freedom
## AIC: 233.8
## 
## Number of Fisher Scoring iterations: 4
```

---
### Estimation

- For the above example, the estimated values of the intercept `\(\alpha\)` and the regression coefficient `\(\beta\)` are `\(a=-1.09\)` and `\(b=0.70\)`, respectively.
- Therefore,
`$$\begin{eqnarray*} \frac{\hat{p}}{1- \hat{p}} & = & \exp(-1.09 + 0.70x) \end{eqnarray*}$$`
- Here, `\(\hat{p}\)` is the estimated probability of having a low birthweight baby for a given `\(x\)`.
- The left-hand side of the above equation is the estimated odds of having a low birthweight baby.

---
### Estimation

- For non-smoking mothers, `\(x=0\)`, the odds of having a low birthweight baby are
`$$\begin{eqnarray*} \frac{\hat{p}_{0}}{1- \hat{p}_{0}} & = & \exp(-1.09) \\ & = & 0.34 \end{eqnarray*}$$`
- That is, the exponential of the intercept is the odds when `\(x=0\)`, which is sometimes referred to as the **baseline odds**.

---
### Estimation

- For mothers who smoke during pregnancy, `\(x=1\)`, the odds are
`$$\begin{eqnarray*} \frac{\hat{p}_{1}}{1- \hat{p}_{1}} & = & \exp(-1.09 + 0.7)\\ & = & \exp(-1.09) \exp(0.7)\\ & = & 0.68 \end{eqnarray*}$$`
- As we can see, corresponding to a one-unit increase in `\(x\)` from `\(x=0\)` (non-smoking) to `\(x=1\)` (smoking), the odds increase multiplicatively by the exponential of the regression coefficient.

---
### Interpretation

- Note that
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{1}}{1- \hat{p}_{1}}}{\frac{\hat{p}_{0}}{1- \hat{p}_{0}}} & = & \frac{\exp(-1.09) \exp(0.7)}{\exp(-1.09)} = \exp(0.7) = 2.01 \end{eqnarray*}$$`
- We can interpret the exponential of the regression coefficient as the odds ratio of having low birthweight babies for smoking mothers compared to non-smoking mothers.
- Here, the estimated odds ratio is `\(\exp(0.7) = 2.01\)`, so the odds of having a low birthweight baby almost double for smoking mothers compared to non-smoking mothers.
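---
### Interpretation

- As a quick check (a sketch; the values in the comments are implied by the estimates above, not a captured output), exponentiating the fitted coefficients returns the baseline odds and the odds ratio in one step:

```r
exp(coef(fit))
# (Intercept)    smoke1
#  approx 0.34   approx 2.02  -- the baseline odds and the odds ratio
```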
---
### Interpretation

- In general,
  - if `\(b>0\)`, then `\(\exp(b) > 1\)`, so the odds increase as `\(X\)` increases;
  - if `\(b<0\)`, then `\(0 < \exp(b) < 1\)`, so the odds decrease as `\(X\)` increases;
  - if `\(b=0\)`, the odds ratio is 1, so the odds do not change with `\(X\)` according to the assumed model.

---
### Prediction

- We can use logistic regression models to predict the unknown value of the response variable `\(Y\)` given the value of the predictor variable `\(X\)`:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + bx)}{1 + \exp(a + bx)} \end{eqnarray*}$$`
- For the above example,
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09 + 0.70x)}{1 + \exp(-1.09 + 0.70x)} \end{eqnarray*}$$`

---
### Prediction

- Therefore, the estimated probability of having a low birthweight baby for non-smoking mothers, `\(x=0\)`, is
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09)}{1 + \exp(-1.09)} = 0.25 \end{eqnarray*}$$`
- This probability increases for mothers who smoke during pregnancy, `\(x=1\)`:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(-1.09 + 0.7)}{1 + \exp(-1.09 + 0.7)} = 0.40 \end{eqnarray*}$$`
- That is, the estimated risk of having a low birthweight baby increases by 60% (from 0.25 to 0.40) if a mother smokes during her pregnancy.
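---
### Prediction

- The same probabilities can be obtained with `predict()` using `type = "response"`, which applies the logistic function for us. A minimal sketch, assuming `fit` is still the `low ~ smoke` model above (the data frame name is illustrative):

```r
new_mothers <- data.frame(smoke = factor(c(0, 1), levels = c(0, 1)))
predict(fit, newdata = new_mothers, type = "response")
# approximately 0.25 for non-smoking mothers and 0.40 for smoking mothers
```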
---
### Logistic Regression with One Numerical Predictor

- For the most part, we follow similar steps to fit the model, estimate regression parameters, perform hypothesis testing, and predict unknown values of the response variable.
- As an example, we want to investigate the relationship between having a low birthweight baby, `\(Y\)`, and mother's age at the time of pregnancy, `\(X\)`.

---
### Logistic Regression with One Numerical Predictor

```r
fit <- glm(low ~ age, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ age, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)          age
##     0.38458     -0.05115
## 
## Degrees of Freedom: 188 Total (i.e. Null);  187 Residual
## Null Deviance:       234.7
## Residual Deviance: 231.9    AIC: 235.9
```

---
### Logistic Regression with One Numerical Predictor

- Finding confidence intervals and performing hypothesis tests remain as before, so we focus on prediction and interpreting the point estimates.
- For the above example, the point estimates for the regression parameters are `\(a=0.38\)` and `\(b=-0.05\)`.
- While the intercept is the log odds when `\(x=0\)`, it is not reasonable to interpret its exponential as the baseline odds, since a mother's age cannot be zero.

---
### Logistic Regression with One Numerical Predictor

- To interpret `\(b\)`, consider mothers who are 20 years old at the time of pregnancy:
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}_{20}}{1- \hat{p}_{20}} \Big) & = & 0.38 - 0.05 \times 20\\ \frac{\hat{p}_{20}}{1- \hat{p}_{20}} & = & \exp(0.38 - 0.05 \times 20)\\ & = & \exp(0.38) \exp(- 0.05 \times 20) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- For mothers who are one year older (i.e., a one-unit increase in age), we have
`$$\begin{eqnarray*} \log \Big(\frac{\hat{p}_{21}}{1- \hat{p}_{21}} \Big) & = & 0.38 - 0.05 \times 21\\ \frac{\hat{p}_{21}}{1- \hat{p}_{21}} & = & \exp(0.38 - 0.05 \times 21)\\ & = & \exp(0.38) \exp(- 0.05 \times 21) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- The odds ratio comparing 21-year-old mothers to 20-year-old mothers is
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{21}}{1- \hat{p}_{21}}}{\frac{\hat{p}_{20}}{1- \hat{p}_{20}}} & = & \frac{ \exp(0.38) \exp(- 0.05 \times 21)}{ \exp(0.38) \exp(- 0.05 \times 20)}\\ & = & \exp(- 0.05 \times 21 + 0.05 \times 20)\\ & = & \exp(- 0.05) \end{eqnarray*}$$`
- Therefore, `\(\exp(b)\)` is the estimated odds ratio comparing 21-year-old mothers to 20-year-old mothers.

---
### Logistic Regression with One Numerical Predictor

- In general, `\(\exp(b)\)` is the estimated odds ratio for comparing two subpopulations whose predictor values are `\(x+1\)` and `\(x\)`:
`$$\begin{eqnarray*} \frac{\frac{\hat{p}_{x+1}}{1- \hat{p}_{x+1}}}{\frac{\hat{p}_{x}}{1- \hat{p}_{x}}} & = & \exp(b) \end{eqnarray*}$$`

---
### Logistic Regression with One Numerical Predictor

- As before, we can use the estimated regression parameters to find `\(\hat{p}\)` and predict the unknown value of the response variable:
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(a + bx)}{1 + \exp(a + bx)} = \frac{\exp(0.38 - 0.05x)}{1 + \exp(0.38 - 0.05x)}. \end{eqnarray*}$$`
- For example, for mothers who are 20 years old at the time of pregnancy, the estimated probability of having a low birthweight baby is
`$$\begin{eqnarray*} \hat{p} & = & \frac{\exp(0.38 - 0.05 \times 20)}{1 + \exp(0.38 - 0.05 \times 20)} = 0.35. \end{eqnarray*}$$`

---
### Logistic Regression with Multiple Variables

- Including multiple explanatory variables (predictors) in a logistic regression model is straightforward.
- Similar to linear regression models, we specify the model formula by entering the response variable on the left side of the "~" symbol and the explanatory variables (separated by "+" signs) on the right side.

---
### Logistic Regression with Multiple Variables

```r
fit <- glm(low ~ age + smoke, family = 'binomial', data = birthwt)
fit
```

```
## 
## Call:  glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
## (Intercept)          age       smoke1
##     0.06091     -0.04978      0.69185
## 
## Degrees of Freedom: 188 Total (i.e. Null);  186 Residual
## Null Deviance:       234.7
## Residual Deviance: 227.3    AIC: 233.3
```

---
### Logistic Regression with Multiple Variables

```r
confint(fit)
```

```
##                   2.5 %     97.5 %
## (Intercept) -1.41611481 1.56446444
## age         -0.11450032 0.01132921
## smoke1       0.06214062 1.32718026
```

---
### Logistic Regression with Multiple Variables

```r
summary(fit)
```

```
## 
## Call:
## glm(formula = low ~ age + smoke, family = "binomial", data = birthwt)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.06091    0.75732   0.080   0.9359  
## age         -0.04978    0.03197  -1.557   0.1195  
## smoke1       0.69185    0.32181   2.150   0.0316 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 227.28  on 186  degrees of freedom
## AIC: 233.28
## 
## Number of Fisher Scoring iterations: 4
```
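---
### Logistic Regression with Multiple Variables

- The prediction machinery carries over unchanged. A minimal sketch, assuming `fit` is the `low ~ age + smoke` model above (the data frame name is illustrative), comparing 20-year-old non-smoking and smoking mothers:

```r
new_mothers <- data.frame(age   = c(20, 20),
                          smoke = factor(c(0, 1), levels = c(0, 1)))
predict(fit, newdata = new_mothers, type = "response")
# approximately 0.28 for a 20-year-old non-smoker
# and 0.44 for a 20-year-old smoker, based on the estimates above
```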