class: title-slide

<br>
<br>

.right-panel[

# Logistic Regression in R
## Zahra Moslemi
Adapted from slides by Mine Dogucu and Sam Behseta

]

<style type="text/css">
body, td {
  font-size: 14px;
}
code.r {
  font-size: 20px;
}
pre {
  font-size: 20px
}
</style>

---

* Remember from the lecture that we are fitting a regression model with a binary outcome.

* As such, the model is as follows:

`$$\begin{eqnarray*}
\log \Big(\frac{\hat{p}}{1- \hat{p}} \Big) & = & a + b_{1}x_1 + \ldots + b_{q}x_{q}
\end{eqnarray*}$$`

* The left-hand side of this model is the logarithm of the odds of success.

* Therefore, the probability of success `\(\hat{p}\)` can be written as follows:

`$$\begin{eqnarray*}
\hat{p} & = & \frac{\exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})}{1 + \exp(a + b_{1}x_1 + \ldots + b_{q}x_{q})}
\end{eqnarray*}$$`

* This means that once we estimate the coefficients of the model, we can estimate the probability of success for the outcome of interest (we will verify this numerically after fitting the model).

---

* Let's revisit the Alzheimer's data set and consider the task of building a logistic regression model with diagnosis as the response variable and age, education, naccicv, and female as the predictors.

* Let's begin by transforming the response into a new feature with two categories: no symptoms (0) versus mild or strong symptoms (1):

```r
alzheimer_data <- alzheimer_data %>%
  mutate(diag = ifelse(diagnosis %in% c(1, 2), "1", "0"),
         diag = as.factor(diag))
```

* Fitting a logistic regression model in R is straightforward. Before we do that, note that female is a binary variable as well, so we should make sure R recognizes it as a factor:

```r
alzheimer_data <- alzheimer_data %>%
  mutate(female = as.factor(female))
```

---

```r
logistic_model <- glm(diag ~ educ + age + naccicv + female,
                      family = binomial, data = alzheimer_data)
summary(logistic_model)
```

```
## 
## Call:
## glm(formula = diag ~ educ + age + naccicv + female, family = binomial, 
##     data = alzheimer_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.1637929  0.6481630  -1.796   0.0726 .  
## educ        -0.0707890  0.0128192  -5.522 3.35e-08 ***
## age          0.0437751  0.0039941  10.960  < 2e-16 ***
## naccicv     -0.0004503  0.0003726  -1.209   0.2268    
## female1     -0.8928635  0.0989969  -9.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3692.7  on 2699  degrees of freedom
## Residual deviance: 3395.6  on 2695  degrees of freedom
## AIC: 3405.6
## 
## Number of Fisher Scoring iterations: 4
```

---

```r
library(gtsummary)  # provides tbl_regression() and style_number()

glm(diag ~ educ + age + naccicv + female,
    family = binomial, data = alzheimer_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3),
                 exponentiate = TRUE)
```
| Characteristic | OR    | 95% CI       | p-value |
|:---------------|:-----:|:------------:|:-------:|
| educ           | 0.932 | 0.908, 0.955 | <0.001  |
| age            | 1.045 | 1.037, 1.053 | <0.001  |
| naccicv        | 1.000 | 0.999, 1.000 | 0.2     |
| female         |       |              |         |
| &emsp;0        | —     | —            |         |
| &emsp;1        | 0.409 | 0.337, 0.497 | <0.001  |

OR = Odds Ratio, CI = Confidence Interval
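---

* The OR column above is simply the exponentiated coefficients of the fitted model, and those same coefficients let us recover `\(\hat{p}\)` from the formula on the first slide. The sketch below is illustrative only; the hypothetical subject's predictor values (16 years of education, age 75, an ICV of 1300, female) are made up, not taken from the data:

```r
# Odds ratios: exponentiated coefficients (matches the OR column above)
exp(coef(logistic_model))

# A hypothetical subject -- values chosen purely for illustration
new_subject <- data.frame(educ = 16, age = 75, naccicv = 1300,
                          female = factor("1", levels = c("0", "1")))

# Log-odds (linear predictor): a + b1*x1 + ... + bq*xq
log_odds <- predict(logistic_model, newdata = new_subject, type = "link")

# Probability of success: exp(log-odds) / (1 + exp(log-odds))
exp(log_odds) / (1 + exp(log_odds))

# Equivalently, predict() can return the probability directly
predict(logistic_model, newdata = new_subject, type = "response")
```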
---

### Logistic Regression Model Evaluation

To split the data into training and validation sets, we can use the `initial_split()` function from the rsample package. Here's an example of how you can split the data (a stratified variant of this split is sketched a few slides ahead):

```r
library(rsample)

set.seed(0)
data_split <- initial_split(alzheimer_data, prop = 0.7)
train_data <- training(data_split)
test_data <- testing(data_split)
```

#### As we saw, the next step after splitting the data is to train the model on the training data:

```r
logistic_model2 <- glm(diag ~ educ + age + naccicv + female,
                       family = binomial, data = train_data)
summary(logistic_model2)
```

```
## 
## Call:
## glm(formula = diag ~ educ + age + naccicv + female, family = binomial, 
##     data = train_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.9952282  0.7730543  -1.287    0.198    
## educ        -0.0759332  0.0151489  -5.012 5.37e-07 ***
## age          0.0430518  0.0047803   9.006  < 2e-16 ***
## naccicv     -0.0004883  0.0004437  -1.101    0.271    
## female1     -0.9058400  0.1191447  -7.603 2.90e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2578.5  on 1888  degrees of freedom
## Residual deviance: 2371.0  on 1884  degrees of freedom
## AIC: 2381
## 
## Number of Fisher Scoring iterations: 4
```

---

```r
glm(diag ~ educ + age + naccicv + female,
    family = binomial, data = train_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3),
                 exponentiate = TRUE)
```
| Characteristic | OR    | 95% CI       | p-value |
|:---------------|:-----:|:------------:|:-------:|
| educ           | 0.927 | 0.900, 0.955 | <0.001  |
| age            | 1.044 | 1.034, 1.054 | <0.001  |
| naccicv        | 1.000 | 0.999, 1.000 | 0.3     |
| female         |       |              |         |
| &emsp;0        | —     | —            |         |
| &emsp;1        | 0.404 | 0.320, 0.510 | <0.001  |

OR = Odds Ratio, CI = Confidence Interval
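---

* One caveat about the random split above: the proportion of 0's and 1's may differ slightly between the training and test sets. Below is a minimal sketch of a stratified alternative using the `strata` argument of `initial_split()`, which samples within each level of the response so both sets keep roughly the same class balance (the object names here are illustrative):

```r
library(rsample)

set.seed(0)
# Stratified split: sample within each level of diag
data_split_strat <- initial_split(alzheimer_data, prop = 0.7, strata = diag)
train_strat <- training(data_split_strat)
test_strat  <- testing(data_split_strat)

# Check that class proportions are similar in the two sets
prop.table(table(train_strat$diag))
prop.table(table(test_strat$diag))
```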
---

* Next, we test the model on the validation set. This means calculating the probability of success for each subject in the test set:

```r
pred_prob <- logistic_model2 %>%
  predict(test_data, type = "response")
```

* We are now ready to calculate the accuracy of our trained model. To accomplish this, we translate every probability of success above 0.5 to a 1 and every other probability to a 0, then compute the proportion of correct predictions (1's correctly predicted as 1's and 0's correctly predicted as 0's). A confusion matrix that breaks these predictions down further is sketched on the next slide.

```r
predicted.classes <- ifelse(pred_prob > 0.5, "1", "0")
mean(predicted.classes == test_data$diag)
```

```
## [1] 0.6510481
```

* This model yields a 65% accuracy rate!
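---

* Accuracy alone does not show *which* kinds of errors the model makes. As a quick follow-up sketch, using only the objects defined above, a confusion matrix cross-tabulates predicted against observed classes:

```r
# Rows: predicted class; columns: observed class
conf_mat <- table(predicted = predicted.classes,
                  observed  = test_data$diag)
conf_mat

# Accuracy recomputed from the diagonal of the confusion matrix
sum(diag(conf_mat)) / sum(conf_mat)
```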