class: title-slide

<br>
<br>

.right-panel[

# Linear Regression in R
## Dr. Mine Dogucu

]

<style type="text/css">
body, td {
   font-size: 14px;
}
code.r{
  font-size: 20px;
}
pre {
  font-size: 20px
}
</style>

---

class:inverse middle

.font180[Linear Regression]

---

class: middle

* Now that you have learned about regression models, we will build a multiple regression model for predicting the left hippocampus volume of the brain, labeled **lhippo**, using two predictors, **age** and **educ**.

* Remember that, in general, a multiple linear regression model with `\(p\)` explanatory variables can be written as follows:

`$$\begin{equation*} \hat{y} = a + b_{1}x_{1} + b_{2}x_{2} + \cdots + b_{p}x_{p}. \end{equation*}$$`

* The left-hand side of this model is the response variable, a continuous numerical variable.

---

class: middle

Recall that the left hippocampus volume **lhippo** is likely to shrink as Alzheimer's disease becomes more severe. Also, from Yueqi's introduction, while the progression of the disease is a function of age, it is possible that education has a reverse (protective) effect on the progression of the disease.

To fit linear models, all we need to do is apply the **lm()** function in R.

We begin by plotting the response against each predictor separately.

---

``` r
ggplot(data = alzheimer_data) +
  geom_point(aes(x = age, y = lhippo), color = "red") +
  labs(x = "Age", y = "left hippo") +
  theme_minimal()
```

``` r
ggplot(data = alzheimer_data) +
  geom_point(aes(x = educ, y = lhippo), color = "red") +
  labs(x = "Education", y = "left hippo") +
  theme_minimal()
```

---

<img src="Lab-01-linear-regression_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

Here is the regression of lhippo on age:

``` r
lm_model <- lm(lhippo ~ age, data = alzheimer_data)
summary(lm_model)
```

```
## 
## Call:
## lm(formula = lhippo ~ age, data = alzheimer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58855 -0.28598  0.01999  0.31504  1.58641 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.0639626  0.0543632   74.76   <2e-16 ***
## age         -0.0149051  0.0007657  -19.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4593 on 2698 degrees of freedom
## Multiple R-squared:  0.1231,  Adjusted R-squared:  0.1228 
## F-statistic: 378.9 on 1 and 2698 DF,  p-value: < 2.2e-16
```

---

``` r
lm(lhippo ~ age, data = alzheimer_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3))
```
| Characteristic | Beta   | 95% CI         | p-value |
|:---------------|:-------|:---------------|:--------|
| age            | -0.015 | -0.016, -0.013 | <0.001  |

CI = Confidence Interval
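---

The 95% CI column in the table can also be obtained directly from the fitted model. As a minimal sketch (assuming **lm_model** is still the age-only fit from the previous slides), base R's **coef()** and **confint()** return the estimates and their confidence intervals:

``` r
# Coefficient estimates a and b from the fitted model
coef(lm_model)

# 95% confidence intervals for the coefficients,
# matching the 95% CI column in the table above
confint(lm_model, level = 0.95)
```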
---

Let's see the fitted line:

``` r
ggplot(data = alzheimer_data, aes(x = age, y = lhippo)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(x = "Age", y = "Left Hippocampus Volume")
```

<img src="Lab-01-linear-regression_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

Here is the regression of lhippo on education:

``` r
lm_model <- lm(lhippo ~ educ, data = alzheimer_data)
summary(lm_model)
```

```
## 
## Call:
## lm(formula = lhippo ~ educ, data = alzheimer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.45617 -0.30433  0.01738  0.33743  1.77443 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.647647   0.042951  61.644   <2e-16 ***
## educ        0.024351   0.002743    8.877  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4835 on 2698 degrees of freedom
## Multiple R-squared:  0.02838,  Adjusted R-squared:  0.02802 
## F-statistic:  78.8 on 1 and 2698 DF,  p-value: < 2.2e-16
```

---

``` r
lm(lhippo ~ educ, data = alzheimer_data) %>%
  tbl_regression()
```
| Characteristic | Beta | 95% CI     | p-value |
|:---------------|:-----|:-----------|:--------|
| educ           | 0.02 | 0.02, 0.03 | <0.001  |

CI = Confidence Interval
---

The default table rounds the estimates heavily. To display more decimal places, we can format the estimates with **style_number()**:

``` r
lm(lhippo ~ educ, data = alzheimer_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3))
```
| Characteristic | Beta  | 95% CI       | p-value |
|:---------------|:------|:-------------|:--------|
| educ           | 0.024 | 0.019, 0.030 | <0.001  |

CI = Confidence Interval
Alternatively, we can use **round()**; note that, unlike **style_number()**, it drops trailing zeros (0.03 rather than 0.030):

``` r
lm(lhippo ~ educ, data = alzheimer_data) %>%
  tbl_regression(estimate_fun = function(x) round(x, digits = 3))
```
| Characteristic | Beta  | 95% CI      | p-value |
|:---------------|:------|:------------|:--------|
| educ           | 0.024 | 0.019, 0.03 | <0.001  |

CI = Confidence Interval
---

Let's see the fitted line:

``` r
ggplot(data = alzheimer_data, aes(x = educ, y = lhippo)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(x = "Education", y = "Left Hippocampus Volume")
```

<img src="Lab-01-linear-regression_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

Here is the regression of lhippo on age and education:

``` r
lm_model <- lm(lhippo ~ age + educ, data = alzheimer_data)
summary(lm_model)
```

```
## 
## Call:
## lm(formula = lhippo ~ age + educ, data = alzheimer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59525 -0.28746  0.01681  0.31416  1.54719 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7348265  0.0709453  52.644  < 2e-16 ***
## age         -0.0142527  0.0007643 -18.649  < 2e-16 ***
## educ         0.0185428  0.0026010   7.129 1.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4551 on 2697 degrees of freedom
## Multiple R-squared:  0.1394,  Adjusted R-squared:  0.1387 
## F-statistic: 218.4 on 2 and 2697 DF,  p-value: < 2.2e-16
```

---

``` r
lm(lhippo ~ age + educ, data = alzheimer_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3))
```
| Characteristic | Beta   | 95% CI         | p-value |
|:---------------|:-------|:---------------|:--------|
| age            | -0.014 | -0.016, -0.013 | <0.001  |
| educ           | 0.019  | 0.013, 0.024   | <0.001  |

CI = Confidence Interval
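---

Before moving on, here is a short sketch of how the fitted equation can be used for prediction; the **age** and **educ** values below are hypothetical, chosen only for illustration:

``` r
# Refit the two-predictor model
lm_model <- lm(lhippo ~ age + educ, data = alzheimer_data)

# A hypothetical new subject: 75 years old, 16 years of education
new_subject <- data.frame(age = 75, educ = 16)

# Plugs the new values into y-hat = a + b1 * age + b2 * educ,
# i.e., approximately 3.735 - 0.014 * 75 + 0.019 * 16
predict(lm_model, newdata = new_subject)
```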
---

class:inverse middle

.font180[Cross-Validation]

---

class: middle

* Let's evaluate the performance of our model by calculating its mean squared error (MSE) or its accuracy, depending on whether we are dealing with a linear regression or a logistic regression model.

* Cross-validation is a commonly used technique for this kind of evaluation. It is an old approach, devised by statisticians Fred Mosteller and John Tukey in 1968.

* The process involves splitting the data into training and validation (or test) sets. We then fit, or train, the model using the training portion of the dataset. The accuracy or MSE is then calculated by comparing the predictions the model makes on the validation set to the actual values.

---

class: middle

* For linear regression models, we typically use the mean squared error (MSE) to measure the quality of predictions. The lower the MSE, the better the model's performance. The MSE is the average squared difference between the predicted values and the true values:

`$$\begin{equation*} \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}. \end{equation*}$$`

* For classification problems (e.g., logistic regression), we use accuracy to assess the model's performance. Accuracy is the proportion of correctly classified instances out of the total number of instances in the validation set. The higher the accuracy, the better the model's performance.

---

To split the data into training and validation sets, we can use the **initial_split()** function from the **rsample** package. Here is an example of how to split the data:

``` r
library(rsample)
set.seed(0)
data_split <- initial_split(alzheimer_data, prop = 0.7)
train_data <- training(data_split)
test_data <- testing(data_split)
```

---

### Linear Regression Model Evaluation:

#### After splitting the data into training and test sets, we fit the model on the training data:

``` r
lm_model <- lm(lhippo ~ age + educ, data = train_data)
summary(lm_model)
```

```
## 
## Call:
## lm(formula = lhippo ~ age + educ, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59635 -0.28271  0.01207  0.30870  1.55249 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7664975  0.0837727  44.961  < 2e-16 ***
## age         -0.0140749  0.0009099 -15.468  < 2e-16 ***
## educ         0.0159653  0.0030390   5.254 1.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4524 on 1886 degrees of freedom
## Multiple R-squared:  0.1329,  Adjusted R-squared:  0.132 
## F-statistic: 144.5 on 2 and 1886 DF,  p-value: < 2.2e-16
```

---

``` r
lm(lhippo ~ age + educ, data = train_data) %>%
  tbl_regression(estimate_fun = function(x) style_number(x, digits = 3))
```
| Characteristic | Beta   | 95% CI         | p-value |
|:---------------|:-------|:---------------|:--------|
| age            | -0.014 | -0.016, -0.012 | <0.001  |
| educ           | 0.016  | 0.010, 0.022   | <0.001  |

CI = Confidence Interval
---

Now, let's use the trained model to make predictions on the validation data:

``` r
predictions <- predict(lm_model, newdata = test_data)
```

---

To evaluate the performance of our model, we can calculate the mean squared error (MSE) between the predicted values and the actual values:

``` r
mean((test_data$lhippo - predictions)^2)
```

```
## [1] 0.2133323
```

Question: What happens to the MSE when we change the ratio at which we split the data?
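---

As a starting point for that question, here is a sketch that re-runs the same pipeline with a different split; the value **prop = 0.5** is just one example to try:

``` r
# Re-split with 50% of the data for training instead of 70%
set.seed(0)
data_split_50 <- initial_split(alzheimer_data, prop = 0.5)
train_50 <- training(data_split_50)
test_50 <- testing(data_split_50)

# Refit the model on the smaller training set
lm_50 <- lm(lhippo ~ age + educ, data = train_50)

# Recompute the test MSE and compare it to the value above
predictions_50 <- predict(lm_50, newdata = test_50)
mean((test_50$lhippo - predictions_50)^2)
```

Try a few different values of **prop** and compare the resulting MSEs.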