class: title-slide
<br> <br>
.right-panel[ <br>
# Examining Relationships
### Jessica Jaynes
]

<style type="text/css">
body, td { font-size: 14px; }
code.r{ font-size: 20px; }
pre { font-size: 20px }
</style>

---
### Objective

- We now discuss exploring and examining possible relationships between two variables.
- We first focus on problems where we are investigating the relationship between one binary categorical variable (e.g., gender) and one numerical variable (e.g., body temperature).
- Next, we examine the relationship between two numerical variables (e.g., years of education and income).
- Finally, we discuss the relationship between two categorical variables (e.g., treatment and survival status).

---
### Relationship Between a Numerical Variable and a Binary Variable

- In these situations, the binary variable typically represents two different groups or two different experimental conditions.
- We treat the binary variable (factor) as the explanatory variable in our analysis.
- The numerical variable, on the other hand, is regarded as the response (target) variable (e.g., body temperature).

---
### Relationship Between a Numerical Variable and a Binary Variable

<img src="img/cabbages1.png" width="45%" height="30%" style="display: block; margin: auto;" />

Dot plots of vitamin C content (numerical) by cultivar (categorical) for the `cabbages` data set from the `MASS` package.

---
### Relationship Between a Numerical Variable and a Binary Variable

A more common way of visualizing the relationship between a numerical variable and a categorical variable is to create boxplots.

<img src="img/boxVitCbyCult.png" width="35%" height="20%" style="display: block; margin: auto;" />

---
### Relationship Between a Numerical Variable and a Binary Variable

- In general, we say that two variables are related if the distribution of one of them changes as the other one varies.
- We can measure changes in the distribution of the numerical variable by obtaining its summary statistics for different levels of the categorical variable.
- It is common to use the __difference of means__ when examining the relationship between a numerical variable and a categorical variable.
- In the above example, the difference of means of vitamin C content is `\(64.4 - 51.5 = 12.9\)` between the two cultivars. Is this difference __significant__?

---
### Two sample t-test

- In general, we can denote the population means of two groups as `\(\mu_{1}\)` and `\(\mu_{2}\)`.
- The null hypothesis indicates that the population means are equal, `\(H_{0}: \mu_{1} = \mu_{2}\)`.
- In contrast, the alternative hypothesis is one of the following:
`$$\begin{array}[t]{l@{\quad}p{6.7cm}} H_{A}: \mu_{1} > \mu_{2} \\ H_{A}: \mu_{1} < \mu_{2} \\ H_{A}: \mu_{1} \ne \mu_{2} \\ \end{array}$$`

---
### Two sample t-test

- We can also express these hypotheses in terms of the *difference* in the means:
`$$\begin{array}[t]{l@{\quad}p{6.7cm}} H_{A}: \mu_{1} - \mu_{2} > 0 \\ H_{A}: \mu_{1} - \mu_{2} < 0 \\ H_{A}: \mu_{1} - \mu_{2} \ne 0 \\ \end{array}$$`
- Then the corresponding null hypothesis is that there is no difference in the population means,
`$$H_{0}: \mu_{1} - \mu_{2} = 0$$`

---
### Two sample t-test

- Previously, we used the sample mean `\(\bar{X}\)` to perform statistical inference regarding the population mean `\(\mu\)`.
- To evaluate our hypothesis regarding the difference between two means, `\(\mu_{1} - \mu_{2}\)`, it is reasonable to examine the difference between the sample means, `\(\bar{X}_{1} - \bar{X}_{2}\)`, as our test statistic.
- For this, we can simply use the `t.test()` function in R.

---
### Two sample t-test


```r
library(MASS)   # provides the cabbages data
t.test(VitC ~ Cult, data=cabbages)
```

```
## 
## Welch Two Sample t-test
## 
## data: VitC by Cult
## t = -6.3909, df = 56.376, p-value = 3.405e-08
## alternative hypothesis: true difference in means between group c39 and group c52 is not equal to 0
## 95 percent confidence interval:
##  -16.94296  -8.85704
## sample estimates:
## mean in group c39 mean in group c52 
##              51.5              64.4
```

---
### Paired t-test

- While we hope that the two samples taken from the population are comparable except for the characteristic that defines the grouping, this is not guaranteed in general.
- To mitigate the influence of other important factors (e.g., age) that are not the focus of our study, we sometimes **pair** (match) each individual in one group with an individual in the other group so that the paired individuals are very similar to each other except for the variable that defines the grouping.
- For example, we might recruit twins and assign one of them to the treatment group and the other one to the placebo group.
- Sometimes, the subjects in the two groups are the same individuals under two different conditions.

---
### Paired t-test

- When the individuals in the two groups are paired, we use the **paired** `\(t\)`-test to take the pairing of the observations between the two groups into account.
- Using the difference, `\(D\)`, between the paired observations, the hypothesis testing problem reduces to a single sample `\(t\)`-test problem.
- In practice, we can use the function `t.test()` with the option `paired=TRUE`.

---
### Paired t-test

- As an example, we use the study of the effect of tobacco smoke on platelet function by Levine (1973).
- In his study, for a group of eleven people, platelet aggregation was measured before and after smoking a cigarette.
- Therefore, observations in the `Before` sample and `After` sample are from the same subjects.
- For each subject, an observation in the `Before` sample is paired with an observation in the `After` sample.

---
### Paired t-test


```r
glimpse(Platelet)
```

```
## Rows: 11
## Columns: 2
## $ Before <int> 25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28
## $ After  <int> 27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43
```

---
### Paired t-test


```r
t.test(Platelet$Before, Platelet$After, paired = TRUE)
```

```
## 
## Paired t-test
## 
## data: Platelet$Before and Platelet$After
## t = -4.2716, df = 10, p-value = 0.001633
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -15.63114  -4.91431
## sample estimates:
## mean difference 
##       -10.27273
```

---
### Paired t-test

See what happens if we fail to account for the pairing of observations!


```r
t.test(Platelet$Before, Platelet$After)
```

```
## 
## Welch Two Sample t-test
## 
## data: Platelet$Before and Platelet$After
## t = -1.4164, df = 19.516, p-value = 0.1724
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -25.425913   4.880458
## sample estimates:
## mean of x mean of y 
##  42.18182  52.45455
```

---
### Two numerical variables

- A simple way to visualize the relationship between two numerical variables is with a __scatterplot__.
- As our first example, we use the `bodyFat` data: http://lib.stat.cmu.edu/datasets/bodyfat.
- Suppose that we are interested in examining the relationship between percent body fat (`siri`) and abdomen circumference (`abdomen`) among men.
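---
### Scatterplot

As a rough sketch of how such a plot can be drawn (not necessarily how the figures on the following slides were produced), we can use the `bodyfat` data frame from the `mfp` package, which is loaded later in these slides, together with base R's `plot()`:


```r
# load the body fat data from the mfp package
data(bodyfat, package = "mfp")

# scatterplot of percent body fat (siri) against abdomen circumference
plot(siri ~ abdomen, data = bodyfat,
     xlab = "Abdomen circumference", ylab = "Percent body fat (siri)")
```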
---
### Scatterplot

The plot suggests that percent body fat tends to increase as abdomen circumference increases.

<img src="img/scatterPercAb.png" width="40%" height="20%" style="display: block; margin: auto;" />

---
### Scatterplot

Next, we examine the relationship between the annual mortality rate due to malignant melanoma for US states and the latitude of their centers.

<img src="img/latMelanoma.png" width="40%" height="20%" style="display: block; margin: auto;" />

---
### Scatterplot

- Using scatterplots, we can detect possible relationships between two numerical variables.
- In the above examples, we can see that changes in one variable coincide with substantial __systematic__ changes (increase or decrease) in the other variable.
- Since the overall relationship can be represented by a straight line, we say that the two variables have a __linear relationship__.
- We say that percent body fat and abdomen circumference have a __positive linear relationship__.
- In contrast, we say that the annual mortality rate due to malignant melanoma and latitude have a __negative linear relationship__.

---
### Correlation

- To quantify the strength and direction of the _linear_ relationship between two numerical variables, we can use __Pearson's correlation coefficient__, `\(r\)`, as a summary statistic.
- The value of `\(r\)` is always between `\(-1\)` and `\(+1\)`, and the relationship is strong when `\(r\)` approaches `\(-1\)` or `\(+1\)`.
- The sign of `\(r\)` shows the direction (negative or positive) of the linear relationship.
- For observed pairs of values, `\((x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\)`,
`$$\begin{eqnarray*} r_{xy} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}- \bar{y})}{(n-1)s_{x}s_{y}} \end{eqnarray*}$$`

---
### Correlation

<img src="img/corr1.png" width="60%" height="20%" style="display: block; margin: auto;" />

---
### Correlation

<img src="img/corr2.png" width="70%" height="20%" style="display: block; margin: auto;" />

---
### Correlation

- We can examine whether the correlation is statistically significant using the `cor.test()` function in R.
- The following code tests whether the correlation coefficient between `siri` and `abdomen` is greater than zero.

---
### Correlation


```r
data(bodyfat, package="mfp")
bodyfat$abdomen = bodyfat$abdomen *.39   # convert abdomen circumference from cm to inches
cor.test(bodyfat$siri, bodyfat$abdomen, alternative = "greater")
```

```
## 
## Pearson's product-moment correlation
## 
## data: bodyfat$siri and bodyfat$abdomen
## t = 22.112, df = 250, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.77505 1.00000
## sample estimates:
##       cor 
## 0.8134323
```

---
### Correlation

Later, we will discuss more advanced models for examining such relationships using linear regression models.

---
### Two categorical variables

- We now discuss techniques for exploring relationships between categorical variables.
- As an example, we consider a five-year study investigating whether regular aspirin intake reduces the risk of cardiovascular disease.
- We usually use __contingency tables__ to summarize such data.

<img src="img/aspirin.png" width="70%" height="20%" style="display: block; margin: auto;" />

---
### Two categorical variables

- Each cell shows the frequency of one possible combination of disease status (heart attack or no heart attack) and experiment group (placebo or aspirin).
- Using these frequencies, we can calculate the __sample proportion__ of people who suffered from a heart attack in each experiment group separately.
- There were 11034 people in the placebo group, of whom 189 had a heart attack. The proportion of people who suffered from a heart attack in the placebo group is therefore `\(p_1 = {189}/{11034} = 0.0171\)`.
- The proportion of people who suffered from a heart attack in the aspirin group is `\(p_2 = {104}/{11037} = 0.0094\)`.

---
### Two categorical variables

- Here, we refer to these proportions as the __risk__ of heart attack for the two groups.
- A substantial difference between the sample proportions of heart attack in the two experiment groups could lead us to believe that the treatment and disease status are related.
- A common summary statistic for comparing sample proportions is the __relative proportion__: `\(p_{2}/p_{1}\)`.

---
### Two categorical variables

- Since the sample proportions in this case are related to the risk of heart attack, we refer to the relative proportion as the __relative risk__.
- Here, the relative risk of suffering from a heart attack is
`$${p_2}/{p_1} = {0.0094} / {0.0171}= 0.55$$`
- This means that the risk of a heart attack in the aspirin group is 0.55 times the risk in the placebo group.

---
### Two categorical variables

- It is more common to compare the __sample odds__,
`$$\begin{equation*} o=\frac{p}{1-p}. \end{equation*}$$`
- The odds of a heart attack in the placebo group, `\(o_1\)`, and in the aspirin group, `\(o_2\)`, are
`$$\begin{eqnarray*} o_{1} &=& \frac{0.0171}{(1-0.0171)} = 0.0174, \\ o_{2} &=& \frac{0.0094}{(1-0.0094)} = 0.0095. \end{eqnarray*}$$`

---
### Two categorical variables

- We usually compare the sample odds using the __sample odds ratio__,
`$$\begin{eqnarray*} \mathit{OR}_{21} = \frac{o_{2}}{o_{1}} = \frac{0.0095}{0.0174}= 0.54. \end{eqnarray*}$$`
- Later, we will discuss more advanced models for making statistical inference about the odds ratio using logistic regression models.
- Here, we use a simpler approach for assessing the significance of the relationship between two binary (and in general, two categorical) variables presented as a contingency table.

---
### Pearson's `\(\chi^{2}\)` Test of Independence

- As discussed above, we can use contingency tables to find the observed frequencies for different combinations of categories of the two variables.
- We denote the **observed** frequency in row `\(i\)` and column `\(j\)` as `\(O_{ij}\)`.
- Using the independence rule, we can find the **expected** frequencies under the null hypothesis, which states that the two variables are independent.
- Recall that for two independent random variables, the joint probability is equal to the product of their individual probabilities.

---
### Pearson's `\(\chi^{2}\)` Test of Independence

- Pearson's `\(\chi^{2}\)` test uses a test statistic, which we denote as `\(Q\)`, to measure the discrepancy between the observed data and what we expect to observe under the null hypothesis (i.e., assuming the null hypothesis is true).
- Note that the null hypothesis in this case states that the two variables are independent.

---
### Pearson's `\(\chi^{2}\)` Test of Independence

- We denote the expected frequency in row `\(i\)` and column `\(j\)` as `\(E_{ij}\)`.
- Pearson's `\(\chi^{2}\)` test summarizes the differences between the expected frequencies (under the null hypothesis) and the observed frequencies over all cells of the contingency table,
`$$\begin{equation*} Q = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}. \end{equation*}$$`

---
### Pearson's `\(\chi^{2}\)` Test of Independence

- In practice, we simply use the `chisq.test()` function in R.


```r
# aspirin study: rows are the placebo and aspirin groups,
# columns are heart attack and no heart attack
asp <- matrix(c(189, 10845, 104, 10933), nrow=2, ncol=2, byrow = TRUE)
asp
```

```
##      [,1]  [,2]
## [1,]  189 10845
## [2,]  104 10933
```

```r
chisq.test(asp)
```

```
## 
## Pearson's Chi-squared test with Yates' continuity correction
## 
## data: asp
## X-squared = 24.429, df = 1, p-value = 7.71e-07
```

---
### Smoking and low birthweight babies

As another example, we will examine the association between smoking during pregnancy and having low birth weight babies using the `birthwt` data set from the `MASS` package.

---
### Smoking and low birthweight babies


```r
library(MASS)
library(dplyr)   # for %>%, as_tibble(), and mutate()/across()
data("birthwt")
birthwt <- birthwt %>% as_tibble() %>% mutate(across(c(low, race, smoke, ht), as.factor))
head(birthwt)
```

```
## # A tibble: 6 × 10
##   low     age   lwt race  smoke   ptl ht       ui   ftv   bwt
##   <fct> <int> <int> <fct> <fct> <int> <fct> <int> <int> <int>
## 1 0        19   182 2     0         0 0         1     0  2523
## 2 0        33   155 3     0         0 0         0     3  2551
## 3 0        20   105 1     1         0 0         0     1  2557
## 4 0        21   108 1     1         0 0         1     2  2594
## 5 0        18   107 1     1         0 0         1     0  2600
## 6 0        21   124 3     0         0 0         0     0  2622
```

---
### Smoking and low birthweight babies


```r
tab <- table(birthwt$smoke, birthwt$low)
res <- chisq.test(tab)
res
```

```
## 
## Pearson's Chi-squared test with Yates' continuity correction
## 
## data: tab
## X-squared = 4.2359, df = 1, p-value = 0.03958
```

---
### Smoking and low birthweight babies


```r
res$observed
```

```
##    
##      0  1
##   0 86 29
##   1 44 30
```

```r
res$expected
```

```
##    
##            0        1
##   0 79.10053 35.89947
##   1 50.89947 23.10053
```
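---
### Smoking and low birthweight babies

As a small sketch (not part of the original analysis), we can reconstruct the expected frequencies and the test statistic `\(Q\)` from the formula given earlier, using the row and column totals of `tab`:


```r
# expected counts under independence: E_ij = (row total_i * column total_j) / n
n <- sum(tab)
E <- outer(rowSums(tab), colSums(tab)) / n
E    # should agree with res$expected

# the uncorrected statistic Q; chisq.test() applies Yates' continuity
# correction by default for 2x2 tables, so its X-squared is slightly smaller
Q <- sum((tab - E)^2 / E)
Q
```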
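---
### Smoking and low birthweight babies

Finally, as an added illustration (not in the original example), we can compute the sample odds ratio introduced earlier for this table, comparing smokers (`smoke = 1`) to non-smokers (`smoke = 0`):


```r
# sample proportions of low birth weight babies (low = 1) in each smoking group
p <- tab[, "1"] / rowSums(tab)

# sample odds o = p / (1 - p) and the sample odds ratio OR_21 = o_2 / o_1
o <- p / (1 - p)
o["1"] / o["0"]
```

The odds of having a low birth weight baby are roughly twice as high among the mothers who smoked during pregnancy.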