class: title-slide <br> <br> .right-panel[ <br> # Data Science ## Jessica Jaynes ] --- ## What is data science - Data-intensive studies have led to a new paradigm in scientific research. - Within this new paradigm, there are emerging challenges involving analysis of large-scale datasets. - To tackle these challenges, the field of __data science__ brings together _statistics, computer science_, and _mathematics_ to solve data intensive problems. - At its core, data science relies on statistical methods to solve scientific problems. These methods have their foundation in mathematics and are implemented using computational techniques to solve real-life problems in a wide range of scientific fields. --- class: inverse center middle .font100[Data Science in Scientific Studies] --- ## Some general terminologies - To study a population, we measure a set of characteristics, which we refer to as __variables__. - The objective of many scientific studies is to learn about the __variation__ of a specific characteristic (e.g., BMI, disease status) in the population of interest. - In many studies, we are interested in possible __relationships__ among different variables. --- ## Some general terminologies - We refer to the variables that are the main focus of our study as the __response__ (or target) variables. - In contrast, we call variables that explain or predict the variation in the response variable as __explanatory__ variables or __predictors__ depending on the role of these variables. - Statistical analysis begins with a scientific problem usually presented in the form of a __hypothesis testing__, __estimation__, or __prediction__ problem. --- ## Alzheimer's data <img src="img/AlzheimerData.png" width="90%" style="display: block; margin: auto;" /> --- ## Study design Scientific studies work better and have a higher probability of success if we plan them ahead. Common study designs include: - Survey studies: Researchers collect information from individuals through some questions - Observational studies: Researchers are passive examiners, trying to have the least impact on the data collection process. - Experiments: Researchers attempt to control the process as much as possible. --- ## Sampling - We cannot of course implement our study for the whole population of interest. - Instead, we select a __sample__ of representative members from the population. --- ## Statistical inference - We collect data on a sample from the population in order to learn about the whole population. - Note that in general the characteristics, relationships, and realities in the whole population always remain unknown. - Therefore, there is always some __uncertainty__ associated with our inference. - In Statistics, the mathematical tool to address uncertainty is __probability__. --- ## Statistical inference - The process of using the data to draw conclusions about the whole population, while acknowledging the extent of our uncertainty about our findings, is called __statistical inference__. - The knowledge we acquire from data through statistical inference allows us to make decisions with respect to the scientific problem that motivated our study and our data analysis. - In this program, we will discuss some core statistical inference methods and use them to solve some scientific problems related to Alzheimer's disease. - Before doing that, however, we need to understand the problem and carefully explore (and describe) the data; this will be the focus of Week 1.