-
Notifications
You must be signed in to change notification settings - Fork 14
/
io2a-alt_intro.Rmd
189 lines (152 loc) · 8.27 KB
/
io2a-alt_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Alternate Introduction"
author: "Peter Higgins"
date: "5/3/2020"
output: html_document
---
## Introduction {#intro}
There are many books about Data Science.
<br>
Why does the world need another one, particularly one targeting physicians?
- There is a lot of health care data
- There are a lot of interesting questions in health care
- There are particular and challenging issues in doing data analysis with PHI (Protected Health Information)
Syllabus: Data Science for Physicians (DS4P)<br>
- Instructor: Peter Higgins, MD, PhD, MSc (CRDSA), Professor of Internal Medicine
- Office Hours: MSRB One 6510
- In-person class time
- MSRB One 6510, Thursday evenings 6:30-8:30 PM
<br>
## Course Description and Objectives
### Description
A practical introduction to data collection and security, data cleaning, statistical methods and computational tools needed to make sense of data, and methods for reporting and sharing your findings. This course is not a traditional introductory statistics courses in that computing plays a more central role than mathematics and a higher emphasis is placed on “thinking with data.” Topics include
- secure HIPPA-compliant data collection
- data cleaning and validation
- data visualization
- data wrangling
- confidence intervals
- hypothesis testing, and
- regression
The course has no mathematics or computer science prerequisites.
<br>
### Objectives
1. Have students engage in the data/science research pipeline in as faithful a manner as possible while maintaining a level suitable for novices.
2. Foster a conceptual understanding of statistical topics and methods using real clinical data whenever possible, and simulation/resampling to support teaching concepts of inference.
3. Use a flipped classroom model by incorporating online learning for new concepts, with limited face-to-face time for real-time problem-solving
4. Introduce best practices for reproducible research and collaboration.
5. Develop statistical literacy by, among other ways, tying in the curriculum to actual clinical data, demonstrating the importance statistics and computing plays in advancing medicine
<br>
### Topics
Roughly speaking we will cover the following topics (a more detailed outline is found below:
1. Introduction and Tools (R, RStudio, and R Markdown)
2. Data Import, and Handling
3. Data Collection
4. Checking, Validating, And Exploring your Data
5. Data Types
6. Data Wrangling with Tidyr and Dplyr
7. Graphic Summaries for a Single Variable – ggplot package
8. Descriptive Data for a Single Variable
9. Graphic Summaries for Two or More Variables – ggplot2
10. Descriptive Data for Two or More Variables
11. Presenting your Results in a report with RMarkdown
12. Statistical inference
13. Study Design
14. Sample Size and Power
15. Sources of Bias
16. Study Types
17. One variable, single group
18. One variable, two groups
19. Multiple groups
20. Linear Regression
21. Reporting results interactively with Shiny
22. Logistic Regression
23. Meta-Analysis
<br>
### Learning Resources
• E-Textbooks:
Open Intro Statistics, at www.openintro.org
• E-Books on R
<br>
These are at different levels:
Level: Absolute Beginner
Textbook: R Basics
Goal: Set up R and RStudio on a laptop, introduce the concept of an IDE
Link:
Level: New to R & Statistics
Textbook: Modern Dive
Goal: Learn basics of Data Management and visualization, introduction to hypothesis testing and statistical modeling
Link:
<br>
Level: Comfortable with R
Textbook: Hands-On R Programming
Goal:
Link:
<br>
Level: Ready to Understand More
Textbook: R for Data Science
Link:
<br>
• Software:
- Local laptop/desktop free open-source version of R and RStudio
- Cloud-based RStudio Server, which you can access in your browser via:
<br>
Note if you are off-campus you must first log into the UM VPN.
<br>
• Online:
- DataCamp. A brower based interactive tool for learning R through short, focused courses, each 3-4 hours long.
- RStudio. Website with many resources for learning about the RStudio IDE and the tidyverse.
- r-cookbook – an often useful website with concrete examples of how to use R packages
- Stack Overflow and Google. Remarkably helpful to search for explanations of error messages, or explanations of problems that someone else has probably also experienced. For using Google, search for any topic or your error message and add “in R”
- package vignettes – variable quality, but when well done, can be extremely helpful examples of how to use the functions in each package
- R twitter – follow Rbloggers, #rstats
<br>
### Evaluation
This course is entirely voluntary. I hope that you will learn valuable skills that will advance your research career. I would like you to progress to using these skills on your own data as quickly as possible, as this will greatly help you reinforce your new skills. There are no grades and no formal evaluations. You can, however, earn certificates on DataCamp for completing courses.
<br>
### Task Goals
1. Learn concepts through RYouWithMe
2. Test yourself with assignments in ModernDive
3. Challenges
a. Clean data and perform descriptive data analysis on the biofire dataset
b. Clean data and model outcomes in the health satisfaction dataset, producing a final report
c. Use logistic regression to model dichotomous outcomes and produce a Shiny app to allow users to make predictions for future patients
4. Final Project
There will be a final capstone project. This is an opportunity for you to use your statistics and data science skills developed during the challenges and perform your own start-to-finish data analysis project. The project will involving you addressing a scientific question by choosing a data set (or preferably, using one of your own), performing an analysis using the concepts and tools we have covered in this course, and writing a report. This can be done solo or with a partner.
<br>
### Learning Goals
1. Recognize the importance of data collection, identify limitations in data collection methods, and determine how they affect the generalizability of your findings
2. Use statistical software (R) to summarize data numerically and visually, and to perform data analysis.
3. Have a conceptual understanding of statistical inference.
4. Apply estimation and testing methods to analyze single variables or the relationship between two variables in order to understand data relationships and make data-based conclusions.
5. Model numerical response variables and dichotomous response variables using a single explanatory variable or multiple explanatory variables in order to investigate relationships between variables.
6. Interpret results correctly, effectively, and in context without relying on statistical jargon.
7. Critique data-based claims and evaluate data-based decisions.
<br>
### Tips for success
1. Read materials for each unit
2. Study intro materials at your own speed at [RYouWithMe?](https://rladiessydney.org/courses/ryouwithme/)
3. Save your code as RMarkdown or RScript Files
4. Annotate your code to help ‘future you’ understand it.
5. Work on your own data and projects within RStudio Projects
6. Use the {here} package in your projects to manage paths to files
7. For important projects that you will want to revisit later (make reproducible), maintain project-specific libraries with the {renv} package
8. Try the *Challenges* in each chapter - ask for help from colleagues if you get stuck
9. Learn how to make a *minimal reprex* with the {reprex} package and post questions to the [RStudio Community](https://community.rstudio.com)
10. Learn from each project - save and reuse your code (and functions) for future projects
<br>
### Expected work load
This course is entirely voluntary. It is expected that you have lots of clinical and/or research work to keep up with, along with the occasional call or night rotation. This is an investment in future skills to help your career. I recommend that you try to do up to one hour a day on most days, and on days when that is not realistic, to just do the 10 minutes of daily practice on DataCamp to keep the information fresh in your mind.
<br>
Other learning resources:
<br>
<br>
<br>
## Access RStudio cloud
<br>
## R basics E-book
<br>
Use the e-book Rbasics by Chester Ismay
https://ismayc.github.io/rbasics-book/
<br>
## RStudio tips document
<br>