---
title: "Data Science Process"
author: "Ashutosh Nanda<br>ashutoshnanda.github.io/data-science-process"
output:
  ioslides_presentation:
    css: cdss.css
    fig_caption: yes
    logo: CDSS.png
---
## Overview
- Data Problems
- OSEMN Process
- Tips and Tricks
## Data Problems
How do we use data to solve problems?
It's not always clear how to go from data to understanding. (Real-world problems cannot be tackled by just reading in a CSV and blindly fitting a model.)
Example: Genome Sequencing <br>
Biologists were able to acquire the full instruction manual for how an organism develops, but multidisciplinary collaborations and advances in mathematical and statistical methods were needed to rigorously understand and fully utilize such sequences.
## Data Problems (continued)
Modern data problems don't come with a cookbook-style solution because domain expertise is not always available for the datasets we are analyzing.
> `-` research in which data vastly outstrip our ability to posit models is qualitatively different
> <br>
> `-` complex systems for which the underlying models are not yet known but for which data are abundant
> <br>
> [Applying Big Data Approaches to Biological Problems, Chris Wiggins (2012)](http://engineering.columbia.edu/web/newsletter/fall_2012/applying_big_data_approaches_biological_problems)
<br> Hence, we need a **process** of doing data science.
## Process of Doing Data Science
Commonly Recommended Process: [OSEMN](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
- Sounds like "awesome"
- Stands for:
    + Obtain
    + Scrub
    + Explore
    + Model
    + iNterpret
## Obtain
- Tends to be overlooked, but is in fact critical
- Process needs to be scalable
    + Using APIs
        * Python: [`requests`](http://docs.python-requests.org/en/latest/)
        * R: [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-apis.html)
    + SQL Queries
        * Python: [`pandas`](http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries) (with [SQLAlchemy](http://www.sqlalchemy.org/))
        * Python: [`pandas`](http://www.datacarpentry.org/python-ecology/08-working-with-sql) (with [SQLite](https://www.sqlite.org/))
        * R: [`dplyr`](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)
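
A minimal sketch of the SQL route, using Python's built-in `sqlite3` module with an in-memory database so it runs without any setup. The `measurements` table and its columns are made up for illustration; with `pandas`, the same query string could be handed to `pandas.read_sql_query` instead.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 10.0)],
)

# Let the database aggregate instead of pulling every row into memory
rows = conn.execute(
    "SELECT site, AVG(value) FROM measurements GROUP BY site ORDER BY site"
).fetchall()
conn.close()
print(rows)
```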
## Obtain
- Tends to be overlooked, but is in fact critical
- Process needs to be scalable
    + Command Line Tools
        * UNIX Tools: `cat`, `grep`, `uniq`, `sort`, `sed`, `awk`
        * [Data Science at the Command Line: Facing the Future with Time-Tested Tools (Janssens; 2014)](http://shop.oreilly.com/product/0636920032823.do)
    + Web Scraping
        * Python: [`BeautifulSoup`](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)
        * R: [`rvest`](http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/)
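
To give a feel for web scraping, here is a small sketch that pulls link targets out of a page. It swaps in the standard library's `html.parser` for `BeautifulSoup` so there is nothing to install; the `LinkCollector` class and the page content are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Hypothetical page content; a real scraper would download this first
page = (
    '<html><body>'
    '<a href="/data/2014.csv">2014</a> '
    '<a href="/data/2015.csv">2015</a>'
    '<a name="top">no link here</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```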
## Obtain
<div class="notes">
You have some sort of folder organization, probably data, explore_plots, reports, etc. (Will get into it later)
</div>
Pseudo-code:
```
raw_data_folder = "<your folder here>"
list_of_files = get_list_of_files()
for file in list_of_files:
    if (not using cache) or (file does not exist):
        process(file, raw_data_folder)
```
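
One way the caching loop above might look as runnable Python. This is only a sketch: `obtain` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever API call or download actually produces the data.

```python
from pathlib import Path
import tempfile

def obtain(names, raw_data_folder, fetch, use_cache=True):
    """Fetch each named file unless a cached copy already exists."""
    folder = Path(raw_data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in names:
        target = folder / name
        if (not use_cache) or (not target.exists()):
            target.write_text(fetch(name))
            fetched.append(name)
    return fetched

def fake_fetch(name):
    # Stand-in for a real API call or download
    return f"contents of {name}\n"

with tempfile.TemporaryDirectory() as tmp:
    first = obtain(["a.csv", "b.csv"], tmp, fake_fetch)   # nothing cached yet
    second = obtain(["a.csv", "b.csv"], tmp, fake_fetch)  # everything cached
    forced = obtain(["a.csv"], tmp, fake_fetch, use_cache=False)
    print(first, second, forced)
```

Returning the list of files that were actually fetched makes it easy to see whether the cache was used.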
## Scrub
Real-world data is messy, so we need to clean it up before processing it.<br>
Solution: [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf)
[Elements of Tidy Data](http://jtleek.com/modules/03_GettingData/01_03_componentsOfTidyData/#2) according to Jeff Leek of JHU Data Science MOOC Fame:
<div style="margin-top: -20px">
1. Raw data
2. Tidy data set
3. A code book describing each variable and its values in the tidy data set
4. An explicit and exact recipe you used to go from 1 to 2 and 1 to 3
</div>
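
In the Tidy Data paper's terms, each variable forms a column and each observation forms a row. A small sketch of going from a "wide" table to a tidy one, in plain Python; the `melt` helper, the column names, and the numbers are all made up for illustration.

```python
# Hypothetical "wide" table: one row per country, one column per year
wide = [
    {"country": "X", "2014": 10, "2015": 12},
    {"country": "Y", "2014": 7, "2015": 9},
]

def melt(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy_rows.append(
                    {id_col: row[id_col], var_name: key, value_name: value}
                )
    return tidy_rows

tidy = melt(wide, id_col="country", var_name="year", value_name="cases")
for record in tidy:
    print(record)
```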
## Scrub
> A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
> <br>
> [Hilary Mason and Chris Wiggins, A Taxonomy of Data Science (2010)](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
<br>
You should be able to explain your data in simple sentences, rather than through contrived transformations that are unintuitive or hard to follow.
## Scrub
Pseudo-code:
```
def clean_up(f):
    f = transformation_1(f)
    f = transformation_2(f)
    f = transformation_3(f)
    .
    .
    .
    return f
for file in raw_data_folder:
    clean_data = clean_up(file)
    store(clean_data, clean_data_folder)
```
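
A runnable sketch of the same idea, assuming a small CSV with a `site` and a `value` column (both hypothetical). Each transformation is a plain function, so the whole clean-up can be described one sentence per step.

```python
import csv
import io

def strip_whitespace(rows):
    """Remove stray spaces around column names and values."""
    return [{k.strip(): v.strip() for k, v in row.items()} for row in rows]

def drop_missing(rows):
    """Discard rows with any empty field."""
    return [row for row in rows if all(v != "" for v in row.values())]

def parse_numbers(rows):
    """Convert the 'value' column from text to float."""
    return [{**row, "value": float(row["value"])} for row in rows]

def clean_up(rows):
    for step in (strip_whitespace, drop_missing, parse_numbers):
        rows = step(rows)
    return rows

# Hypothetical raw file with stray spaces and a missing value
raw = io.StringIO("site, value\nA , 1.5\nB,\nC, 3\n")
clean = clean_up(list(csv.DictReader(raw)))
print(clean)
```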
## Explore
<div class="notes">
Define feature engineering
</div>
- Best ideas for feature engineering start here
    + Feature engineering: designing new variables that capture patterns in data
    + Most models perform roughly equivalently once you get enough data
- "Dirty" plots will help with this
    + Don't have to be pretty, have axis labels, etc.
    + Do have to be fast, so that we can iterate
- Notebook/R Markdown will allow you to note ideas that come in handy during modeling
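
As a concrete (and entirely made-up) instance of feature engineering, the sketch below derives two new variables from a raw timestamp: whether the event fell on a weekend, and the hour of day. The record layout and the `add_time_features` helper are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical raw records: a timestamp and an amount
records = [
    {"timestamp": "2016-01-02 09:30:00", "amount": 12.0},
    {"timestamp": "2016-01-04 18:45:00", "amount": 30.0},
]

def add_time_features(record):
    """Derive weekend and hour-of-day columns from the raw timestamp."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    # weekday() is 0 for Monday through 6 for Sunday
    return {**record, "is_weekend": ts.weekday() >= 5, "hour": ts.hour}

features = [add_time_features(r) for r in records]
print(features)
```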
## Explore
Pseudo-code:
```
for column in variables:
    save(histogram of column)
for col1, col2 in pairs(variables):
    save(scatterplot of (col1, col2))
```
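
A dependency-free sketch of the two loops above. It prints text histograms instead of saving image files, which is about as "dirty" as a plot can get; a plotting library would normally take the place of `text_histogram`. The dataset and the helper name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dataset: column name -> values
variables = {
    "age": [23, 25, 31, 38, 41, 47],
    "income": [30, 32, 45, 52, 60, 75],
    "visits": [1, 1, 2, 2, 2, 5],
}

def text_histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    return Counter((v // bin_width) * bin_width for v in values)

histograms = {}
for column, values in variables.items():
    histograms[column] = text_histogram(values, bin_width=10)
    print(column)
    for edge in sorted(histograms[column]):
        print(f"  {edge:>3} | {'#' * histograms[column][edge]}")

# Every unordered pair of columns, as in the scatterplot loop
pairs = list(combinations(variables, 2))
print(pairs)
```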
## Model
- Make sure to use cross validation for hyperparameter tuning
- Make the right tradeoff between interpretability and performance
    + If you just care about prediction, go nuts with a deep learning network
    + If the goal is analysis and understanding, try something simpler like a Support Vector Machine
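
To make the cross-validation point concrete, here is a hand-rolled sketch that picks the `k` of a tiny 1-D nearest-neighbor classifier by 3-fold cross validation. Everything in it (the helpers, the toy data, the candidate values) is made up for illustration; in practice scikit-learn's `GridSearchCV` or `cross_val_score` would usually do this work.

```python
def k_fold_indices(n, n_folds):
    """Split range(n) into n_folds interleaved validation folds."""
    return [list(range(start, n, n_folds)) for start in range(n_folds)]

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def cv_accuracy(data, k, n_folds=3):
    """Mean validation accuracy of k-NN across the folds."""
    scores = []
    for fold in k_fold_indices(len(data), n_folds):
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        valid = [data[i] for i in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in valid)
        scores.append(correct / len(valid))
    return sum(scores) / len(scores)

# Toy 1-D dataset (made up): small x -> class 0, large x -> class 1
data = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(6, 13)]

candidates = [1, 3, 5]  # hyperparameter values to compare
scores = {k: cv_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The test set stays untouched during this search; it is only used once, to score the model trained with the chosen hyperparameter.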
## Model
Pseudo-code:
```
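# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X (features) and y (labels) are assumed to be loaded already
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)  # mean accuracy on the held-out set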
```
## iNterpret
> The purpose of computing is insight, not numbers.
> <br>
> [Richard Hamming, Numerical Methods for Scientists and Engineers (1962)](http://www-history.mcs.st-and.ac.uk/Extras/HammingReviews.html)
<br>
Besides you and other data nerds, no one really cares about the test error your model obtained; what people really care about is the *outcome* of using your model.
The best way to let people understand your model is to let them play with it.
## iNterpret
![](shiny.png)
<br>
Shiny lets you build interactive web applications using only R code. (You can edit the resulting HTML, CSS, and JavaScript code.)
## iNterpret
![](bokeh.png)
<br>
Bokeh lets you generate interactive web applications using Python and JavaScript code.
## iNterpret
Pseudo-code:
```
make_interactive_data_product()
```
## Tips and Tricks
- 80/20 Split on Cleaning vs Modeling
- More Data Wins
- Reproducibility
- Iteration
## The 80/20 Split
<div class="notes">
Not just for training and testing proportions!
</div>
- Roughly 80% of the time goes to preparing data and 20% to modeling it
- You have to get the 80% right for the 20% to yield interesting results
## More Data Wins
Scaling to Very Very Large Corpora for Natural Language Disambiguation (Banko, Brill; 2001)
![](more_data.png)
## Reproducibility
- We need to be able to retrace our steps
    + Helpful for debugging
- Others need to recreate our results
    + Gives credibility to our findings
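
One small habit that helps with retracing steps: seed every source of randomness and record the seed with the results. A minimal sketch, assuming a hypothetical `shuffled_split` helper and an arbitrary seed value:

```python
import random

def shuffled_split(items, test_fraction, seed):
    """Shuffle with a private, seeded generator and split into train/test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

SEED = 2016  # arbitrary; what matters is recording it alongside the results
run_1 = shuffled_split(range(10), test_fraction=0.2, seed=SEED)
run_2 = shuffled_split(range(10), test_fraction=0.2, seed=SEED)
print(run_1 == run_2)
```

Using a private `random.Random(seed)` rather than the module-level generator keeps the split reproducible even if other code draws random numbers in between.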
## Iteration
![An astute comment from Hadley Wickham](wickham_comment.png)
- creator of `ggplot2`, `dplyr`, `devtools`
- Chief Scientist at RStudio
- ... yeah, that Hadley Wickham
## Thanks
Any questions?
## Next Steps
- Look at the links in the presentation
- So many good resources!
- [DevFest Data Science Curriculum](http://learn.devfe.st/datascience/)
- Stay tuned for more CDSS events!
- [Mailing List](https://lists.columbia.edu/mailman/listinfo/cdss)
- [Facebook](https://www.facebook.com/cdsscu)
- [Website](http://cdssatcu.com/)