-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbetterDataManagement.Rpres
674 lines (486 loc) · 20.8 KB
/
betterDataManagement.Rpres
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
<style>
.footer {
color: #434343;
background: #ffffffff;
position: fixed;
top: 90%;
text-align: left;
width: 100%;
}
.header {
color: black;
background: #E8E8E8;
position: fixed;
bottom: 90%;
text-align:center;
width:100%;
}
.small-code pre code {
font-size: 0.9em;
}
.column-left{
float: left;
width: 50%;
text-align: left;
}
.column-right{
float: right;
width: 50%;
text-align: right;
}
.center{
left: 50%;
text-align: center;
}
</style>
effective data management
========================================================
author:
date: 2017-11-17
autosize: true
font-family: 'Arial'
width: 1920
height: 1080
workflow
========================================================
title: false
research data workflow: this is your world (or it will be soon)
<div align="center" style="margin-top:0%;">
<img src="./figures/workflow.png" width="90%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/) who adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)</p>
</div>
challenges
========================================================
title: true
* how do you keep that up-to-date?
* what if something changes, what if something needs to be redone - how do you manage that?
* why do the results in table 1 not seem to correspond to the results in figure 1?
* why were those particular samples omitted?
* where did I get these data?
* how did I make that figure?
<br>
<b>basically, if the thought of redoing your analyses is terrifying then you are doing it wrong (paraphrasing Jenny Bryan)</b>
workflow with data
========================================================
title: false
<p style="margin-bottom:0;">research data workflow: this is your world (or it will be soon)</p>
<p style="margin-bottom:0;">now add to that publishing your <span style="color:#71A3C3;">data and code</span></p>
<div align="center" style="margin-top:0%;">
<img src="./figures/workflow_with_data.png" width="90%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>modified from Jenny Bryan’s UBC Stat 545 course (http://stat545.com/) who adapted it from Roger Peng (biostat.jhsph.edu/~rpeng/)</p>
</div>
reproducible research
========================================================
title: false
<br>
<p>
<b>reproducible</b>: the calculation of quantitative sciencific results by independent researchers using the original data and methods (National Science Foundation Subcommittee on Replicability in Science)
</p>
<br>
<p>
not as stringent as <b>replicable</b>: can someone repeat the experiment and get the same result
</p>
<div class="footer" style="font-size:60%;">
<hr>
<p>Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016</p>
</div>
what we will cover today
========================================================
title: true
* best practices
+ naming
+ organiztion
+ scripting
* spreadsheets
* literate programming
* version control
* getting started
* data management plans & publishing your data
BP (Best Practice): naming - principles for file names
========================================================
title: true
* machine readable
* human readable
* plays well with default ordering
<div class="footer" style="font-size:60%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: naming - machine readable
========================================================
title: true
* regular expression and globbing friendly
+ avoid spaces, punctuation, accented characters, case sensivity
* easy to compute on with deliberate use of delimiters
<p style="margin-bottom:0;">example: <b>"2017-11-17_berneilwash_oxygen_day_1.csv"</b></p>
<p style="margin-bottom:0;">underscores allow us to delimit units of metadata and facilitate searching</p>
<br>
* easy to search for files later
* easy to narrow file lists based on names
* easy to extract information from the file names, e.g., by splitting
* avoiding spaces, punctuation, accented characters, case sensivity will make your life much easier
<div class="footer" style="font-size:60%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: naming - human readable
========================================================
title: true
* names contain info about the content
* easy to figure out what what something is based on the name
for example <i><b>2016_salmon_counts.csv</i></b> actually conveys a lot of information about the object, and has a whole lot more meaning than <i><b>fishData.csv</i></b>
<div class="footer" style="font-size:60%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: naming - names matter in times of stress
========================================================
title: true
which set of file(names)s would you prefer at 3 a.m. before a dealine?
<div align="center" style="margin-top:0%;">
<img src="./figures/good_bad_names.png" width="75%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:65%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: naming - plays well with default ordering
========================================================
title: true
* put something numeric first
* use the ISO 8601 standard for dates (YYYY-MM-DD) everwhere, always without exception ever
* left pad numbers with zeroes as needed
+ 1_file_name.csv
+ 11_file_name.csv
+ 2_file_name.csv
+ --------------------
+ 01_file_name.csv
+ 02_file_name.csv
+ 11_file_name.csv
<div class="footer" style="font-size:60%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: dates matter (a lot!)
========================================================
title: true
<div align="left" style="margin-top:0%;">
<img src="./figures/noniso.png" width="50%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<small>
comprehensive map of all countries in the world that use MMDDYYYY format
use the ISO 8601 standard for dates (YYYY-MM-DD) everwhere, always without exception ever
</small>
<div class="footer" style="font-size:60%;">
<hr>
<p>"naming things" by Jenny Bryan https://speakerdeck.com/jennybc/how-to-name-files</p>
</div>
BP: employ sound project organization
========================================================
title: true
left: 80%
<small>your closest collaborator is you six months ago, but you do not reply to emails (Karl Broman paraphrasing Mark Holder)</small>
<div style="float: right; width: 40%;">
<br>
<ul>
<li>make the project understandable to others where others includes your future self</li>
<li>segregate all the materials for a project in one directory</li>
<li>separate raw from processed data; put code in a separate directory</li>
<li>include README files</li>
</ul>
</div>
<div style="float: left; width: 60%;">
<img src="./figures/projectOrganizationCropped.png" width="60%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016</p>
</div>
BP: research compendium
========================================================
title: true
...a standard and easily recognisable way for organising the digital materials of a project to enable others to inspect, reproduce, and extend the research
general principles:
* organize according to prevailing conventions (e.g., R package structure)
* maintain a clear separation of data, method, and output, while unambiguously expressing the relationship between the three
* specify the computational environment used for the original analysis
* organize such that another person can know what to expect from the plain meaning of the file and directory names
<div class="footer" style="font-size:60%;">
<hr>
<p>Marwick B, Boettiger C, Mullen L. (2017) Packaging data analytical work reproducibly using R (and friends) PeerJ Preprints 5:e3192v1 https://doi.org/10.7287/peerj.preprints.3192v1</p>
</div>
BP: keep the raw data raw
========================================================
title: true
* save the raw data
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: thoughtful curation
========================================================
title: true
* curate your data in a way that you would like to receive it
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: open file formats
========================================================
title: true
* use open file formats (e.g., csv not xlsx)
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: analysis-friendly data
========================================================
title: true
* create analysis-friendly data:
+ each column a variable
+ each row an observation
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: create analysis-friendly data
========================================================
title: false
* each column a variable
* each row an observation
```{r, echo=FALSE}
knitr::kable(head(iris))
```
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: create analysis-friendly data
========================================================
title: false
* each column a variable
* each row an observation
```{r, echo=FALSE}
library(tidyverse)
knitr::kable(head(gather(head(iris), key = "metric", value = "value", -Species), n=15))
```
<div class="footer" style="font-size:60%;">
<hr>
<p>Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510</p>
</div>
BP: record everything
========================================================
title: true
* record all the steps of the data process
BP: everything in a script
========================================================
title: true
<p style="color:#0000FF">The most basic principle for reproducible research is: do everything via code</p>
* downloading data from the web,
* converting an Excel file to CSV,
* renaming columns or variables,
* omitting bad samples or data points
* ...do all of these with scripts
You will be tempted to open up a data file and hand-edit. But if you get a revised version of that file, you will need to do it again, and it will be harder to figure out what it was that you did.
Some things are more cumbersome via code but you will save time in the long run.
<div class="footer" style="font-size:60%;">
<hr>
<p>Steps toward reproducible research, Karl Broman, Biostatistics & Medical Informatics Univ. Wisconsin–Madison, kbroman.org, github.com/kbroman, @kwbroman, Slides: bit.ly/jsm2016</p>
</div>
BP: learn a language. any language, just do it
========================================================
title: true
<div style="float:left; width:50%; margin-top:5%;">
<img src="./figures/rlogo.jpg" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div style="float: right; width: 50%; margin-top:6%;">
<img src="./figures/python-logo.png" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
spreadsheets: the dark side
========================================================
title: true
<div align="center" style="margin-top:5%;">
<img src="./figures/excel_errors_wash_post.png" width="50%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
spreadsheets: the 2nd best tool for everything
========================================================
title: true
<div style="float:left; width:60%; margin-top:0%;">
<img src="./figures/spreadsheet_hidden_data.png" width="90%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div style="float: right; width: 40%;">
the beauty and the travesty of spreadsheets is that they allow you to do just about anything
<br>
<br>
<ul>
<li>data in merged cells</li>
<li>data in formatting</li>
<li>small multiples</li>
<li>data in formulas</li>
</ul>
<p>use open file formats (e.g., CSV, not XLSX)</p>
<br>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>"spreadsheets" by Jenny Bryan https://speakerdeck.com/jennybc/spreadsheets</p>
</div>
spreadsheets versus scripting
========================================================
title: false
consider, we want to remove samples that we feel may have been contaminated...
<div style="float:left; width:50%;">
<p>
<b>in a R script</b>
</p>
<hr>
<p>...some work...</p>
<p style="margin-bottom:0;"># remove samples 4, 5, 6, that may</p>
<p style="margin-bottom:0;"># have been compromised due to</p>
<p style="margin-bottom:0;"># wading upstream during sampling</p>
<br>
<p style="margin-bottom:0;">chemistry_data %>%
<p style="margin-bottom:0; text-indent:1em;">filter(!sample_id %in% c(4,5,6))</p>
<br>
<p>...more work...</p>
</div>
<div style="float:right; width:49%;">
<p>
<b>in a spreadsheet</b>
</p>
<hr>
<img src="./figures/spreadsheet_remove_samples.png" width="90%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
literate programming
========================================================
title: true
a common approach
<div>
<img src="./figures/literateProgramming.png" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"</p>
</div>
literate programming
========================================================
title: true
* the problem is that the gaps are particularly prone to errors
* literate programming is essentially the integration of code and text
<small>Knuth, D. E. (1992), Literate programming, CSLI Lecture Notes, Stanford, CA: Center for the Study of Language and Information (CSLI), 1992</small>
<div>
<img src="./figures/literateProgramming.png" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"</p>
</div>
version control
========================================================
title: true
<div align="center" style="margin-top:0%;">
<img src="./figures/CAP_renewal_screenshot.png" width="80%" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
version control: Git
========================================================
title: true
* Git watches repositories (like a directory) for changes
* It asks that you describe changes when they are made
* It remembers old versions if you need them
* It also keeps an eye out for conflicts, and forces you to resolve them
* It (through GitHub) allows multiple people to contribute to the same repository, and does all of the above for everyone at once
<div class="footer" style="font-size:60%;">
<hr>
<p>"Git" by Jeff Goldsmith https://speakerdeck.com/jeffgoldsmith/dsi-git-and-github</p>
</div>
version control: Git and GitHub
========================================================
title: true
<p align="center" style="margin-top:10%;">
<b>Git != GitHub</b>
</p>
<br>
* Git lives on your computer
* GitHub is a web-based platform for storing collaboration and facilitating collaboration
<div class="footer" style="font-size:60%;">
<hr>
<p>"Git" by Jeff Goldsmith https://speakerdeck.com/jeffgoldsmith/dsi-git-and-github</p>
</div>
I am not saying it is easy
========================================================
title: true
Kieran Healy on two revolutions in computing:
"On one side, the mobile, cloud-centered, touch-screen, phone-or-tablet model has brought powerful computing to more people than ever before."
On the other side, tools for coding, data analysis, and writing are also revolutionary but mostly work by gluing together separate, specialized widgets that do much less to hide the operation system layer, and require knowledge of things like the file system.
<div class="footer" style="font-size:60%;">
<hr>
<p>"The Plain Person's Guide to Plain Text Social Science" version 2017-06-19 by Kieran Healy "https://kieranhealy.org/files/papers/plain-person-text.pdf"</p>
</div>
OHI: nature ecology & evolution
========================================================
title: true
<div align="center" style="margin-top:0%;">
<img src="./figures/lowndes_title.png" width="80%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160</p>
</div>
OHI: framework
========================================================
title: true
<div align="center" style="margin-top:0%;">
<img src="./figures/ohi_framework.png" width="80%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160</p>
</div>
OHI: evolution of a workflow
========================================================
title: true
<div align="center" style="margin-top:0%;">
<img src="./figures/lowndes_figure1.jpg" width="60%"" style="background-color:transparent; border:0px; box-shadow:none;"></img>
</div>
<div class="footer" style="font-size:60%;">
<hr>
<p>Our path to better science in less time using open data science tools. Julia S. Stewart Lowndes, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang & Benjamin S. Halpern. Nature Ecology & Evolution 1, Article number: 0160 (2017) doi:10.1038/s41559-017-0160</p>
</div>
BP: forethought
========================================================
title: true
strive for reproducibility from the outset
data management plan
========================================================
title: true
describes how data will be collected, managed, and preserved
for example, NSF's generic guidelines:
* roles and responsibilities
* types of data produced
* data and metadata standards
* policies for access and sharing
* policies for reuse, redistribution
* plans for archiving and preservation
publishing your data
========================================================
title: true
RDM course slide
========================================================
title: false
<b>Research Data Management</b>
Seminar: SOS 598 (24085)
When: Spring 2018
Day/time: Friday, 12:15-1:30 PM
1 credit hour
bookmarks
========================================================
title: false
<!-- have these bookmarks open: -->
<!-- https://www.re3data.org/ -->
<!-- https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=74 -->