Materials and code for Harvard DataFest workshop! This README contains my notes.
Recommended reading:
- A Quick Guide to Organizing Computational Biology Projects
- Best Practices for Scientific Computing
- Computing Workflows for Biologists: A Roadmap
Data is always gathered in specific ways with specific assumptions. Digital data is a specific encoding, in bits, of other data, and corruption can occur at each layer. Choices about how to structure categories, how to serve databases, and so on are all significant.
Always consider:
- Culture of practice
- Institutional guidelines
- Funder policies
Things that may help:
- Create a DMP that thinks through the whole process (including curation afterwards)
- Systematic version control
- File formats and metadata
- Software is part of 'the data' - document full pipelines
- Workflow support tools (e.g., GitHub)
- Make the work citable and archive it
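As a minimal sketch of putting the version-control and archiving habits above into practice with Git, the directory names and the `.gitignore` entry below are illustrative assumptions, not the workshop's actual layout:

```shell
# Create an assumed project skeleton: raw data, code, and derived results
mkdir -p myproject/data myproject/scripts myproject/results
cd myproject

git init -q                      # systematic version control from day one
echo "results/" > .gitignore     # track code and raw data; regenerate derived outputs

git add .gitignore
# -c flags avoid relying on a global Git identity in this sketch
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial project skeleton"
```

From here, tagging a release and depositing it in an archive (e.g., Zenodo via GitHub) makes the work citable.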
A useful data management reference
Replicable research != Reproducible research
Replicable means doing the same research with new data. Full replication isn't always feasible: resources for gathering new data may be limited, or the original research may already have sampled the universe of cases. Reproducible means there is sufficient information for others to use the same procedures, code, and data to arrive at the same findings.
"Quantitative data science is the creation of a computer programme to gather, transform, analyse data and present the results.
Literate programming paradigm: the human-readable presentation of a programme is interspersed with computer source code, and the two are compiled together (see Knuth 1992)."
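As a sketch, a literate-programming document in R Markdown (the Rmd format used in this workshop) interleaves prose with executable code chunks; the title and values below are illustrative assumptions:

````
---
title: "Example analysis"
output: html_document
---

The mean of our (hypothetical) measurements is computed by the chunk below,
and its result appears in the rendered document alongside this prose.

```{r}
x <- c(1.2, 3.4, 5.6)  # assumed example data, not from the workshop
mean(x)
```
````

Rendering the file (e.g., with `rmarkdown::render()`) compiles prose and code output together into a single report.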
For the rest of the workshop, see the slides and Rmd files.