Materials and code for Harvard DataFest workshop! This README contains my notes.
Recommended reading:
- A Quick Guide to Organizing Computational Biology Projects
- Best Practices for Scientific Computing
- Computing Workflows for Biologists: A Roadmap
Data is always gathered in specific ways with specific assumptions. Digital data is a specific encoding, in bits, of other data, and corruption can occur at each layer. Choices about how to structure categories, how to serve databases, and so on are all significant.
Always consider:
- Culture of practice
- Institutional guidelines
- Funder policies
Things that may help:
- Create a DMP that thinks through the whole process (including curation afterwards)
- Systematic version control
- File formats and metadata
- Software is part of 'the data' - document full pipelines
- Workflow support tools (e.g., GitHub)
- Make the work citable and archive it
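As a minimal sketch of putting the version-control and archiving habits above into practice with Git, the directory names and the `.gitignore` entry below are illustrative assumptions, not the workshop's actual layout:

```shell
# Create an assumed project skeleton: raw data, code, and derived results
mkdir -p myproject/data myproject/scripts myproject/results
cd myproject

git init -q                      # systematic version control from day one
echo "results/" > .gitignore     # track code and raw data; regenerate derived outputs

git add .gitignore
# -c flags avoid relying on a global Git identity in this sketch
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial project skeleton"
```

From here, tagging a release and depositing it in an archive (e.g., Zenodo via GitHub) makes the work citable.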
A useful data management reference
Replicable research != Reproducible research
Replicable means doing the same research with new data. Full replication isn't always feasible: resources for gathering new data may be limited, or the original research may already have sampled the universe of cases. Reproducible means there is sufficient information for others to use the same procedures, code, and data to arrive at the same findings.
"Quantitative data science is the creation of a computer programme to gather, transform, analyse data and present the results.
Literate programming paradigm: the human-readable presentation of a programme is interspersed with computer source code, and the two are compiled together (see Knuth 1992)."
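As a sketch, a literate-programming document in R Markdown (the Rmd format used in this workshop) interleaves prose with executable code chunks; the title and values below are illustrative assumptions:

````
---
title: "Example analysis"
output: html_document
---

The mean of our (hypothetical) measurements is computed by the chunk below,
and its result appears in the rendered document alongside this prose.

```{r}
x <- c(1.2, 3.4, 5.6)  # assumed example data, not from the workshop
mean(x)
```
````

Rendering the file (e.g., with `rmarkdown::render()`) compiles prose and code output together into a single report.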
For the rest of the workshop, see the slides and Rmd files.