Update: 2020-04-15
Guidelines:
- A Quick Guide to Organizing Computational Biology Projects
- Good enough practices in scientific computing
- How to share data with a statistician.
- Data Organization in Spreadsheets
.
├── data
├── doc
├── README.md
├── result
└── src
To create this bare-bones structure, you can put the `mkproject` function in your `.bashrc`. Each time you make a new project directory, you can simply run `mkproject projectname`.
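# Create a bare-bones project skeleton: mkproject projectname
mkproject() {
  # Quote "$1" so project names containing spaces still work.
  mkdir -p "$1"/{data,src,result,doc}
  touch "$1/README.md"
}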
Directories/folders are in bold.
Some suggestions:
- Writing a description at the beginning of each script, and clear comments throughout the script, so it is easier for someone else to find the relevant parts and understand the logic.
- Providing an example for input and output files (small ones for testing), as well as an example for a command.
- Providing a `README` file for each project, describing the purpose of the project, data origin, and analysis methods.
- When the number of files grows, please use sub-directories to organize them.
- **ProjectAwesome**
  - **data**
    - **raw_fastq**
    - **trim_fastq**
    - **aligned_bam**
    - ...
  - **src**
    - **functions**
    - **model_simplification**
    - 01_data_prep.sh
    - 02_adapter_trimming.slurm
    - 03_STAR_alignment.slurm
    - 04_DEG_analysis.R
    - 05_analysis.py
    - ...
  - **result**
    - **featureCount**
      - featureCount_output.tsv
      - featureCount.log
    - **DEG_limma**
    - **DEG_limma_20200408**
    - **DEG_DESeq2**
    - ...
  - **doc**
    - experimental_design.md
    - sample_description.md
    - DEG_summary.md
    - paper_in_progress.docx
  - README.md
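The layout and the README suggestion above can be sketched as a small helper. This is only an illustration: the name `mkproject_full`, the choice of sub-directories to pre-create, and the README headings are assumptions, not part of the original `mkproject` function. The header comment also shows one way to write the description, usage, and example command suggested above.

```shell
#!/usr/bin/env bash
# Description: create a project skeleton with some of the sub-directories
#              from the example layout and a README.md stub.
# Usage:       mkproject_full <project_dir>
# Example:     mkproject_full ProjectAwesome
# Note:        mkproject_full and the README headings are hypothetical.

mkproject_full() {
    local project_dir="$1"
    if [ -z "$project_dir" ]; then
        echo "Usage: mkproject_full <project_dir>" >&2
        return 1
    fi

    # Sub-directory names are taken from the example layout above.
    for sub in data/raw_fastq data/trim_fastq data/aligned_bam \
               src/functions result/featureCount doc; do
        mkdir -p "$project_dir/$sub"
    done

    # Write the README stub only if none exists, so reruns never overwrite notes.
    if [ ! -e "$project_dir/README.md" ]; then
        cat > "$project_dir/README.md" <<'EOF'
# Project title

## Purpose

## Data origin

## Analysis methods
EOF
    fi
}

# Try it in a throwaway directory.
demo_root=$(mktemp -d)
mkproject_full "$demo_root/ProjectAwesome"
ls "$demo_root/ProjectAwesome"
```

Because `mkdir -p` and the README check are both safe to repeat, the helper can be rerun on an existing project without touching files that are already there.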
Git version-controls your code, and GitHub hosts your code and version history in the cloud (for free!).
Tutorial: Happy Git and GitHub for the useR
For Git and GitHub, I usually don't sync the `data` and `result` directories. You can ignore those two directories in the `.gitignore` file.
# Ignore directories.
data/
result/
You can also create `.gitignore` files in sub-directories. For example, in the `src` directory, I ignore `.snakemake` and `log` files. Users should decide what files are essential for reproducibility and worth version control.
# Ignore snakemake log files
.snakemake/
log/
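To check that ignore rules like these behave as intended, `git check-ignore` can be run against a few paths. The sketch below builds a throwaway repository with the two `.gitignore` files described above; it assumes `git` is installed, and the file names (`sample.fastq`, `run.log`) are made up for illustration.

```shell
# Throwaway repository; the file names below are hypothetical.
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/data" "$repo/result" "$repo/src/.snakemake" "$repo/src/log"
touch "$repo/data/sample.fastq" "$repo/src/log/run.log" "$repo/src/01_data_prep.sh"

# Same rules as in the text: one top-level file, one inside src/.
printf '%s\n' 'data/' 'result/' > "$repo/.gitignore"
printf '%s\n' '.snakemake/' 'log/' > "$repo/src/.gitignore"

# check-ignore exits 0 when a path is ignored and 1 when it is not.
is_ignored() {
    if git -C "$repo" check-ignore -q "$1"; then
        echo yes
    else
        echo no
    fi
}

data_ignored=$(is_ignored data/sample.fastq)
log_ignored=$(is_ignored src/log/run.log)
script_ignored=$(is_ignored src/01_data_prep.sh)
echo "data: $data_ignored, log: $log_ignored, script: $script_ignored"
```

In a real project, `git check-ignore -v <path>` additionally reports which `.gitignore` file and line matched, which helps when rules are spread over several sub-directories.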
For R projects, the Project-oriented workflow (please read) is highly recommended. You can easily set it up in RStudio. If you don't use RStudio, you can still adopt the idea. Manny has a video tutorial.
For Python, Python Best Practices – The only guide to become Python Expert is a good starting point. See also 30 Python Best Practices, Tips, And Tricks for ideas and for adding some fancy stuff very easily. For tips on maintaining your code over time, see Best Practices for Managing Your Code Library.
- Do your analysis in your `$SCRATCH` directory (5TB/user). All inactive files older than 60 days will be removed. Remember to back up your files in `$ARCHIVE` and/or Google Drive (using rclone). NYU has unlimited Google Drive storage space.
- When working with a large quantity of small files and intense I/O, `$BEEGFS` is recommended (2TB/user).
- Large genome sequence, index, and annotation files can be stored in `/scratch/cgsb/coruzzi` for easy access. We currently have a large collection of plant genomes (Gil collected for the BigPlant genomes analysis), and transcriptomes that we sequenced (from both the EvoNet and the Gymnosperms projects). This is a huge resource for comparative genomics and evolutionary analyses. Gil keeps a record for each genome: which version was downloaded, from which database (with a link), and a PDF copy of the publication.
In addition, we currently have two grants for sequencing plant genomes:
- The new Zegar grant - Sequencing 8 grass genomes (from the Aristida genus). Project that deals with the evolution of C3->C4 photosynthesis, annuality <-> perenniality transitions, and drought adaptation.
- The Living Fossils project - Sequencing 5 huge gymnosperm genomes. Project that deals with genomic characteristics of diverging vs. non-diverging gymnosperm lineages.
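Returning to the scratch purge policy and the rclone backup advice above, the following is a rough sketch of how one might spot files at risk and prepare a backup command. Several things here are assumptions: that "inactive" is judged by access time, the 50-day default threshold, the helper names, the remote name `gdrive` (which would have to be set up beforehand with `rclone config`), and the `backup/` destination folder. The demo uses GNU `touch -d`, as found on Linux clusters.

```shell
# List files not accessed for more than N days (default 50) under a directory,
# i.e. candidates to back up before a 60-day purge. Using access time (-atime)
# is an assumption about how the cluster defines "inactive".
list_stale_files() {
    find "$1" -type f -atime +"${2:-50}"
}

# Print (do not run) an rclone command that copies a directory to a remote.
# "gdrive" and the "backup/" folder are hypothetical names.
backup_cmd() {
    echo "rclone copy --dry-run $1 ${2:-gdrive}:backup/$(basename "$1")"
}

# Demo in a throwaway directory: one fresh file, one with an old access time.
demo=$(mktemp -d)
touch "$demo/new.txt" "$demo/old.txt"
touch -a -d '70 days ago' "$demo/old.txt"

stale=$(list_stale_files "$demo")
cmd=$(backup_cmd "$demo")
echo "stale files: $stale"
echo "suggested command: $cmd"
```

Printing the command with `--dry-run` first is a deliberately cautious choice: it lets you inspect what would be transferred before removing the flag and running the copy for real.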
- Always store your data in the tidy format. The key ideas are:
  - Each variable forms a column.
  - Each observation forms a row.
  - Each type of observational unit forms a table.
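As a small illustration of the tidy rules above, the sketch below reshapes a "wide" count table (one column per sample) into a tidy one (one row per gene and sample pair) using `awk`. The gene names, sample names, and counts are toy values invented for the example.

```shell
# Hypothetical wide table: one row per gene, one column per sample.
wide_tsv=$(mktemp)
printf 'gene\tsampleA\tsampleB\n' > "$wide_tsv"
printf 'gene1\t10\t12\n' >> "$wide_tsv"
printf 'gene2\t0\t5\n' >> "$wide_tsv"

# Reshape to tidy form: each variable (gene, sample, count) is a column,
# and each observation (one count for one gene in one sample) is a row.
tidy_tsv=$(awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 {
        # Remember the sample names from the header row.
        for (i = 2; i <= NF; i++) sample[i] = $i
        print "gene", "sample", "count"
        next
    }
    {
        for (i = 2; i <= NF; i++) print $1, sample[i], $i
    }' "$wide_tsv")

printf '%s\n' "$tidy_tsv"
rm -f "$wide_tsv"
```

The same reshaping is usually done with dedicated tools in R or Python; the `awk` version is only meant to make the column and row rules concrete.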
Ji Huang(@timedreamer), Gil Eshel(@GilEshel)