Project Template

A moderately opinionated file structure template for computational research projects

Background

Reproducibility and file organization are long-running topics of discussion across computational research communities (see References). However, a one-size-fits-all standard remains hard to implement, because research projects come in all forms and sizes and evolve continuously. The following template offers general recommendations for quick-starting a typical computational research project, while leaving room to add, remove, and edit its parts as needed.

This template is designed to be:

  • Consistent – follows a defined structure
  • Simple – easy to start and navigate
  • Scalable – can be used for small or big projects of many kinds
  • Portable – enables synchronization across various computing platforms

This template is not intended as a rigid set of rules, but as a starting point to build upon.

Aims & Objectives

The main aim of this project template is to allow quick and smooth onboarding / handover for a new person.

Please keep this in mind when making additions / changes to the initial template. When in doubt, document what you did (write a README file, comment your code, etc.).

Directory structure

The default project structure is outlined below:

<project_name>
├─ admin
├─ figures
├─ job_logs
├─ README.md
├─ resources
├─ results
├─ scripts
├─ tables
└─ workflow
    ├─ rules
    │  ├─ module1.smk
    │  └─ module2.smk
    ├─ envs
    │  ├─ tool1.yaml
    │  └─ tool2.yaml
    ├─ snakescripts
    │  ├─ script1.py
    │  └─ script2.R
    ├─ notebooks
    │  ├─ notebook1.py.ipynb
    │  └─ notebook2.r.ipynb
    ├─ report
    │  ├─ plot1.rst
    │  └─ plot2.rst
    └─ snakefile

Note

  • admin - Admin documents, e.g. meeting notes, applications, ethical approvals, MTA

  • resources - Read-only data files / external software used as input for analysis and results

  • scripts - Ad hoc analysis scripts

  • results - Large results / intermediate data files

  • figures - Figures from analysis

  • tables - Tables from analysis

  • writing - Analysis write-ups; subfolders can be created for early analysis drafts and, later, for manuscript drafts and final versions ready for submission to specific journals (this can also include reviewer comments and replies)
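
As a minimal sketch, the layout in the tree above could also be created by hand with a few POSIX shell commands. The project name my_project is a placeholder, not something the template prescribes. Only the folders and the two top-level files shown in the tree are created; other folders (for example writing) can be added the same way.

```shell
# Sketch: create the template layout by hand.
# "my_project" is a placeholder name (an assumption for illustration).
set -eu

project="my_project"

# Top-level folders and workflow subfolders, as listed in the tree
for d in admin figures job_logs resources results scripts tables \
         workflow/rules workflow/envs workflow/snakescripts \
         workflow/notebooks workflow/report; do
    mkdir -p "$project/$d"
done

# Empty placeholder files so the documented entry points exist
touch "$project/README.md" "$project/workflow/snakefile"
```

Running this twice is harmless, since mkdir -p and touch leave existing folders and files in place.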

Quick start with cookiecutter

This git repository contains a cookiecutter template that can be used to automate the creation of a project with the structure above, if preferred.

Pre-requisites

  1. Python 3
  2. cookiecutter
  3. git
  4. GitHub account
  5. ssh
  6. [Optional] sshfs
  7. [Optional] rclone
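
One way to check the list above before starting is a short shell loop that reports which command-line tools are on the PATH. This is only a sketch: the helper name check_tool is made up for illustration, and the GitHub account cannot be checked this way.

```shell
# Sketch: report which prerequisites are available on PATH.
# check_tool is a hypothetical helper, not part of the template.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in python3 cookiecutter git ssh; do
    check_tool "$tool" || missing=$((missing + 1))
done

# sshfs and rclone are optional, so they are reported but not counted
for tool in sshfs rclone; do
    check_tool "$tool" || true
done

echo "required tools missing: $missing"
```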

Initial set up

  1. Prepare the prerequisite software and accounts above. If installing cookiecutter on Myriad causes issues, try the following steps:
    1. SSH into Myriad
    2. module load python3/3.8 - needed because the default Python is 2.7.9 (check with python --version)
    3. python3 -m ensurepip --upgrade - optional; if pip isn't working, this bootstraps pip into the Python installation
    4. pip install cookiecutter - installs cookiecutter

SSH Method

  1. Set up an SSH key following the instructions found here: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent. Note: if you are on Myriad, this will require a different SSH key from the one on your local machine.

  2. Add this SSH key to your GitHub account.

  3. Run cookiecutter, pointing it to the project template git repo

    cookiecutter git@github.com:ihi-comp-med/project-template.git

  4. When prompted, enter the project title, project directory name, GitHub username, GitHub repository name (make sure the name is available), and GitHub personal access token. Leave a field blank to use its default value (shown in square brackets).

HTTPS Method

Issues may arise with this method, as GitHub doesn't seem to recognise when a personal access token is used.

  1. Generate a new GitHub personal access token: fill in the Note field, tick the repo box under Select scopes, and copy the generated token

  2. Open a command line interface (e.g. Terminal on macOS)

  3. Change directory to the parent project directory

    cd my_directory

    NOTE: if using Google Backup & Sync, this directory should be located inside the local copy of Google Drive

  4. Run cookiecutter, pointing it to the project template git repo

    cookiecutter https://github.com/Hermes-consortium/project-template.git

  5. When prompted, enter the project title, project directory name, GitHub username, and GitHub repository name (make sure the name is available). Leave a field blank to use its default value (shown in square brackets).

Sync to Google Drive

Using Google Backup & Sync

  1. Create a local copy of Google Drive with Google Backup & Sync
  2. Follow the steps above to set up a local project directory with cookiecutter
  3. Choose what to sync (the default is to sync everything)

Using rclone

  1. Open a command line interface (e.g. Terminal on macOS)

  2. Set up a new rclone remote for Google Drive

  3. Follow the steps above to set up a local project directory with cookiecutter

  4. Sync the new local project directory to Google Drive

    • Sync everything

      cd my_project_local
      rclone sync . my_GDrive:my_project_GDrive --create-empty-src-dirs -u
      
    • Selective sync with --filter-from flag

      cd my_project_local
      rclone sync . my_GDrive:my_project_GDrive --create-empty-src-dirs \
          -u --filter-from .rclone-filter
      
  5. Subsequent sync from/to Google Drive

    • Sync from Google Drive

      rclone copy my_GDrive:my_project_GDrive my_project_local \
          -u --filter-from .rclone-filter

    • Sync to Google Drive

      rclone copy my_project_local my_GDrive:my_project_GDrive \
          -u --filter-from .rclone-filter
    

NOTE:

  • -u (--update): skip files that are newer on the destination
  • .rclone-filter is an arbitrarily named hidden file that passes filtering rules to the --filter-from argument. Think of it as .gitignore for rclone copy
  • rclone copy can be replaced with rclone sync to make the destination identical to the source. HOWEVER, rclone sync overwrites destination files and deletes those that are missing from the source, so please proceed with caution.
  • Tip: add the -n or --dry-run flag before syncing to check which files would be copied / replaced.
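
To make the filter file concrete, here is a sketch that writes an example .rclone-filter in the current directory. The excluded folders are illustrative assumptions only, not the template's recommended set; adjust them to whatever should stay off Google Drive. In a filter file, a line starting with "- " excludes matching paths, a line starting with "+ " includes them, and paths that match no rule are included.

```shell
# Sketch: write an example filter file.
# The patterns below are assumptions chosen for illustration.
cat > .rclone-filter <<'EOF'
# Skip version control data and bulky outputs
- .git/**
- job_logs/**
- results/**
EOF

# Preview a transfer without changing anything (requires rclone and the
# my_GDrive remote used above), then rerun without --dry-run:
# rclone copy . my_GDrive:my_project_GDrive -u --filter-from .rclone-filter --dry-run
```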

<To-do: Google Backup & Sync vs. rclone>

<Note on date / chronological subfolders>

Computing Platforms

This project template utilises the following platforms:

Which folders live in which platforms?

Platforms (columns): local, compute, drive, code, storage
Folders (rows): .git, admin, data, scripts, exploratory, results, writing

How to transfer / sync files across platforms?

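|         | local                | compute     | drive                                        | code                                         | storage     |
|---------|----------------------|-------------|----------------------------------------------|----------------------------------------------|-------------|
| local   |                      | ssh         | rclone, GBS/OneDrive                         | git                                          | ssh, rclone |
| compute | ssh, rclone          |             | rclone                                       | git                                          | ssh, rclone |
| drive   | rclone, GBS/OneDrive | rclone      |                                              | rclone (via local), GBS/OneDrive (via local) | rclone      |
| code    | git                  | git         | rclone (via local), GBS/OneDrive (via local) |                                              | git         |
| storage | ssh, rclone          | ssh, rclone | rclone                                       | git                                          |             |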

Note

References

File organization

Coding style

Other template