Programming-related notes

License: MIT. PRs welcome.

Programming learning and data analysis resources. Please contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.

Table of contents

Awesome

Cheatsheets

Command line

Courses

Tools

  • fd - A simple, fast and user-friendly alternative to 'find'. Ignores hidden files/folders by default.

  • ripgrep - recursively searches directories for a regex pattern while respecting your gitignore and, by default, skipping binary files and hidden files/directories. Very fast.

Code best practices

Docker

  • Rockerverse - Docker/containerization and R. Review of packages and applications for working with R in containers. Links to packages, examples of applications (Bioconductor, Data Science), deployment of R containers on the cloud.
    Paper Nüst, Daniel, Dirk Eddelbuettel, Dom Bennett, Robrecht Cannoodt, Dav Clark, Gergely Daroczi, Mark Edmondson, et al. “The Rockerverse: Packages and Applications for Containerization with R.” ArXiv:2001.10641 [Cs], January 28, 2020. http://arxiv.org/abs/2001.10641.

Kubernetes

Cloud

  • awesome-cloudrun - A curated list of resources about all things Cloud Run

  • cloud-run-faq - Unofficial FAQ and everything you've been wondering about Google Cloud Run.

  • CloudBank - NSF-funded cloud computing for education, training, and allocation for cloud computing resources.

  • serverless-architecture - 'Serverless Architecture' course at LinkedIn Learning, by Lynn Langit

  • SkyPilot - a framework for easily running machine learning workloads on any cloud through a unified interface.

    Paper Yang, Zongheng, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, et al. “SkyPilot: An Intercloud Broker for Sky Computing,” n.d.

  • The Open Science Grid - A national, distributed computing partnership for data-intensive research.

  • The Cancer Genomics Cloud (CGC) - scientific cloud computing by Seven Bridges. Contains many public datasets (TCGA, CCLE, etc.), controlled access supported. Uses AWS. Pipelines are packaged with Docker. Execution instructions are described using Common Workflow Language (CWL).

    Paper Lau, Jessica W., Erik Lehnert, Anurag Sethi, Raunaq Malhotra, Gaurav Kaushik, Zeynep Onder, Nick Groves-Kirkby, et al. “The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized—A New Paradigm in Large-Scale Computational Research.” Cancer Research 77, no. 21 (November 1, 2017): e3–6. https://doi.org/10.1158/0008-5472.CAN-17-0387.

GCP

AWS

Git

Text

Text mining

SQL

Workflows

  • GenPipes - Python pipeline framework for multi-step workflows. Over 12 pipelines for RNA sequencing, chromatin immunoprecipitation sequencing, DNA sequencing, methylation sequencing, Hi-C, capture Hi-C, metagenomics, and Pacific Biosciences long-read assembly. Can be run via Docker. Creates executable scripts for the PBS, SLURM, Batch, and Daemon job schedulers. How to run: <pipeline>.py -c myConfigurationFile -r myReadSetFile -s 1-X > Commands.txt && bash Commands.txt, where <pipeline> is any of the available pipelines and X is the desired step number. Commands.txt contains the commands that the system will execute. Input: FASTQ or BAM files.
    Paper Bourgey, Mathieu, Rola Dali, Robert Eveleigh, Kuang Chung Chen, Louis Letourneau, Joel Fillon, Marc Michaud, et al. “GenPipes: An Open-Source Framework for Distributed and Scalable Genomic Analyses.” GigaScience 8, no. 6 (June 1, 2019): giz037. https://doi.org/10.1093/gigascience/giz037.
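
The GenPipes invocation pattern above can be assembled programmatically. The following is a minimal sketch, not part of GenPipes itself: the helper name genpipes_commands is hypothetical, "rnaseq" and the step number 5 are placeholder values, and it assumes the -s argument is a step range written as 1-X. It only builds the command strings and does not run anything.

```python
import shlex


def genpipes_commands(pipeline, config, readset, last_step, out="Commands.txt"):
    """Build the two shell commands from the usage pattern in the text.

    Hypothetical helper: the first command writes the generated commands
    to `out`, the second one executes that file with bash.
    """
    if last_step < 1:
        raise ValueError("last_step must be >= 1")
    # Assumption: steps are passed as a range "1-X".
    generate = [f"{pipeline}.py", "-c", config, "-r", readset, "-s", f"1-{last_step}"]
    return f"{shlex.join(generate)} > {shlex.quote(out)}", f"bash {shlex.quote(out)}"


# Placeholder pipeline name and step number, file names as in the text.
gen, run = genpipes_commands("rnaseq", "myConfigurationFile", "myReadSetFile", 5)
print(gen)
print(run)
```

Keeping generation and execution as separate commands mirrors the pattern in the text and leaves room to inspect Commands.txt before running it.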

Makefiles

Snakemake

WDL

Nextflow

  • nf-core - community-curated guidelines for building pipelines with the Nextflow framework. Software is bundled with pipelines using Conda, Docker/Singularity, and the Bioconda, Conda-forge, and BioContainers repositories. Requirements for pipelines: software bundles (yaml environment built into a Docker container), continuous integration, common structure, documentation, simplicity. Extension tools: Flowcraft - a Nextflow pipeline assembler for genomics; Pipeliner - a flexible Nextflow-based framework for the definition of sequencing data processing pipelines. Similar concept: Snakemake-workflows. nf-core tools are available on Bioconda and PyPI. Available pipelines.
    Paper Ewels, Philip A., Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso, and Sven Nahnsen. “The Nf-Core Framework for Community-Curated Bioinformatics Pipelines.” Nature Biotechnology, February 13, 2020. https://doi.org/10.1038/s41587-020-0439-x.

Web

Miscellaneous

Other programming languages

See the R_notes and Python_notes repositories for those languages.