TJT edits

I found this online skills test and forked it for a project idea. I've been wanting to explore some other programming languages. Inspired by this repo, I was thinking it might be interesting to try executing the same bioinformatics task in a few different languages, just to give myself a framework for learning and comparison.

Might be a little while until I really have time to get to this. For now, some ideas:

  1. Will start off in R, my "native" language, but might try to implement a couple different ways within R. Could code up a solution based on my current patterns and usage, but I'd also like to try some new packages and methods (in particular, play more with futures and furrr).
  2. Python- Also another language I've already worked with and written scripts for processing of genomic data in delimited text files. But again, would be good to try out some new methods.
  3. Julia- Up-and-coming data science language, supposed to have the dynamism of an R/Python with the speed of a low-level language like C++.
  4. Go or Rust (or both?)- Both popular newer languages. Go seems to have a lot of interesting stuff with parallelism and routines (maybe this relatively small task isn't the best for investigating that?). And Rust is super popular as a low-level language, and I haven't done a lot with that type of language. Might be interesting. Not sure how much community support either language has for common bio file formats.
  5. Others?- Maybe Haskell, if I really want to explore other programming concepts around FP?

skills_test_2

This repository stores input files, example output files, and description of the problem which is designed to assist in the evaluation of computational skills of job candidates and prospective MSc/Ph.D. students.

Problem description

Given a list of genetic variants in compressed Variant Call Format (VCF) files (one file per chromosome) and gene coordinates in compressed GENCODE GTF file, develop a command-line tool which for each variant finds:

  1. List of overlapping genes
  2. List of genes within +/-200000 base pairs from the variant
  3. Nearest gene

Input data and formats

  • Genetic variants are saved in compressed VCF files with the .vcf.gz suffix in the input/ directory. The description of the VCF format can be found at https://samtools.github.io/hts-specs/. If you will find it useful for your solution, the corresponding index files for fast random access (.tbi suffix) are located in the same directory.
  • Gene coordinates are saved in a compressed GENCODE GTF file input/gencode.v38.annotation.gtf.gz. The description of the GENCODE GTF format can be found at https://www.gencodegenes.org/pages/data_format.html.

Output format

The command-line tool must write results to the compressed VCF file. Specifically, for each genetic variant the following key-value pairs must be written into the INFO field:

  1. GENES_IN - comma-separated list of identifiers (gene_id key from GENCODE GTF) of overlapping genes. If there are no overlapping genes, the empty value must be denoted with . symbol i.e. GENES_IN=..
  2. GENES_200KB - comma-separated list of identifiers (gene_id key from GENCODE GTF) of genes which are within +/-200000 base pairs from the variant. If there are no such genes, the empty value must be denoted with . symbol i.e. GENES_200KB=.. GENES_200KB is a superset of GENES_IN.
  3. GENE_NEAREST - identifier (gene_id key from GENCODE GTF) of the nearest gene. If GENE_IN is not empty, then GENE_NEAREST must be empty. Empty value must be denoted with . symbol i.e. GENE_NEAREST=.. An example of the output file can be found in output/example.vcf.gz.

Requirements

To implement this command-line tool you may:

  • use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
  • use any open-source library (e.g. for VCF reading/writing, relevant data structures and etc)

Solution

Please, send us a single compressed archive which includes:

  1. README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) any compilation instructions if such exists (d) detailed step-by-step description on how to run the tool.
  2. Source of your scripts/code. Please include detailed comments in your source code. This will help us better understand your code.

Don't send VCF output files!

Evaluation

The following will be evaluated:

  1. The tool can be easily installed and run.
  2. The tool uses data structures and algorithms adequate for the problem i.e. ensure reasonable running time and memory usage.
  3. All requested INFO fields are present in the final output VCF files.
  4. Values in the requested INFO fields in the final output VCF files are correct.