I found this online skills test and forked it for a project idea. I've been wanting to explore some other programming languages. Inspired by this repo, I was thinking it might be interesting to try executing the same bioinformatics task in a few different languages, just to give myself a framework for learning and comparison.
Might be a little while until I really have time to get to this. For now, some ideas:
- Will start off in R, my "native" language, but might try implementing it a couple of different ways within R. I could code up a solution based on my current patterns and usage, but I'd also like to try some new packages and methods (in particular, play more with `future` and `furrr`).
- Python: Another language I've already worked with; I've written scripts in it for processing genomic data in delimited text files. But again, it would be good to try out some new methods.
- Julia: Up-and-coming data science language, supposed to have the dynamism of R/Python with the speed of a low-level language like C++.
- Go or Rust (or both?): Both popular newer languages. Go seems to have a lot of interesting stuff around parallelism and goroutines (maybe this relatively small task isn't the best for investigating that?). Rust is super popular as a low-level language, and I haven't done a lot with that type of language. Might be interesting. Not sure how much community support either language has for common bio file formats.
- Others?: Maybe Haskell, if I really want to explore other programming concepts around FP?
This repository stores input files, example output files, and a description of a problem designed to assist in evaluating the computational skills of job candidates and prospective MSc/Ph.D. students.
Given a list of genetic variants in compressed Variant Call Format (VCF) files (one file per chromosome) and gene coordinates in a compressed GENCODE GTF file, develop a command-line tool which, for each variant, finds:
- List of overlapping genes
- List of genes within +/-200000 base pairs from the variant
- Nearest gene
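To make the three queries concrete, here is a minimal brute-force sketch in Python. The `Gene` tuple, the `distance` and `annotate` names, and the interpretation of the task are my own assumptions: coordinates are treated as 1-based and inclusive, a variant is reduced to a single position, and "within +/-200000 base pairs" is read as "any part of the gene lies within that distance of the position". The output specification further down requires the nearest gene to be empty when the variant overlaps a gene, so the sketch returns `None` in that case.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    gene_id: str
    start: int  # 1-based, inclusive (assumed)
    end: int    # 1-based, inclusive (assumed)


def distance(pos: int, gene: Gene) -> int:
    """Gap in base pairs between a position and a gene; 0 if the position is inside it."""
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def annotate(
    pos: int, genes: List[Gene], window: int = 200_000
) -> Tuple[List[str], List[str], Optional[str]]:
    """Return (overlapping gene ids, gene ids within the window, nearest gene id or None)."""
    genes_in = [g.gene_id for g in genes if distance(pos, g) == 0]
    genes_window = [g.gene_id for g in genes if distance(pos, g) <= window]
    nearest = None
    if not genes_in and genes:
        # Only report a nearest gene when nothing overlaps the variant.
        nearest = min(genes, key=lambda g: distance(pos, g)).gene_id
    return genes_in, genes_window, nearest
```

This scans every gene for every variant, so it is only meant to pin down the definitions, not to meet the running-time criterion.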
- Genetic variants are saved in compressed VCF files with the `.vcf.gz` suffix in the `input/` directory. The description of the VCF format can be found at https://samtools.github.io/hts-specs/. If you find it useful for your solution, the corresponding index files for fast random access (`.tbi` suffix) are located in the same directory.
- Gene coordinates are saved in a compressed GENCODE GTF file, `input/gencode.v38.annotation.gtf.gz`. The description of the GENCODE GTF format can be found at https://www.gencodegenes.org/pages/data_format.html.
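As a starting point for reading the gene coordinates, the sketch below pulls `gene` records out of a gzipped GTF using only the Python standard library. The function names are hypothetical, and it assumes the usual nine tab-separated GTF columns with `gene_id "..."` in the attribute column. Chromosome names are kept exactly as they appear in the GTF; whether they match the naming in the VCF files is not checked here.

```python
import gzip
import re
from typing import Dict, List, Optional, Tuple

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def parse_gene_line(line: str) -> Optional[Tuple[str, int, int, str]]:
    """Return (chrom, start, end, gene_id) for a 'gene' record, otherwise None."""
    if line.startswith("#"):
        return None  # header/comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None  # skip transcripts, exons, and malformed lines
    match = GENE_ID_RE.search(fields[8])
    if match is None:
        return None
    return fields[0], int(fields[3]), int(fields[4]), match.group(1)


def read_genes(path: str) -> Dict[str, List[Tuple[int, int, str]]]:
    """Map each chromosome to its genes as (start, end, gene_id), sorted by start."""
    genes: Dict[str, List[Tuple[int, int, str]]] = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            record = parse_gene_line(line)
            if record is not None:
                chrom, start, end, gene_id = record
                genes.setdefault(chrom, []).append((start, end, gene_id))
    for intervals in genes.values():
        intervals.sort()
    return genes
```

Calling `read_genes("input/gencode.v38.annotation.gtf.gz")` would then give one sorted list of genes per chromosome.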
The command-line tool must write results to the compressed VCF file. Specifically, for each genetic variant the following key-value pairs must be written into the `INFO` field:
- `GENES_IN` - comma-separated list of identifiers (`gene_id` key from GENCODE GTF) of overlapping genes. If there are no overlapping genes, the empty value must be denoted with the `.` symbol, i.e. `GENES_IN=.`.
- `GENES_200KB` - comma-separated list of identifiers (`gene_id` key from GENCODE GTF) of genes which are within +/-200000 base pairs from the variant. If there are no such genes, the empty value must be denoted with the `.` symbol, i.e. `GENES_200KB=.`. `GENES_200KB` is a superset of `GENES_IN`.
- `GENE_NEAREST` - identifier (`gene_id` key from GENCODE GTF) of the nearest gene. If `GENES_IN` is not empty, then `GENE_NEAREST` must be empty. The empty value must be denoted with the `.` symbol, i.e. `GENE_NEAREST=.`.
An example of the output file can be found in `output/example.vcf.gz`.
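The rules above for the three keys can be sketched as a small formatting helper. The names `info_value` and `annotate_info` are hypothetical, and appending the new keys after any existing `INFO` content (separated by `;`) is my assumption about how to preserve the input annotations.

```python
from typing import List, Optional


def info_value(items: List[str]) -> str:
    """Comma-join identifiers, or '.' when the list is empty."""
    return ",".join(items) if items else "."


def annotate_info(
    info: str, genes_in: List[str], genes_200kb: List[str], nearest: Optional[str]
) -> str:
    """Append GENES_IN, GENES_200KB and GENE_NEAREST to an existing INFO string."""
    # GENE_NEAREST must be empty ('.') whenever the variant overlaps a gene.
    nearest_value = "." if genes_in or nearest is None else nearest
    added = ";".join(
        [
            f"GENES_IN={info_value(genes_in)}",
            f"GENES_200KB={info_value(genes_200kb)}",
            f"GENE_NEAREST={nearest_value}",
        ]
    )
    # An INFO column of '.' means "no annotations", so it is replaced outright.
    return added if info in ("", ".") else f"{info};{added}"
```

A complete tool would presumably also add matching `##INFO` header lines for the three keys, since the VCF format describes `INFO` fields in its meta-information section.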
To implement this command-line tool you may:
- use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
- use any open-source library (e.g. for VCF reading/writing, relevant data structures, etc.)
Please send us a single compressed archive which includes:
- README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) compilation instructions, if any; (d) a detailed step-by-step description of how to run the tool.
- Source of your scripts/code. Please include detailed comments in your source code. This will help us better understand your code.
Don't send VCF output files!
The following will be evaluated:
- The tool can be easily installed and run.
- The tool uses data structures and algorithms adequate for the problem, i.e. ones that ensure reasonable running time and memory usage.
- All requested INFO fields are present in the final output VCF files.
- Values in the requested INFO fields in the final output VCF files are correct.
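On the data-structure criterion, one possible approach (my own choice, not something the task prescribes) is to sort each chromosome's genes by start position and use binary search plus a running maximum of end positions to answer range queries without scanning every gene. The `GeneIndex` class below is a hypothetical sketch of that idea; an interval tree would be another option. It covers the overlap and window queries only, not the nearest-gene lookup.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


class GeneIndex:
    """Range queries over the genes of one chromosome, given as (start, end, gene_id)."""

    def __init__(self, genes: List[Tuple[int, int, str]]):
        self.genes = sorted(genes)
        self.starts = [g[0] for g in self.genes]
        # max_end[i] is the largest end among genes[0..i]; it lets a query stop early.
        self.max_end = list(accumulate((g[1] for g in self.genes), max))

    def overlapping(self, lo: int, hi: int) -> List[str]:
        """Ids of genes intersecting the closed range [lo, hi], ordered by start."""
        found = []
        # Genes starting after hi cannot intersect the range.
        i = bisect_right(self.starts, hi) - 1
        # Walk left until no earlier gene can still reach lo.
        while i >= 0 and self.max_end[i] >= lo:
            if self.genes[i][1] >= lo:
                found.append(self.genes[i][2])
            i -= 1
        found.reverse()
        return found
```

With this, `index.overlapping(pos, pos)` gives the overlapping genes and `index.overlapping(pos - 200000, pos + 200000)` gives the genes in the window. The backward walk can still degrade to a linear scan when one very long gene spans many shorter ones.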