skills_test_2
This repository stores input files, example output files, and the description of a problem designed to help evaluate the computational skills of job candidates and prospective MSc/Ph.D. students.
Problem description
Given a list of genetic variants in compressed Variant Call Format (VCF) files (one file per chromosome) and gene coordinates in a compressed GENCODE GTF file, develop a command-line tool which, for each variant, finds:
- List of overlapping genes
- List of genes within +/-200000 base pairs from the variant
- Nearest gene
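The three queries above can be illustrated with a brute-force sketch. It is only meant to pin down the semantics, not to be fast. It assumes a single-base variant position, 1-based inclusive gene coordinates as in GTF, and that distance is measured to the nearest gene boundary; the names `Gene`, `distance`, and `annotate` are hypothetical. Following the output rules in this README, the nearest gene is reported only when no gene overlaps the variant.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    gene_id: str
    start: int  # 1-based, inclusive (GTF convention)
    end: int    # 1-based, inclusive


def distance(gene: Gene, pos: int) -> int:
    """Distance in base pairs from a position to a gene; 0 means overlap."""
    if pos < gene.start:
        return gene.start - pos
    if pos > gene.end:
        return pos - gene.end
    return 0


def annotate(
    genes: List[Gene], pos: int, window: int = 200_000
) -> Tuple[List[str], List[str], Optional[str]]:
    """Return (overlapping genes, genes within the window, nearest gene).

    Genes are assumed to be on the same chromosome as the variant.
    Ties for the nearest gene are broken by input order (an assumption).
    """
    overlapping = [g.gene_id for g in genes if distance(g, pos) == 0]
    nearby = [g.gene_id for g in genes if distance(g, pos) <= window]
    nearest = None
    if not overlapping and genes:
        nearest = min(genes, key=lambda g: distance(g, pos)).gene_id
    return overlapping, nearby, nearest
```

This scans every gene for every variant, so it is useful mainly as a reference implementation to check a faster solution against.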
Input data and formats
- Genetic variants are saved in compressed VCF files with the .vcf.gz suffix in the input/ directory. The description of the VCF format can be found at https://samtools.github.io/hts-specs/. If you find it useful for your solution, the corresponding index files for fast random access (.tbi suffix) are located in the same directory.
- Gene coordinates are saved in a compressed GENCODE GTF file, input/gencode.v38.annotation.gtf.gz. The description of the GENCODE GTF format can be found at https://www.gencodegenes.org/pages/data_format.html.
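As a starting point for reading the gene coordinates, here is a minimal sketch that uses only the Python standard library. It assumes the usual nine tab-separated GTF columns, keeps only rows whose feature type is `gene`, and pulls `gene_id` out of the attribute column with a regular expression. The function names `parse_gene_line` and `iter_genes` are hypothetical.

```python
import gzip
import re
from typing import Iterator, Optional, Tuple

# Matches the gene_id attribute, e.g. gene_id "ENSG00000223972.5";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

GeneRecord = Tuple[str, int, int, str]  # (chromosome, start, end, gene_id)


def parse_gene_line(line: str) -> Optional[GeneRecord]:
    """Parse one GTF line; return None for comments and non-gene features."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    match = GENE_ID_RE.search(fields[8])
    if match is None:
        return None
    # GTF start and end are 1-based and inclusive.
    return fields[0], int(fields[3]), int(fields[4]), match.group(1)


def iter_genes(path: str) -> Iterator[GeneRecord]:
    """Yield gene records from a gzip-compressed GTF file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            record = parse_gene_line(line)
            if record is not None:
                yield record
```

For example, `iter_genes("input/gencode.v38.annotation.gtf.gz")` would stream the gene records, which can then be grouped by chromosome before annotating variants.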
Output format
The command-line tool must write results to a compressed VCF file. Specifically, for each genetic variant the following key-value pairs must be written into the INFO field:
- GENES_IN: comma-separated list of identifiers (gene_id key from GENCODE GTF) of overlapping genes. If there are no overlapping genes, the empty value must be denoted with the . symbol, i.e. GENES_IN=.
- GENES_200KB: comma-separated list of identifiers (gene_id key from GENCODE GTF) of genes which are within +/-200000 base pairs from the variant. If there are no such genes, the empty value must be denoted with the . symbol, i.e. GENES_200KB=. Note that GENES_200KB is a superset of GENES_IN.
- GENE_NEAREST: identifier (gene_id key from GENCODE GTF) of the nearest gene. If GENES_IN is not empty, then GENE_NEAREST must be empty. The empty value must be denoted with the . symbol, i.e. GENE_NEAREST=.
An example of the output file can be found in output/example.vcf.gz.
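The INFO formatting rules above can be sketched as follows. This is an illustration rather than a required implementation, and the helper names `format_annotation` and `append_info` are hypothetical. It relies on two VCF conventions: INFO entries are separated by semicolons, and a lone `.` stands for an empty INFO column.

```python
from typing import List, Optional


def format_annotation(
    genes_in: List[str], genes_200kb: List[str], gene_nearest: Optional[str]
) -> str:
    """Build the three key-value pairs, using '.' for empty values."""

    def value(ids: List[str]) -> str:
        return ",".join(ids) if ids else "."

    return ";".join(
        [
            "GENES_IN=" + value(genes_in),
            "GENES_200KB=" + value(genes_200kb),
            "GENE_NEAREST=" + (gene_nearest if gene_nearest else "."),
        ]
    )


def append_info(existing: str, addition: str) -> str:
    """Append new entries to an existing INFO column value."""
    # An INFO column holding only '.' is empty, so it is replaced outright.
    if existing in ("", "."):
        return addition
    return existing + ";" + addition
```

A tool that writes these keys would normally also declare them in the VCF header with `##INFO` lines; a VCF library, if one is used, typically handles both the header and the value encoding.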
Requirements
To implement this command-line tool you may:
- use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
- use any open-source library (e.g. for VCF reading/writing, relevant data structures, etc.)
Solution
Please send us a single compressed archive which includes:
- README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) compilation instructions, if any; (d) a detailed step-by-step description of how to run the tool.
- Source of your scripts/code. Please include detailed comments in your source code. This will help us better understand your code.
Don't send VCF output files!
Evaluation
The following will be evaluated:
- The tool can be easily installed and run.
- The tool uses data structures and algorithms adequate for the problem, i.e. it ensures reasonable running time and memory usage.
- All requested INFO fields are present in the final output VCF files.
- Values in the requested INFO fields in the final output VCF files are correct.
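To illustrate what an adequate data structure might look like, here is one possible sketch (an interval tree would be another option). It keeps the genes of one chromosome sorted by start coordinate and uses binary search to narrow the candidates for each variant. The class name `GeneIndex` is hypothetical, and coordinates are assumed to be 1-based and inclusive. The length of the longest gene bounds how far left the scan has to reach.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


class GeneIndex:
    """Sorted-array index over the genes of a single chromosome."""

    def __init__(self, genes: List[Tuple[int, int, str]]):
        # Each gene is (start, end, gene_id); sorting orders by start.
        self.genes = sorted(genes)
        self.starts = [start for start, _, _ in self.genes]
        # Longest gene span; limits how far back a candidate can start.
        self.max_len = max((end - start for start, end, _ in self.genes), default=0)

    def within(self, pos: int, window: int = 0) -> List[str]:
        """Return ids of genes whose span lies within `window` bp of `pos`.

        With window=0 this returns the genes overlapping `pos`.
        """
        # A matching gene satisfies start <= pos + window and
        # end >= pos - window, hence start >= pos - window - max_len.
        lo = bisect_left(self.starts, pos - window - self.max_len)
        hi = bisect_right(self.starts, pos + window)
        return [gid for _, end, gid in self.genes[lo:hi] if end >= pos - window]
```

Each query costs two binary searches plus a scan over the candidate slice, so it avoids comparing every variant against every gene. The slice can still be large when one gene on the chromosome is very long.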