README file for vcf2diploid tool distribution 1. Compilation ============== $ make 2. Running ========== java -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...] where sample_id is the ID of individual whose genome is being constructed (e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and VCF files at a time. Splitting the whole genome in multiple files (e.g., with one FASTA file per chromosome) reduces memory usage. Amount of memory used by Java can be increased as follows java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...] You can try the program by running 'test_run.sh' script in the 'example' directory. See also "Important notes" below. 3. Constructing personal annotation and splice-junction library =============================================================== * Using chain file one can lift over annotation of reference genome to personal haplotpes. This can be done with the liftOver tools (see http://hgdownload.cse.ucsc.edu/admin/exe). For example, to lift over Gencode annotation once can do $ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt * To construct personal splice-junction library(s) for RNAseq analysis one can use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools). Important notes =============== All characters between '>' and first white space in FASTA header are used internally as chromosome/sequence names. For instance, for the header >chr1 human vcf2diploid will upload the corresponding sequence into the memory under the name 'chr1'. Chromosome/sequence names should be consistent between FASTA and VCF files but omission of 'chr' at the beginning is allows, i.e. 'chr1' and '1' are treated as the same name. The output contains (file formats are described below): 1) FASTA files with sequences for each haplotype. 2) CHAIN files relating paternal/maternal haplotype to the reference genome. 3) MAP files with base correspondence between paternal-maternal-reference sequences. File formats: * FASTA -- see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml * CHAIN -- http://genome.ucsc.edu/goldenPath/help/chain.html * MAP file represents block with equivalent bases in all three haplotypes (paternal, maternal and reference) by one record with indices of the first bases in each haplotype. Non-equivalent bases are represented as separate records with '0' for haplotypes having non-equivalent base (see clarification below). Pat Mat Ref MAP format X X X ____ X X X \ X X X --> P1 M1 R1 X X - -------> P4 M4 0 X X - ,---> 0 M6 R4 - X X --' ,-> P6 M7 R5 X X X ----' X X X X X X For question and comments contact: Alexej Abyzov (alexej.abyzov@yale.edu) and Mark Gerstein (mark.gerstein@yale.edu)