Author: Roy Straver
WISExome was developed and tested using Python2.7. Using any other version may cause errors or faulty results.
The list of packages required and tested versions as reported by pip freeze
(versions working as of 2018-06-14):
matplotlib==2.0.1 (2.2.2)
numpy==1.13.0 (1.14.4)
pysam==0.10.0 (0.14.1)
scipy==0.19.0 (1.1.0)
xlwt==1.2.0 (1.3.0)
To start testing for aberrations, WISExome needs to learn how genomic regions behave when compared to each other. To do so, reference data should be prepared in the same way as the test sample data later on, using the same genomic reference, etc.
There are 2 files you need to download from other sources:
captureregions.bed, a file describing where exactly the regions of interest are.
ucscrefseq.bed, a file describing the exact start and stop positions of genes and their exons.
Folders that will be required during this tutorial (create before running):
convert
leno
occout
refdata
refout
refsamples
testout
testsamples
To convert a .bam file to a file for analysis or training, use the consam.py
conversion script:
python consam.py input/samplename.bam captureregions.bed convert/samplename.hits
At this point the data is workable but we can correct for length variations in the target regions:
python lennormalize.py convert captureregions.bed leno
All reference files should be fed into the reference creation scripts. Move the (length normalized) reference samples (part of the leno/samplename.hits
) into a separate folder (say refsamples
), move the test samples into another folder (say testsamples
). Create a folder called refdata and start the reference-finding script for every combination of chromosomes:
mkdir refdata
for TARGET in `seq 22 -1 1`
do
for REF in `seq 1 1 22`
do
if [ $TARGET != $REF ]
then
python prepref.py refsamples/ refdata/$TARGET.$REF.ref chr$TARGET chr$REF
fi
done
done
The previous step was split per target chromosome and reference chromosome, now the results have to be combined to make a final selection of reference regions for every region on every target chromosome:
python takeref.py refdata/ refout/refname
At this point we're just going to act like we're about to test a sample, except we feed the script training samples, use an empty file to fake we have occurrence counts per region and use the -direct
statement to cut it short:
touch emptyfile
for CHROM in `seq 22 -1 1`
do
for SAMPLE in `ls refsamples/*.hits`
do
# Note: This is tricky when your paths are different,
# -d/ -f2 may need to be altered to fit your path
NAME=`echo $SAMPLE | cut -d/ -f2 | cut -d. -f1`
echo $NAME
python test.py \
$SAMPLE \
refout/refname.$CHROM \
occout/$NAME.$CHROM \
$CHROM \
captureregions.bed \
ucscrefseq.bed \
emptyfile \
-direct \
>> occout/all.$CHROM
done
# Now we can add the received information together
cat occout/all.$CHROM \
| grep 'directCallTag' \
| awk '{count[$1]++}END{for(j in count) print j,count[j]}' \
| sort -n > occout/occurrence.$CHROM
done
To test actual test samples we apply nearly the same script as for the occurrence determination, except now we employ the occurrence files we just created, we feed the script test files, and we may add -plotfile
to enable plotting results. The -plotfile
argument is completely optional and can be omitted if speed is preferred.
Move testsamples to a folder called
for CHROM in `seq 22 -1 1`
do
for SAMPLE in `ls testsamples/*.hits`
do
# Note: This is tricky when your paths are different,
# -d/ -f2 may need to be altered to fit your path
NAME=`echo $SAMPLE | cut -d/ -f2 | cut -d. -f1`
echo $NAME
python test.py \
$SAMPLE \
refout/refname.$CHROM \
testout/$NAME.$CHROM \
$CHROM \
captureregions.bed \
ucscrefseq.bed \
occout/occurrence.$CHROM \
-plotfile
done
done
The data stored after testing (testout/$NAME.$CHROM
above) can be used to create a nice Excel-sheet, including OMIM information. Several samples can be added in one go, enabling the script to determine any overlap between them. This will result in a sample per sheet, and every sample gets a column to every other sample with information how much (relatively, 0-1) a call overlaps with a call in another sample. This is useful when trios are sequenced.
Due to licensing details I currently do not provide an API key for OMIM. You can request one here: https://omim.org/api. Fill your key in between the tickmarks in this line:
apiKey='ENTER YOUR KEY HERE'
near the top of theexcel.py
file. Leave the field blank for a report without OMIM data.
python excel.py outfile sampleA sampleB sampleC
Do not use reference data from one lab to test samples from another. Every reference file, laboratory and sequencing machine has its own effect on how read depth per bin behaves. Any results obtained by combining files from different origins are unreliable.