Doud2016 example, error in dms2_bcsubamp
grhuynh opened this issue · 7 comments
I'm trying to run the Doud2016 Jupyter notebook, and ran into 2 issues with the dms2_batch_bcsubamp call under the "Align the deep sequencing data" section.
First, when running within the Jupyter notebook, no error was printed even though dms2_batch_bcsubamp was failing.
Then, I ran the dms2_batch_bcsubamp call directly from the command line, and now I get the following error in the log files for all the samples:
INFO - Read refseq of 1698 codons from notebooks/Doud2016/data/WSN-HA.fasta
ERROR - Terminating dms2_bcsubamp with ERROR
Traceback (most recent call last):
File "/data/anaconda/envs/dms/bin/dms2_bcsubamp", line 130, in main
(refseqstart, refseqend, r1start, r2start) = map(int, s.split(","))
ValueError: invalid literal for int() with base 10: '37 286'
I'm not sure how to interpret this error, since I don't think I made any edits to the jupyter notebook. Any ideas on how to interpret this error? Thanks!
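For anyone who lands here with the same traceback, here is a minimal sketch of what the message means. It reuses only the parsing expression visible in the traceback (`map(int, s.split(","))`); the function name `parse_alignspec` is made up for illustration and is not part of dms_tools2. When two specs arrive fused into one string, splitting on commas leaves a field that spans the space between them, and `int()` rejects it.

```python
def parse_alignspec(s):
    """Parse one alignment spec with the same expression shown in the traceback."""
    refseqstart, refseqend, r1start, r2start = map(int, s.split(","))
    return refseqstart, refseqend, r1start, r2start


# A single well-formed spec parses into four integers.
print(parse_alignspec("1,285,36,37"))

# Two specs passed as ONE string: the fourth comma-separated field is '37 286'.
try:
    parse_alignspec("1,285,36,37 286,570,31,32")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '37 286'
```

So a `'37 286'` in the error suggests that the program received several specs as a single argument rather than one spec per argument.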
What is the full text of the command you typed at the prompt that gave you the second error? Basically, can you provide a minimal working example of what fails, such as a ZIP file with the input / output / bash command.
(dms) gracehuynh@IDRI-ms:/data/home/gracehuynh$ dms2_batch_bcsubamp --batchfile notebooks/Doud2016/results/codoncounts/batch.csv --refseq notebooks/Doud2016/data/WSN-HA.fasta --alignspecs '1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36, 1141,1425,29,33 1426,1698,40,43' --outdir notebooks/Doud2016/results/codoncounts --summaryprefix summary --R1trim 200 --R2trim 170 --fastqdir notebooks/Doud2016/results/FASTQ_files/ --ncpus -1 --use_existing 'yes'
INFO:dms2_batch_bcsubamp:Beginning execution of dms2_batch_bcsubamp in directory /data/home/gracehuynh
INFO:dms2_batch_bcsubamp:Progress is being logged to notebooks/Doud2016/results/codoncounts/summary.log
INFO:dms2_batch_bcsubamp:Version information:
Time and date: Tue Aug 20 21:48:53 2019
Platform: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid
Python version: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
dms_tools2 version: 2.5.0
Bio version: 1.74
HTSeq version: 0.11.2
pandas version: 0.25.0
numpy version: 1.16.4
IPython version: 7.7.0
jupyter version: 1.0.0
matplotlib version: 3.1.1
plotnine version: 0.5.1
natsort version: 6.0.0
pystan version: 2.16.0.0
scipy version: 1.2.2
seaborn version: 0.9.0
phydmslib version: 2.3.3
statsmodels version: 0.10.1
rpy2 cannot be imported
regex version: 2.5.33
umi_tools version: 1.0.0
INFO:dms2_batch_bcsubamp:Parsed the following arguments:
outdir = notebooks/Doud2016/results/codoncounts
ncpus = -1
use_existing = yes
refseq = notebooks/Doud2016/data/WSN-HA.fasta
alignspecs = ['1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36, 1141,1425,29,33 1426,1698,40,43']
bclen = 8
fastqdir = notebooks/Doud2016/results/FASTQ_files/
R2 = None
R1trim = [200]
R2trim = [170]
bclen2 = None
chartype = codon
maxmuts = 4
minq = 15
minreads = 2
minfraccall = 0.95
minconcur = 0.75
sitemask = None
purgeread = 0
purgebc = 0
bcinfo = False
batchfile = notebooks/Doud2016/results/codoncounts/batch.csv
summaryprefix = summary
INFO:dms2_batch_bcsubamp:Parsing sample info from notebooks/Doud2016/results/codoncounts/batch.csv
INFO:dms2_batch_bcsubamp:Read the following sample information:
name,R1
mutDNA-1,mutDNA-1_R1.fastq.gz
mutDNA-2,mutDNA-2_R1.fastq.gz
mutDNA-3,mutDNA-3_R1.fastq.gz
mutvirus-1,mutvirus-1_R1.fastq.gz
mutvirus-2,mutvirus-2_R1.fastq.gz
mutvirus-3,mutvirus-3_R1.fastq.gz
wtDNA,wtDNA_R1.fastq.gz
wtvirus,wtvirus_R1.fastq.gz
INFO:dms2_batch_bcsubamp:Running dms2_bcsubamp on all samples using 4 CPUs...
INFO:dms2_batch_bcsubamp:Completed runs of dms2_bcsubamp.
ERROR:dms2_batch_bcsubamp:Terminating dms2_batch_bcsubamp with ERROR
Traceback (most recent call last):
File "/data/anaconda/envs/dms/bin/dms2_batch_bcsubamp", line 152, in main
'\n'.join(logfiles.values)))
AssertionError: Did not create all these files:
notebooks/Doud2016/results/codoncounts/mutDNA-1_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutDNA-2_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutDNA-3_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-1_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-2_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-3_codoncounts.csv
notebooks/Doud2016/results/codoncounts/wtDNA_codoncounts.csv
notebooks/Doud2016/results/codoncounts/wtvirus_codoncounts.csv
Look in following log files for details of what went wrong:
notebooks/Doud2016/results/codoncounts/mutDNA-1.log
notebooks/Doud2016/results/codoncounts/mutDNA-2.log
notebooks/Doud2016/results/codoncounts/mutDNA-3.log
notebooks/Doud2016/results/codoncounts/mutvirus-1.log
notebooks/Doud2016/results/codoncounts/mutvirus-2.log
notebooks/Doud2016/results/codoncounts/mutvirus-3.log
notebooks/Doud2016/results/codoncounts/wtDNA.log
notebooks/Doud2016/results/codoncounts/wtvirus.log
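With eight per-sample logs to check, a small helper can pull out the first error from each one. This is only a sketch: `first_error_lines` is a hypothetical function, not part of dms_tools2, and it assumes the logs are plain text files ending in `.log` that contain the word `ERROR` on the line where the failure starts, as in the log excerpt earlier in this thread. The demo at the bottom runs against throwaway files so that it works without the real data.

```python
import tempfile
from pathlib import Path


def first_error_lines(log_dir, max_lines=5):
    """Map each *.log file name to the lines starting at its first 'ERROR' line."""
    found = {}
    for log_path in sorted(Path(log_dir).glob("*.log")):
        lines = log_path.read_text(errors="replace").splitlines()
        for i, line in enumerate(lines):
            if "ERROR" in line:
                found[log_path.name] = lines[i:i + max_lines]
                break  # only the first error per file
    return found


# Demo on temporary files standing in for the real per-sample logs.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "wtDNA.log").write_text(
        "INFO - Read refseq\n"
        "ERROR - Terminating dms2_bcsubamp with ERROR\n"
        "Traceback (most recent call last):\n"
    )
    Path(tmp, "ok.log").write_text("INFO - all good\n")
    demo = first_error_lines(tmp)

for name, lines in demo.items():
    print(name)
    for line in lines:
        print("   ", line)
```

For the run above, the directory to pass would be `notebooks/Doud2016/results/codoncounts`.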
I can't troubleshoot without having access to the actual input / output files. Do you want to make a minimal example, such as with just one or two samples and clipped small FASTQ files, and then send that with the exact commands you ran? Without being able to try to reproduce what you are running, I can't determine if it is some bug in the program or just some issue with your installation / computer.
Sure. I actually am exactly running the Doud2016 example. I figured out that the error is because there shouldn't be single quotation marks around the numbers for the subamplicon alignment specs, so it's trying to run now. Thanks for your help on this!
Second question, do you have a general sense of how much compute is needed? How many CPUs did your team use and how long should it take for the dms2_batch_bcsubamp call in the Doud2016 jupyter notebook? Is it realistic to think this can be run on 4 cores on a virtual machine, or do I need to run this on a cluster? Thanks!
Great, so are the quotes a bug in the Jupyter notebook that I should fix?
It will not take that long on a four-CPU machine. Downloading the FASTQ files from the SRA will take a while, but the rest will take probably less than an hour. Note it does require quite a bit of RAM.
I'm not sure about the quotes - I wasn't ever able to get it to run from the Jupyter notebook. When I ran it directly from the command line I used the values without any quotes as such:
alignspecs = 1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36 1141,1425,29,33 1426,1698,40,43
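A rough sketch of why the quotes matter, using Python's standard `shlex` module to approximate how a POSIX shell splits the text into arguments. The helper names are hypothetical, and the check assumes each spec must be exactly four comma-separated integers, which is what the parsing line in the traceback implies. It also flags the trailing comma after `856,1140,31,36` in the first command, which would break parsing even without the quotes.

```python
import shlex

# The --alignspecs value as typed in the failing command (quoted, stray comma).
QUOTED = ("'1,285,36,37 286,570,31,32 571,855,37,32 "
          "856,1140,31,36, 1141,1425,29,33 1426,1698,40,43'")
# The value as typed in the working command (no quotes, no stray comma).
UNQUOTED = ("1,285,36,37 286,570,31,32 571,855,37,32 "
            "856,1140,31,36 1141,1425,29,33 1426,1698,40,43")


def split_like_shell(arg_text):
    """Approximate POSIX shell word splitting."""
    return shlex.split(arg_text)


def find_bad_specs(specs):
    """Return the specs that are not exactly four comma-separated integers."""
    bad = []
    for spec in specs:
        fields = spec.split(",")
        if len(fields) != 4 or not all(f.isdigit() for f in fields):
            bad.append(spec)
    return bad


print(len(split_like_shell(QUOTED)))      # one argument holding all six specs
print(len(split_like_shell(UNQUOTED)))    # six arguments, one per spec
print(find_bad_specs(split_like_shell(UNQUOTED)))
print(find_bad_specs(["856,1140,31,36,"]))  # the trailing comma adds an empty field
```

The single-element list shown for `alignspecs` in the "Parsed the following arguments" section of the log is consistent with the quoted case.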
Also, fyi when it completed I did see several warnings (sample below), which might be due to my own matplotlib installation, but just wanted to put that out there. Thanks for your help!
/data/anaconda/envs/dms/lib/python3.6/site-packages/plotnine/scales/scale.py:93: MatplotlibDeprecationWarning:
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
if cbook.iterable(self.breaks) and cbook.iterable(self.labels):
/data/anaconda/envs/dms/lib/python3.6/site-packages/plotnine/utils.py:553: MatplotlibDeprecationWarning:
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
return cbook.iterable(var) and not is_string(var)
OK, thanks.
I'm just going to close this as the source of the problem isn't clear.
The deprecation warnings are from plotnine, not dms_tools2. Probably if you upgrade to the newest plotnine they will go away (do pip install plotnine --upgrade --upgrade-strategy only-if-needed).
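If upgrading is not convenient, the warnings can also be hidden from Python code with the standard library `warnings` module. This is a sketch under assumptions: `silence_plotnine_warnings` is a hypothetical helper, and it relies on the warnings being attributed to modules under `plotnine`, as the file paths in the sample output suggest. It only hides the messages and does not change what plotnine does.

```python
import warnings


def silence_plotnine_warnings():
    """Ignore warnings attributed to the plotnine package or its submodules."""
    # The module argument is a regex matched against the start of the module name.
    warnings.filterwarnings("ignore", module=r"plotnine(\.|$)")


# Demo: simulate one warning from plotnine and one from elsewhere.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_plotnine_warnings()
    warnings.warn_explicit("simulated plotnine warning", UserWarning,
                           "scale.py", 93, module="plotnine.scales.scale")
    warnings.warn_explicit("simulated unrelated warning", UserWarning,
                           "other.py", 1, module="othermod")

print([str(w.message) for w in caught])
```

The helper would need to be called before the plotting code runs for the filter to take effect.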