sanger-pathogens/ariba

Version 3.0.3 of CARD breaks prepareref

JezSw opened this issue · 9 comments

JezSw commented

Details below. Fixing the version to V3.0.2 with the same setup works.

jez@ubuntu:~$ docker run -i -t sangerpathogens/ariba /bin/bash
root@038c21b11f21:/# ariba getref card out.card
Getting available CARD versions
Downloading "https://card.mcmaster.ca/download" and saving as "download.html" ... done
Found versions:
2.0.0	https://card.mcmaster.ca/download/0/broadstreet-v2.0.0.tar.gz
2.0.1	https://card.mcmaster.ca/download/0/broadstreet-v2.0.1.tar.gz
2.0.2	https://card.mcmaster.ca/download/0/broadstreet-v2.0.2.tar.gz
2.0.3	https://card.mcmaster.ca/download/0/broadstreet-v2.0.3.tar.gz
3.0.0	https://card.mcmaster.ca/download/0/broadstreet-v3.0.0.tar.gz
3.0.1	https://card.mcmaster.ca/download/0/broadstreet-v3.0.1.tar.gz
3.0.2	https://card.mcmaster.ca/download/0/broadstreet-v3.0.2.tar.gz
3.0.3	https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
Getting version 3.0.3
Working in temporary directory /out.card.download
Downloading data from card: https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
syscall: wget -O card.tar.gz https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
...finished downloading
Extracted json data file ./card.json. Reading its contents...
Found 2700 records in the json file. Analysing...
Extracted data and written ARIBA input files

Finished. Final files are:
	/out.card.fa
	/out.card.tsv

You can use them with ARIBA like this:
ariba prepareref -f /out.card.fa -m /out.card.tsv output_directory

If you use this downloaded data, please cite:
"The Comprehensive Antibiotic Resistance Database", McArthur et al 2013, PMID: 23650175
and in your methods say that version 3.0.3 of the database was used
root@038c21b11f21:/# ariba prepareref -f out.card.fa -m out.card.tsv out.card.prepareref
The following command failed with exit code 139
/usr/bin/cdhit-est -i /out.card.prepareref/02.cdhit.noncoding.fa -o /tmp.run_cd-hit.31d170aj/cdhit -c 0.9 -T 1 -s 0.0 -d 0 -bak 1

The output was:

================================================================
Program: CD-HIT, V4.6 (+OpenMP), Aug 14 2016, 11:49:02
Command: /usr/bin/cdhit-est -i
         /out.card.prepareref/02.cdhit.noncoding.fa -o
         /tmp.run_cd-hit.31d170aj/cdhit -c 0.9 -T 1 -s 0.0 -d 0
         -bak 1

Started: Thu Aug  8 11:05:07 2019
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 79

Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!

longest and shortest : 1860132 and 1441
Total letters: 2019859
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 2M
Buffer          : 1 X 408M = 408M
Table           : 1 X 16M = 16M
Miscellaneous   : 4M
Total           : 431M

Table limit with the given memory limit:
Max number of representatives: 40000
Max number of word counting entries: 46092845

comparing sequences from          0  to         79
Segmentation fault (core dumped)

You could try adding the --cdhit_max_memory argument to prepareref (see https://github.com/sanger-pathogens/ariba/wiki/Task:-prepareref) and see if that makes any difference - i.e. bump the memory up for cd-hit.

JezSw commented

I don't think this is a memory issue (see below). It looks like the main solution is to recompile cdhit with the specific flag (see weizhongli/cdhit#26), but this is awkward when there are existing docker images etc.

root@6b1996c8c7d4:/# ariba prepareref --cdhit_max_memory 15000 -f out.card.fa -m out.card.tsv out.card.prepareref
The following command failed with exit code 139
/usr/bin/cdhit-est -i /out.card.prepareref/02.cdhit.noncoding.fa -o /tmp.run_cd-hit.0nerjr5y/cdhit -c 0.9 -T 1 -s 0.0 -d 0 -bak 1 -M 15000

The output was:

================================================================
Program: CD-HIT, V4.7 (+OpenMP), Jul 01 2017, 08:43:07
Command: /usr/bin/cdhit-est -i
         /out.card.prepareref/02.cdhit.noncoding.fa -o
         /tmp.run_cd-hit.0nerjr5y/cdhit -c 0.9 -T 1 -s 0.0 -d 0
         -bak 1 -M 15000

Started: Thu Aug  8 14:14:46 2019
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 79

Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!

Ran into the same thing.

There is apparently an element in the database that's longer than the maximum sequence length that CD-HIT allows.

Recompiling CD-HIT yourself with MAX_SEQ=10000000 will fix your issue.
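For reference, a rough sketch of how that rebuild might look inside the container. The repo URL comes from the cdhit issue linked above and the install path from the failing command in the log; the MAX_SEQ value and the exact binary name are assumptions to check against your cd-hit version:

# Sketch only: rebuild cd-hit with a larger MAX_SEQ limit and put the binary
# where ARIBA expects it (path taken from the failing command above).
git clone https://github.com/weizhongli/cdhit.git
cd cdhit
make MAX_SEQ=10000000
cp cd-hit-est /usr/bin/cdhit-est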

I'll look at doing a change to the docker image next week, to fix this. Thanks for the info.

Looks like there is a 1860132 bp sequence in there ("longest and shortest : 1860132 and 1441")?! ARIBA won't like assembling that - it will take far too long.

I'd like to dig into this some more when I have time. I suspect a better fix may be to impose a max length on all reference sequences (at the moment there is only a max length for genes), which would result in that huge sequence getting removed.

Hi Martin. Agreed - fixing it in Ariba would be the best way, else we will effectively be forcing users to manually compile cdhit whenever they want to install Ariba and use CARD. To fix the problem at hand, it looks like we just need a length check in the _get_from_card method (ref_genes_getter.py). The length to check could be overridden by a command-line arg if necessary. Removed sequences can be shown in the getref log. I can do the change unless you want to check it out a bit more?

@fmaguire - thanks for referencing that issue. Good to know CARD are aware.

@kpepper - sounds like a good plan, great if you can do it. I think the cutoff should be applied when prepareref is run, that way it covers everything. This already happens for genes (see https://github.com/sanger-pathogens/ariba/blob/master/ariba/reference_data.py#L28)

I'm guessing the same default as the genes would work: 10k? Maybe worth checking the lengths of the non-coding seqs in datasets like CARD etc.?
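To make the idea concrete, here is a rough sketch of what such a length filter could look like. This is purely illustrative, not the actual ARIBA code: the function name, the 10000 default and the log format are all made up, and in the real fix the check would live in prepareref rather than getref. pyfastaq is the FASTA library ARIBA already depends on.

# Hypothetical sketch of the proposed non-coding length filter - not ARIBA code.
import pyfastaq.sequences

def filter_noncoding_by_length(infile, outfile, log_file, max_length=10000):
    '''Copies sequences <= max_length to outfile, logging anything removed.'''
    with open(outfile, 'w') as f_out, open(log_file, 'w') as f_log:
        for seq in pyfastaq.sequences.file_reader(infile):
            if len(seq) > max_length:
                # record the removed sequence so the user can see why it is gone
                print(seq.id, len(seq), 'removed: longer than ' + str(max_length), sep='\t', file=f_log)
            else:
                print(seq, file=f_out)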

This has been fixed in CARD now. However, release v2.1.4.3 includes two new prepareref arguments that can be used to filter on the length of non-coding sequences, so should a similar thing happen again it can be worked around by adjusting the filtering accordingly. The new arguments are --min_noncoding_length and --max_noncoding_length. We can already filter on gene lengths using --min_gene_length and --max_gene_length. Any filtered non-coding sequences will be listed in a new prepareref log file called 01.filter.check_noncoding.log.
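For example, something like the following would drop any non-coding sequence longer than 20 kb (the cutoff value here is just illustrative):

ariba prepareref --max_noncoding_length 20000 -f out.card.fa -m out.card.tsv out.card.prepareref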