AnantharamanLab/vRhyme

No bins generated, bug or feature?

Joon-Klaps opened this issue · 2 comments

I've been running vRhyme on some of my test data (SRR11140750-test.zip), and it doesn't generate any output bins; some other output files are missing as well.

I'm curious why this is. If vRhyme doesn't produce any bins, is this because each sequence represents a distinct viral genome/segment? (In that case I would expect bins with only one sequence in them.) If so, it would be good to have a warning stating that no sequences were binned. Or is this kind of output unintentional?

Thanks in advance!

Docker container used: quay.io/biocontainers/vrhyme:1.1.0--pyhdfd78af_1
Command used:

vRhyme \
    -i SRR11140750.fa \
    -r SRR11140750_host.unmapped_1.fastq.gz SRR11140750_host.unmapped_2.fastq.gz \
    -o SRR11140750 \
    -t 4 \
    --verbose

Output structure:

$ tree SRR11140750
SRR11140750
├── log_vRhyme_paired_reads.tsv
├── log_vRhyme_SRR11140750.log
├── SRR11140750.circular.tsv
├── vRhyme_bam_files
│   └── SRR11140750_host.unmapped_1.sorted.bam
└── vRhyme_coverage_files
    ├── SRR11140750_host.unmapped_1.coverage.tsv
    ├── vRhyme_coverage_values.tsv
    └── vRhyme_names.txt
2 directories, 7 files

Log file:

Command:  /usr/local/bin/vRhyme -i SRR11140750.fa -r SRR11140750_host.unmapped_1.fastq.gz SRR11140750_host.unmapped_2.fastq.gz -o SRR11140750 -t 4 --verbose

Date:     2023-11-30 (y-m-d)
Start:    17:35:51   (h:m:s)
Program:  vRhyme v1.1.0


Time (min) |  Log                                                   
--------------------------------------------------------------------
0.0           Initializing and validating vRhyme parameters
0.01          Paired end read file(s) identified. Running bowtie2 on 1 set of paired files
              Caution: vRhyme performs optimally with 3+ samples
0.11          Extracting coverage information from BAM files
0.14          Coverage extraction complete. Generating coverage table
0.14          Performing pairwise coverage comparisons
0.14          vRhyme binning complete

Memory usage:       0.18
Runtime (min):      0.14
Bins generated:     0
Binned sequences:   0 (0%)
Input sequences:    42
Binned proteins:    0
Redundant proteins: 0 (0%)
Best iteration:     none
vRhyme score:       none

Output test:

 Python Dependencies
  -------------------
  scikit-learn: Success (v1.2.2)
  numpy: Success (v1.23.5)
  numba: Success (v0.56.4)
  pandas: Success (v2.0.0)
  pysam: Success (v0.21.0)
  networkx: Success (v3.1)


  Program Dependencies
  --------------------
  mmseqs: Success
  samtools: Success
  prodigal: Success
  mash: Success
> nucmer: Not Found! Optional
  bowtie2: Success
  bwa: Success


  Machine Learning Models
  -----------------------
  NN model: Success
  ET model: Success


By default, vRhyme does not generate any singleton bins. Any sequence that is not binned is either not a virus, a virus represented by a single sequence, a fragment without sufficient information to bin, or a sequence vRhyme failed to bin in error, so there are many possible reasons. Are you binning viral sequences only, or a mix of viral and non-viral? You only have 42 input sequences and 1 sample, so I'd assume there is just too little information to go on.
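Since singleton bins are not written by default, one way to see what was left out is to compare the input sequence names against the names that ended up in bins. The snippet below is a minimal sketch of that bookkeeping, not part of vRhyme: the helper names `read_fasta_names` and `unbinned_sequences` are hypothetical, and how you collect the binned names (for example from the bin FASTA files) depends on your output layout.

```python
def read_fasta_names(lines):
    """Return the sequence name (first word of each header) from FASTA lines."""
    names = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip empty headers such as a bare ">"
                names.append(parts[0])
    return names


def unbinned_sequences(input_names, binned_names):
    """Return input names that appear in no bin, preserving input order."""
    binned = set(binned_names)
    return [name for name in input_names if name not in binned]


# Toy data standing in for an input FASTA and the members of all bins.
input_fasta = [">contig_1 len=1200", "ACGT", ">contig_2", "GGCC", ">contig_3", "TTAA"]
binned = ["contig_2"]

singletons = unbinned_sequences(read_fasta_names(input_fasta), binned)
for name in singletons:
    print(name)
```

In a run like the one above, where zero bins are generated, every input sequence would be reported as unbinned, so each could be treated downstream as its own singleton.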

Hi @KrisKieft, thanks for the response! These results are from a test dataset containing only viral genomes (all complete covid genomes or fragments of the genome). Depth is low in general, not exceeding 5x. If I were to provide vRhyme with a list of BAM files, each containing only one sequence and all the reads mapped to it, would this be better? Maybe I'm not entirely following the concept of samples, as it seems counterintuitive to me to determine coverage covariance between sequencing run 1 and run 2 if run 1 comes from patient A and run 2 comes from patient B.
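To make the coverage covariance idea concrete, here is a minimal sketch of comparing two contigs by their coverage profiles across samples. It uses a plain Pearson correlation as a stand-in; this is an assumption for illustration and not vRhyme's actual comparison. The function name `coverage_correlation` and the coverage values are hypothetical.

```python
import math


def coverage_correlation(cov_a, cov_b):
    """Pearson correlation of two per-sample coverage profiles.

    Returns None when the correlation is undefined: fewer than two
    samples, or a profile with no variation across samples.
    """
    if len(cov_a) != len(cov_b):
        raise ValueError("profiles must cover the same samples")
    n = len(cov_a)
    if n < 2:
        return None
    mean_a = sum(cov_a) / n
    mean_b = sum(cov_b) / n
    var_a = sum((a - mean_a) ** 2 for a in cov_a)
    var_b = sum((b - mean_b) ** 2 for b in cov_b)
    if var_a == 0 or var_b == 0:
        return None
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(cov_a, cov_b))
    return cov_ab / math.sqrt(var_a * var_b)


# Hypothetical mean coverage of three contigs in three samples.
contig_1 = [2.0, 10.0, 5.0]
contig_2 = [4.0, 20.0, 10.0]  # rises and falls together with contig_1
contig_3 = [9.0, 1.0, 7.0]    # follows a different pattern

print(coverage_correlation(contig_1, contig_2))
print(coverage_correlation(contig_1, contig_3))

# With a single sample there is one value per contig and nothing to correlate.
print(coverage_correlation([3.0], [4.0]))
```

With one sample the function returns None, which mirrors the point that a single sample gives very little coverage signal to separate or group contigs. Whether profiles from unrelated samples are meaningful for a given dataset is a separate question that this sketch does not answer.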

How can I best feed data to vRhyme from a whole-pipeline perspective (reads -> contigs -> vRhyme), where input samples might not always be related?