seqan/chopper

Output related questions

JensUweUlrich opened this issue · 3 comments

Hi,

I have some questions regarding the output of chopper layout. I tried to calculate the layout for the viral refseq and got the following header lines as part of the output

#HIGH_LEVEL_IBF max_bin_id:173
#MERGED_BIN_0 max_bin_id:153
#MERGED_BIN_1 max_bin_id:196
#MERGED_BIN_2 max_bin_id:117
#MERGED_BIN_3 max_bin_id:253
#MERGED_BIN_4 max_bin_id:127
#MERGED_BIN_5 max_bin_id:167
.
.
.
#MERGED_BIN_447 max_bin_id:34
#FILES  BIN_INDICES     NUMBER_OF_BINS
files.renamed/GCF_002826665.1_genomic.fna.gz    0;0     1;1
files.renamed/GCF_002219365.1_genomic.fna.gz    0;1     1;1
files.renamed/GCF_003847265.1_genomic.fna.gz    0;2     1;1
files.renamed/GCF_002826065.1_genomic.fna.gz    0;3     1;2
files.renamed/GCF_000915375.1_genomic.fna.gz    0;5     1;1
.
.
.
iles.renamed/GCF_001995575.1_genomic.fna.gz    432     1
files.renamed/GCF_001041755.1_genomic.fna.gz    433;0   1;35
files.renamed/GCF_001502095.1_genomic.fna.gz    433;35  1;29
files.renamed/GCF_000903335.1_genomic.fna.gz    434     1
files.renamed/GCF_002116175.1_genomic.fna.gz    435;0   1;35
files.renamed/GCF_016811445.1_genomic.fna.gz    435;35  1;29
files.renamed/GCF_001308775.1_genomic.fna.gz    436     1
files.renamed/GCF_001041035.1_genomic.fna.gz    437     1
files.renamed/GCF_000865825.1_genomic.fna.gz    438     1
files.renamed/GCF_002826725.1_genomic.fna.gz    439;0   1;40
files.renamed/GCF_000839765.1_genomic.fna.gz    439;40  1;24
files.renamed/GCF_001602085.1_genomic.fna.gz    440     1
files.renamed/GCF_002628245.1_genomic.fna.gz    441     1
files.renamed/GCF_000887095.1_genomic.fna.gz    442     1
files.renamed/GCF_000924835.1_genomic.fna.gz    443     1
files.renamed/GCF_000922335.1_genomic.fna.gz    444     1
files.renamed/GCF_001654305.1_genomic.fna.gz    445     1
files.renamed/GCF_000848085.2_genomic.fna.gz    446;0   1;32
files.renamed/GCF_000923135.1_genomic.fna.gz    446;32  1;32
files.renamed/GCF_001316375.1_genomic.fna.gz    447;0   1;34
files.renamed/GCF_000875305.1_genomic.fna.gz    447;34  1;30
files.renamed/GCF_000893455.1_genomic.fna.gz    448     1

As far as I can see, these are all merged bins, but what does max_bin_id refer to? And how can I infer the topology of the hierarchy from the output?
How can I interpret the BIN_INDICES and NUMBER_OF_BINS columns?

Cheers
Jens

Hi Jens,

Thanks for reaching out.

So the layout file was designed to be easily readable by Raptor to build the index rather than for users to interpret.
But I should still document this some more.

The max_bin_id refers to the single bin in the respective interleaved Bloom filter (the high-level IBF or the IBF of a merged bin) that is expected to store the largest number of k-mers. The size of the interleaved Bloom filter must be calculated from this largest individual bin so that the maximum false positive rate is always guaranteed.
This information is important for building the index but probably not very interesting for you.
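
To illustrate why the fullest bin drives the size: all technical bins of an IBF share one bin size, so the bin with the most k-mers is the one that has to meet the false positive rate. The following is a minimal sketch using the standard Bloom filter approximation. The function names, the k-mer counts, the number of hash functions, and the rate of 0.05 are made up for illustration, and chopper/Raptor may compute the size differently.

```python
import math

def bloom_bin_size(num_kmers: int, num_hashes: int, max_fpr: float) -> int:
    """Bits needed so that a bin holding `num_kmers` elements stays at or
    below `max_fpr`, using the approximation fpr = (1 - exp(-h*n/m))**h."""
    return math.ceil(
        -num_hashes * num_kmers / math.log(1.0 - max_fpr ** (1.0 / num_hashes))
    )

def bloom_fpr(num_kmers: int, num_hashes: int, bin_size_bits: int) -> float:
    """Approximate false positive rate of a bin with the given size."""
    return (1.0 - math.exp(-num_hashes * num_kmers / bin_size_bits)) ** num_hashes

# Hypothetical expected k-mer counts for the technical bins of one IBF.
kmers_per_bin = [1200, 800, 5000, 300]

# The bin with the most k-mers plays the role of max_bin_id.
max_bin_id = max(range(len(kmers_per_bin)), key=kmers_per_bin.__getitem__)

# Sizing every bin for the fullest one keeps all bins within the target rate.
bin_size = bloom_bin_size(kmers_per_bin[max_bin_id], num_hashes=2, max_fpr=0.05)
```

Sizing for any smaller bin instead would push the fullest bin above the target rate.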

Now to the hierarchy.
First, both columns are lists of numbers separated by ;. The semicolon separates the levels. So if an entry has 3 numbers, the respective file/user bin is stored across 3 levels.

The BIN_INDICES are identifiers that report in which technical bin of an IBF within the HIBF the file/user bin is stored. Files that have the same bin index at a certain level position are within the same merged bin (and thereby also in the same low-level IBF).

NUMBER_OF_BINS reports how many technical bins the file occupies at each level. This cannot be more than one for any level except the last (it's a little verbose). A number higher than one means the file content was split into several bins.
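
As a rough sketch of how one layout line could be read under the description above: the helper below splits a line into the file name and the two per-level lists. The function name `parse_layout_line` is hypothetical, and splitting on whitespace assumes that file names contain no spaces.

```python
def parse_layout_line(line: str):
    """Split one non-header layout line into
    (filename, bin_indices, number_of_bins), one list entry per level."""
    filename, indices, counts = line.split()
    bin_indices = [int(x) for x in indices.split(";")]
    number_of_bins = [int(x) for x in counts.split(";")]
    if len(bin_indices) != len(number_of_bins):
        raise ValueError("BIN_INDICES and NUMBER_OF_BINS differ in length")
    return filename, bin_indices, number_of_bins

# A line taken from the output posted above.
name, idx, num = parse_layout_line(
    "files.renamed/GCF_001041755.1_genomic.fna.gz    433;0   1;35"
)
levels = len(idx)  # number of levels this file is stored across
```

For this line, `idx` is `[433, 0]` and `num` is `[1, 35]`: the file sits in merged bin 433 on the top level and occupies 35 technical bins, starting at index 0, in the corresponding low-level IBF.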

Let's look at an easy example

File1    0      2
File2    2      1

This means there is only a single level. File1 is split into two technical bins (TB), namely TB-0 and TB-1. Thus File2, which is stored in only a single bin, is stored in TB-2.
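
The same arithmetic as a small sketch, assuming (as in this example) that a split file occupies consecutive technical bins starting at its bin index. The helper name `last_level_bins` is made up for illustration.

```python
def last_level_bins(bin_index: int, number_of_bins: int) -> range:
    """Technical bins a file occupies on its last level: `number_of_bins`
    consecutive bins starting at `bin_index`."""
    return range(bin_index, bin_index + number_of_bins)

# (BIN_INDICES, NUMBER_OF_BINS) of the single-level example.
layout = {"File1": (0, 2), "File2": (2, 1)}

occupied = {name: list(last_level_bins(i, n)) for name, (i, n) in layout.items()}
# File1 -> [0, 1], File2 -> [2]
```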

Next example

File1    0;0     1;5
File2    0;5     1;5
File3    1        9

Now there are two levels.
On the top level, File1 and File2 are merged into a single bin, TB-0. File3 is split into 9 technical bins, TB-1 to TB-9. There is one merged bin on the top level, thus there is one low-level IBF. In the low-level IBF, File1 and File2 are each split into 5 bins.
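
One way to recover the topology from such a layout is to group files by the merged bins that lead to their last level. The sketch below is only an illustration of that idea: `group_by_ibf` is a hypothetical helper, each IBF is identified by the tuple of merged-bin indices on the path to it (the empty tuple is the top-level IBF), and whitespace-separated columns are assumed.

```python
from collections import defaultdict

def group_by_ibf(lines):
    """Map each IBF to the files stored in it, together with the first and
    last technical bin each file occupies there."""
    ibfs = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        name, indices, counts = line.split()
        idx = [int(x) for x in indices.split(";")]
        num = [int(x) for x in counts.split(";")]
        path = tuple(idx[:-1])        # merged bins above the last level
        first, n = idx[-1], num[-1]   # placement within the last-level IBF
        ibfs[path].append((name, first, first + n - 1))
    return dict(ibfs)

example = ["File1\t0;0\t1;5", "File2\t0;5\t1;5", "File3\t1\t9"]
topology = group_by_ibf(example)
# {(0,): [("File1", 0, 4), ("File2", 5, 9)], (): [("File3", 1, 9)]}
```

The key `(0,)` stands for the low-level IBF behind merged bin TB-0 of the top level, and the key `()` lists the files stored directly in the top-level IBF.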

I hope this is helpful?

Then you can probably parse the layout.
E.g. the following gives you the maximum number of levels:

grep -v "^#" RefSeqCG_arc_bac.192.layout | cut -f 2 | awk 'BEGIN{FS=";"}{if(NF > LEV) {LEV=NF}}END{print LEV}'
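
A Python take on the same idea, as a sketch only: it counts how many files are stored at each depth, and the largest key is the maximum number of levels. The function name `level_histogram` is made up, and the input is assumed to be the lines of a layout file with whitespace-separated columns.

```python
from collections import Counter

def level_histogram(lines):
    """Count files per number of levels, skipping '#' header lines."""
    depth = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        bin_indices = line.split()[1]           # second column: BIN_INDICES
        depth[len(bin_indices.split(";"))] += 1
    return depth

sample = [
    "#HIGH_LEVEL_IBF max_bin_id:173",
    "files.renamed/GCF_002826665.1_genomic.fna.gz\t0;0\t1;1",
    "files.renamed/GCF_000903335.1_genomic.fna.gz\t434\t1",
    "files.renamed/GCF_001041755.1_genomic.fna.gz\t433;0\t1;35",
]
hist = level_histogram(sample)
max_levels = max(hist)
```

For a real layout, the lines could come from an open file handle, e.g. `level_histogram(open(path))` with `path` pointing to the layout file.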

What information are you interested in? Maybe I'll add a "chopper stats" if this turns out to be needed often.

Hi Svenja,

thanks for your quick reply and for shedding light on the output. I was just playing around with the HIBF. I used the entire viral refseq and computed a layout with chopper, but I was stuck because I did not understand the structure of the HIBF that would result from building it with raptor. I'm currently thinking about applying your approach in my tools as well. Maybe better documentation would suffice for a start ;-)

Hi Jens,

I'll work on the documentation!

We are currently thinking about how we could move the HIBF to seqan3 so that the data structure is available without chopper/raptor.
Would you be interested in that? Or rather, is your use case one that would benefit from a header-only "seqan3::hibf" data structure?

Best,
Svenja