haowenz/sigmap

Mapping fails.

mattloose opened this issue · 41 comments

Hi,

I've been trying to run the sigmap code over a number of samples. I first took some data of my own to test using sigmap with default parameters.

A small number of pass reads were mapped using sigmap and compared using the uncalled comparison script with truth showing:

Summary: 8000 reads, 346 mapped (4.33%)

Comparing to reference PAF
     P     N
T   0.00  1.27
F   4.30 94.40
NA: 0.03

Speed            Mean    Median
BP per sec:   5169.82   4719.08
BP mapped:     701.53    402.50
MS to map:     139.14     86.95

Clearly nothing here is mapping correctly.

The same data mapped with uncalled itself shows:

Summary: 8000 reads, 4383 mapped (54.79%)

Comparing to reference PAF
     P     N
T  54.19  1.26
F   0.56 43.95
NA: 0.04

Speed            Mean    Median
BP per sec:   6889.06   6882.87
BP mapped:     329.10    273.00
MS to map:      53.19     42.37

Concerned there may be an issue around the specific reads, I downloaded some of the Chlamydomonas reads referenced in the paper. Unfortunately uncalled is not able to map these reads at all. But apparently, neither can sigmap:

Summary: 4000 reads, 349 mapped (8.72%)

Comparing to reference PAF
     P     N
T   0.00 91.28
F   0.00  0.00
NA: 8.72

Speed            Mean    Median
BP per sec:   8775.22   4915.12
BP mapped:    2673.53   1315.00
MS to map:     714.58    279.56

I am running the sigmap code with the parameters as specified in the defaults (and also mentioned in the text of your paper). Sigmap produces no error messages and outputs paf files as expected. I've tried on multiple ubuntu computers although all running 16.04.

The only thing I can conclude is that the provided kmer model is not compatible with our reads, but it also doesn't appear compatible with the Chlamydomonas reads.

Thanks for any help you can offer.

Hi,

I would like to help on this. At least on the green algae genome, both tools are supposed to work. So maybe I can help on troubleshooting for data on this genome first and then on your own data. Can you provide more details (e.g. the exact data accession and command lines) on how you run all of them (tools and evaluation)? I will reproduce on my side and see what the problem is.

Thanks.

Hi - thanks for getting back to me.

So - the read files I obtained were:

https://sra-download.ncbi.nlm.nih.gov/traces/era20/ERZ/003237/ERR3237140/Chlamydomonas_0.tar.gz

This corresponds to reads from the project PRJEB31789 to the best of my knowledge.

The reference genome was indexed with the following command:

./sigmap -i -r ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -o Chlamydomonas

A small subset of reads were then mapped using:

./sigmap -m -x Chlamydomonas -r ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -s test_chlam -o Chlamydomonas_mapping.paf

where the folder test_chlam contains a few reads from the larger set as a single multi fast 5 file. I appreciate that parallel access to this single file is limited by hdf5 but I wanted a simple test set. If useful I can provide the read ids used.

This reference can be indexed by uncalled, but actually trying to map reads leads to a segfault and so these have not been mapped.

The input fast5 files were basecalled using guppy in high accuracy mode and the resulting reads mapped with minimap2:

minimap2 -x map-ont -t 24 ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz test_chlam_fastq/*.fastq.gz > Chlamydomonas_mapping_ref.paf

The two resulting paf files were compared using uncalled pafstats:

uncalled pafstats -r Chlamydomonas_mapping_ref.paf  Chlamydomonas_mappings.paf

which gives:

Summary: 4000 reads, 349 mapped (8.72%)

Comparing to reference PAF
     P     N
T   0.00 91.28
F   0.00  0.00
NA: 8.72

Speed            Mean    Median
BP per sec:   8775.22   4915.12
BP mapped:    2673.53   1315.00
MS to map:     714.58    279.56

Our own dataset (happy to share reads with you if useful) is a Zymo dataset. These reads were generated more recently.

Thanks for your help.

I was able to get the same results in the paper on the green algae reads I have. Would you like to share the two paf files (Chlamydomonas_mapping_ref.paf and Chlamydomonas_mappings.paf)? Or if possible share several of your raw reads and their fastq files.

I remembered I had similar results as what you got once long time ago and found out the problem is that the read names in hdf and fastq files don't match each others. Doing the base calling once again solved our problems. But we should be able to see if this is the problem quickly by looking at the pafs. If this is not the reason, you may share the sample data you have and I will have a look in details.

Thanks!

Hi,

Thanks for getting back to me. So - I've rechecked the green algae reads and the paf files and you are correct about a possible issue although resolving it does not solve the problem.

For the record it appears as though the guppy basecaller doesn't maintain file namings on these reads for some reason. I therefore recalled the data and tested on just the reads that were used. The output of pafstats changes a little - but not enough.

It's also worth noting that this problem does not occur on our local test set for which the reads were shared.

uncalled pafstats -r test_chlam/Checked_Chlmaydomonas_ref.paf Chlamydomonas_mappings_redo.paf
Summary: 4000 reads, 349 mapped (8.72%)

Comparing to reference PAF
     P     N
T   0.00 67.45
F   1.18 23.82
NA: 7.55

Speed            Mean    Median
BP per sec:   8379.91   4602.34
BP mapped:    2673.53   1315.00
MS to map:     744.31    283.65

Only around 1000 of the reads in this test set mapped to the reference we are using.

I attach the re-run paf files (gz) for you here so you can have a look.

This is the reference set generated with the following command:

minimap2 --paf-no-hit -x map-ont ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz basecalling/*/*.fastq.gz > Checked_Chlmaydomonas_ref.paf

Checked_Chlmaydomonas_ref.paf.zip

This is the output of sigmap generated with:

./sigmap -m -x Chlamydomonas -r ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -s test_chlam/ -o Chlamydomonas_mappings_redo.paf

Chlamydomonas_mappings_redo.paf.zip

Thanks,

Matt

P.s excuse all the typos in the file namings :-)

What do you mean by " this problem does not occur on our local test set for which the reads were shared"?

This is weird. From the accession number you provided, it seems that we are using the same green algae data. But when I search the read names from your paf files in my minimap2 paf file, I could not find them. And in your minimap2 paf file, almost all the reads are not aligned or not fully aligned (only hundreds of bp out of thousands of bp which is the whole length is aligned). On the contrary, most of my reads were properly aligned by minimap2, which is expected. If the reads are not aligned by their sequences, you cannot expected them to get aligned by their signal, given the current base calling methods can achieve ~90% accuracy. I feel there is still something wrong with the way you processed your data. How did you choose this sample and perform the base calling? And I would appreciate if you can think of other possible issues. Thanks!

Well, I downloaded the data from EBI at https://www.ebi.ac.uk/ena/browser/view/PRJEB31789. I am not sure if NCBI some how magically changed or broke the data, which is less likely to happen but it happens sometimes.

Hi,

Thanks for coming back to me:

What do you mean by " this problem does not occur on our local test set for which the reads were shared"?

So - yes we did have a mismatch in the read_ids for the green algae data but we do not have this for the our local test data set. i.e the explanation doesn't fully resolve the issue.

One question - what reference are you mapping too? I may not be using the correct one?

Also - which files did you use from the ENA?

For basecalling here we used the standard high accuracy mode in the latest guppy version 5.

[edited for clarity]

I am using the reference with the accession number specified in the paper. I checked the name of the file and mine is "Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa", which I believe should be the same as yours (at least the file name is long and the same).

I am using "PRJEB31789 | SAMEA5522908 | ERX3264520 | ERR3237140 Chlamydomonas_0.tar.gz", which also seems to be the same as yours.

For fastq, I think I was using the fastq files inside the "Chlamydomonas_0". At least from the data I downloaded from EBI, their names match. That name mismatch problem happened to me on a yeast dataset. Did you pick all the samples in the fail directory? I guess some of the reads in that folder might not be able to map.

So you may try to sample some reads from pass folder and see if it works. Then if you have time, you might try download the data from EBI, though this is less likely to be the reason. Thanks.

OK - thanks for the update - it's not the reference then - I am re-running using the whole of the Chlamydomonas_0.tar.gz file rather than a random subset. The random subset does include (by the looks) a higher proportion of fail than pass reads - this might explain some of the difference.

Are you mapping to the whole reference or a subset of it? And what is the maximum number of reads you are mapping in one go?

In the paper, we were mapping to the whole reference to show the scalability. But sadly we all know that loading fast5 file is extremely slow. So in your case, to quickly try if it works, you can just use 4000 reads in one folder (e.g., pass/0) and map them. Then evaluate the mapping using the corresponding fastq file. This kind of tests should be able to get finished within 30 min.

Hi,

OK - I've re-downloaded the data from ENA and taken the reads as you suggest from the pass/0 folder. I re-installed sigmap and re-indexed the Chlamydomonas genome. I have used the original basecalls provided in the data as well as re-basecalling them with the latest guppy.

The resulting PAF files are provided here.
Original calls mapped to the reference with minimap2:
Chlamydomonas_0_reads_downloads_pass_0_original_calls.paf.gz
New Guppy basecalls mapped to the reference with minimap2:
Chlamydomonas_0_reads_downloads_pass_0_guppy.paf.gz
Original signal files mapped to the reference by sigmap:
Chlamydomonas_0_reads_downloads_pass_0_sigmap.paf.gz

Testing these three files they all share the same read_ids.

Using uncalled pafstats I can compare these files - original calls vs guppy calls:

uncalled pafstats -r Chlamydomonas_0_reads_downloads_pass_0_original_calls.paf Chlamydomonas_0_reads_downloads_pass_0_guppy.paf
Summary: 5982 reads, 5802 mapped (96.99%)

Comparing to reference PAF
     P     N
T  85.44  2.93
F  10.46  0.08
NA: 1.09

Then if I compare the sigmap mapped file back to the original calls:

uncalled pafstats -r Chlamydomonas_0_reads_downloads_pass_0_original_calls.paf Chlamydomonas_0_reads_downloads_pass_0_sigmap.paf
Summary: 4000 reads, 176 mapped (4.40%)

Comparing to reference PAF
     P     N
T   0.07  5.42
F   4.05 90.17
NA: 0.28

Speed            Mean    Median
BP per sec:   2460.02   2171.06
BP mapped:    4872.63   3066.00
MS to map:    2202.46   1175.81

or if I compare the sigmap mapped file to the guppy calls:

uncalled pafstats -r Chlamydomonas_0_reads_downloads_pass_0_guppy.paf Chlamydomonas_0_reads_downloads_pass_0_sigmap.paf
Summary: 4000 reads, 176 mapped (4.40%)

Comparing to reference PAF
     P     N
T   0.07  4.33
F   4.15 91.28
NA: 0.17

Speed            Mean    Median
BP per sec:   2460.02   2171.06
BP mapped:    4872.63   3066.00
MS to map:    2202.46   1175.81

As before, I am unable to map these signal files with uncalled at all as it crashes when loading the index.

So at present I cannot get sigmap to work on these files.

Just to add - I've uploaded a recent file I've been testing here:
Zymo Test File

This file contains reads mapping to this reference - Zymo Ref

This is a file for mapping with minimap2:
zymo_guppy.paf.zip

This is a file for mapping with uncalled:
test_zymo_uncalled.paf.zip

This is a file for mapping with sigmap:
test_zymo_sigmap.paf.zip

Comparing uncalled with guppy/minimap2 gives:

uncalled pafstats -r zymo_guppy.paf test_zymo_uncalled.paf
Summary: 1000 reads, 886 mapped (88.60%)

Comparing to reference PAF
     P     N
T  85.50  4.40
F   3.00  7.00
NA: 0.10

Speed            Mean    Median
BP per sec:   4552.13   4077.37
BP mapped:     730.92    472.00
MS to map:     181.44    127.94

Comparing sigmap with guppy/minimap2 give:

uncalled pafstats -r zymo_guppy.paf test_zymo_sigmap.paf
Summary: 1000 reads, 53 mapped (5.30%)

Comparing to reference PAF
     P     N
T   0.00  4.10
F   4.90 90.60
NA: 0.40

Speed            Mean    Median
BP per sec:   7242.75   5908.10
BP mapped:    2869.19   1726.00
MS to map:     536.50    263.83

It would be great to know if you can get sigmap to work on these - it's not clear to me why this doesn't work here.

Thanks for showing the results and sorry for the trouble. I will reproduce these two runs myself to find out the problem you may have.

For the algae data, I first built the index:

./sigmap -i -r Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -o algae_index
Dimension: 6, max leaf: 20
Reference file: Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Output file: algae_index
Loaded 4096 kmers in 0.0643301s.
Number of sequences: 53.
Number of bases: 111098438.
Loaded all sequences successfully in 0.237047s.
# 11mers: 107033104
Mask 10140484 high frequency kmer in 8.78215s.
Convert 53 sequences to signals in 1.06118s.
Collected 210831526 points.
Built spatial index.
Built index successfully in 389.353s.
Saved in 8.45787s.

Then I mapped the reads using Sigmap:

./sigmap -m -x algae_index -r Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -s Chlamydomonas_0/reads/downloads/pass/0/ -o algae_mapping.paf -t 24
Number of threads: 24
Reference file: Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Reference index file: algae_index
Signal directory:Chlamydomonas_0/reads/downloads/pass/0/
Output file: algae_mapping.paf
Loaded 4000 reads in 94.1568s.
Loaded 4096 kmers in 0.010505s.
Number of sequences: 53.
Number of bases: 111098438.
Loaded all sequences successfully in 0.205862s.
Convert 53 sequences to signals in 1.06433s.
Load point cloud successfully! dim: 6, max leaf: 20, point cloud size: 210831526
Memory: 3413556480.
Loaded index successfully in 7.37641s.
Finished mapping in 97.1278, # reads: 4000
Move mappings in 0.000659943s.

Then I mapped the reads using minimap2:

./minimap2 -x map-ont -t 24 Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa Chlamydomonas_0/reads/downloads/pass/*_0.fastq > ~/scratch/algae_minimap2.paf
[M::mm_idx_gen::3.094*1.22] collected minimizers
[M::mm_idx_gen::3.540*1.80] sorted minimizers
[M::main::3.540*1.80] loaded/built the index for 53 target sequence(s)
[M::mm_mapopt_update::3.740*1.76] mid_occ = 129
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 53
[M::mm_idx_stat::3.794*1.75] distinct minimizers: 12540538 (79.96% are singletons); average occurrences: 1.606; average spacing: 5.515; total length: 111098438
[M::worker_pipeline::5.458*4.63] mapped 4000 sequences
[M::main] Version: 2.21-r1071
[M::main] CMD: ./minimap2 -x map-ont -t 24 Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa Chlamydomonas_0/reads/downloads/pass/fastq_runid_4b31c449204a1f1fccf4c922862bdcdddb658410_0.fastq
[M::main] Real time: 5.530 sec; CPU: 25.368 sec; Peak RSS: 1.060 GB

Then I evaluated the results using pafstats from uncalled:

uncalled pafstats -r ~/scratch/algae_minimap2.paf -a algae_mapping.paf > ~/scratch/algae.ann
Summary: 4000 reads, 3294 mapped (82.35%)

Comparing to reference PAF
     P     N
T  79.88  5.50
F   2.27 12.15
NA: 0.20

Speed            Mean    Median
BP per sec:   4312.07   4178.30
BP mapped:    1545.93    892.00
MS to map:     359.71    256.90

I think I used the same test data you were using. The results look normal. So I am not sure what happened to your runs. And I showed all the output from my experiments so you might be able to identify the issue. I will soon run the Zymo data and hopefully we can find the problem.

When I run on the zymo test data you provided. I got the following error.

./sigmap -m -x zymo_index -r zymo/test_data/zymo_D6322.fna.gz -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -s zymo/test_data/original/test/ -o zymo_mapping.paf -t 24
Number of threads: 24
Reference file: zymo/test_data/zymo_D6322.fna.gz
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Reference index file: zymo_index
Signal directory: zymo/test_data/original/test/
Output file: zymo_mapping.paf
HDF5-DIAG: Error detected in HDF5 (1.10.7) thread 1:
  #000: H5Dio.c line 198 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 599 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #002: H5Dchunk.c line 2589 in H5D__chunk_read(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #003: H5Dchunk.c line 3952 in H5D__chunk_lock(): data pipeline read failed
    major: Dataset
    minor: Filter operation failed
  #004: H5Z.c line 1384 in H5Z_pipeline(): required filter 'vbz' is not registered
    major: Data filters
    minor: Read failed
  #005: H5PLint.c line 270 in H5PL_load(): search in path table failed
    major: Plugin for dynamically loaded library
    minor: Can't get value
  #006: H5PLpath.c line 604 in H5PL__find_plugin_in_path_table(): search in path /usr/local/hdf5/lib/plugin encountered an error
    major: Plugin for dynamically loaded library
    minor: Can't get value
  #007: H5PLpath.c line 656 in H5PL__find_plugin_in_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
terminate called after throwing an instance of 'hdf5_tools::Exception'
  what():  /read_0029e119-bab3-4c00-8835-2deaaef3266d/Raw/Signal: error in H5Dread
Aborted (core dumped)

I assumed "zymo_test_2.fast5" is the sample you generated? I put it into a test directory but failed to load it. When I dumped the file into text. I also got errors. Can you checked if the sample fast5 you provided is corrupted? Thanks.

Something I note from your successful run above - your analysis completes very quickly on 24 cores. I'm not seeing performance like that.

It makes me think there is something missing on the systems I am compiling with. Is there any useful information I can provide you with for that?

The speed issue should come from the hdf5 you are using. Mapping time shouldn't be too long. If possible, can you provide the output when you run the tools together with the command line just like what I showed? Then I can see if there is any difference, which may help.

Yes - sure - i will post them one at a time - especially as I have already noticed a difference:

./sigmap -i -r ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -
p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -o Chlamydomonas
Dimension: 6, max leaf: 20
Reference file: /home/plzmwl/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Output file: Chlamydomonas
Loaded 4096 kmers in 0.0205829s.
Number of sequences: 53.
Number of bases: 111098438.
Loaded all sequences successfully in 0.858174s.
# 11mers: 107033104
Mask 10140484 high frequency kmer in 7.88047s.
Convert 53 sequences to signals in 1.35307s.
Collected 103579677 points.
Built spatial index.
Built index successfully in 202.166s.
Saved in 10.3449s.

I note the number of points collected here is different to that reported in your command above - which seems odd?

Are you using the release version 0.1?

I followed the install instructions in your readme, which uses the master branch of this repo.

Weird. I am using the master branch without any change too. What is your compiler and system?

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

It seems that you lost almost half of the points when building the index, which makes it not possible to get correct mappings.

This is so weird. I have to find a machine with such configuration and see if I can reproduce the error.

Did you change anything (code or Makefile) before you installed it?

I've just tried on a different system and get the same result:

./sigmap -i -r /data/refs/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -o Chlamydomonas -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Dimension: 6, max leaf: 20
Reference file: /data/refs/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Output file: Chlamydomonas
Loaded 4096 kmers in 0.0156798s.
Number of sequences: 53.
Number of bases: 111098438.
Loaded all sequences successfully in 0.67858s.
# 11mers: 107033104
Mask 10140484 high frequency kmer in 7.32922s.
Convert 53 sequences to signals in 1.29215s.
Collected 103579677 points.
Built spatial index.
Built index successfully in 188.467s.
Saved in 13.4878s.

Again - 103579677 points and not 210831526.

This system has the same gcc

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

Is it possible for you to try a different gcc version (>8)?

Just replying to your other comment.

I've not changed anything - literally copied and pasted the commands from your README.

I agree - the likely explanation for the discrepancy in results here is the difference in the points being retained.

I've just seen your comment about using a different gcc version - I'll try that - I've just gone with whatever is default on the systems I have here.

I'll update you.

OK - looks like we might have found the issue!

Using gcc-8 I get:

 ./sigmap -i -r ~/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -o Chlamydomonas
Dimension: 6, max leaf: 20
Reference file: /home/plzmwl/references/Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.dna.toplevel.fa.gz
Pore model file: extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model
Output file: Chlamydomonas
Loaded 4096 kmers in 0.017905s.
Number of sequences: 53.
Number of bases: 111098438.
Loaded all sequences successfully in 0.867622s.
# 11mers: 107033104
Mask 10140484 high frequency kmer in 8.28548s.
Convert 53 sequences to signals in 1.36537s.
Collected 210831526 points.
Built spatial index.
Built index successfully in 498.476s.
Saved in 20.7004s.

I'll try the next steps now and let you know.

Reply to a previous post. I was able to install the plugin and finish the test on zymo. Results are as follows:

uncalled pafstats -r ~/scratch/zymo_minimap2.paf -a zymo_mapping.paf > ~/scratch/zymo.ann
Summary: 1000 reads, 873 mapped (87.30%)

Comparing to reference PAF
     P     N
T  86.50  4.40
F   0.70  8.30
NA: 0.10

Speed            Mean    Median
BP per sec:   7714.81   7347.09
BP mapped:     866.28    796.00
MS to map:     112.31     92.82

And they look normal (slightly better than uncalled in some aspects). I also want to mention that one should not solely rely on the uncalled pafstats script to evaluate the results. See https://github.com/haowenz/sigmap/tree/master/eval.

Great - I'm just waiting on the green algae results to finish but I suspect we've resolved the issue - now that the index points look the same I would presume this would complete as expected. Thanks for your help!

Yes. It seems that is the problem. Thanks for providing the test data and testing the tools! I will soon update the manual to ask users to use gcc version > 8 to compile.

I am surprised that gcc version can cause this issue. The code for index building is straightforward and is not supposed to have this annoying issue. I will have a look at this later.

OK - I'm still having problems:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

I can't map anything now - I just get this all the time. So it will create the indexes OK but can't map.

Did you make clean and then make after you change the compiler? Can you post the commend lines and the whole output from Sigmap? Also make sure you are using the new constructed index to map reads.

Also can you share your GCC version and system so that I can see if I can reproduce the errors you got and fix it. Thanks!

Yes - I did.

I also went down to the hdf5 folders and did the same thing.

I will post the commands here when i get time - sorry it's proving so difficult.

Sorry for all the troubles you had. The reason I wrote this tool in C++ is to make sure there is no pain when installing and using it. I didn't expect a C++ program would have this kind of issues. Right now I don't have bandwidth on refactoring or improving the code. But the issues you were experiencing told me I really have to do that later. For now I can only try my best to help you resolve the issue you have. Take your time and hopefully we can finally solve the problem.

So looking at the comments from #2 it does not appear to be a gcc issue after all. The author of #2 is able to compile using gcc 5 and get good results.

Just went through the conversation and a bit around the code that computes those "points" as you identified a difference.
Here something is a bit fishy -

(signal_position > 0 && abs(signal_values[signal_position] -
.
@haowenz did you meant fabs() instead of abs()?
@mattloose Try making that abs(signal_values[signal_position] - point_cloud.back().value) > 0.01) to fabs(signal_values[signal_position] - point_cloud.back().value) > 0.01), recompile and check the number of points?

Just went through the conversation and a bit around the code that computes those "points" as you identified a difference.
Here something is a bit fishy -

(signal_position > 0 && abs(signal_values[signal_position] -

.
@haowenz did you meant fabs() instead of abs()?
@mattloose Try making that abs(signal_values[signal_position] - point_cloud.back().value) > 0.01) to fabs(signal_values[signal_position] - point_cloud.back().value) > 0.01), recompile and check the number of points?

Thanks. This is a good catch. It should have been std::abs(), which is overloaded for int, or fabs()/std::fabs(). I pushed a fix for this. But I am not sure if this caused the problem.

I've been able to replicate the results here:

#1 (comment)

To do this I had to recompile using >= gcc-9 .

I also had to convert the multi fast5 format to single fast5 files.

I'll close this issue now - thanks.