arangrhie/merfin

null output in hist and dump due to missing seek operation

Closed this issue · 5 comments

This might be something that can be updated in meryl-utility, so tagging @brianwalenz just in case.

The first time running merfin -hist, the output has a bunch of nulls/nans. The number of contigs is correct, but they appear to be uninitialised. Running the exact same command again then works correctly.

This doesn't appear to happen for -vmer, so after some digging, I noticed that the vmer critical block has

sfile->findSequence(seqId);
sfile->loadSequence(seq);

while for hist and dump it is only the loading.

It looks like findSequence from mery-utility is just a seek operation. I don't know too much about seeks, but presumably after generating the index, the loadSequence is just reading from the end of the index and hence the nulls.

After adding in the findSequence commands before the loading in hist and dump made this problem go away.

Generating fasta index.

Number of contigs: 712
0       (null)  0       0       0       nan
1       (null)  0       0       0       nan
2       (null)  0       0       0       nan
3       (null)  0       0       0       nan
gf777 commented

Hi @ASLeonard,

Merfin is still under active development..thanks for your help in debugging :-)

That said, yesterday I reverted the master to the Jan 31 version, which in my hands was stable. Some changes that were introduced later (I suspect also some commits to the kmer counter meryl) broke it, hence I reversed it (and I believe I also was having those kind of problems). Can you confirm that you are using the latest version?

I hadn't updated to the latest code before noticing this.

I would guess the current "old" code has this part, which is calling for a new dnaSeqFile (and internally seems to be calling the compressed file reopen() in its constructor).

sfile->generateIndex();
uint64 ctgn = sfile->numberOfSequences();
sfile = new dnaSeqFile(seqName);

I couldn't explicitly confirm in meryl-utility, but presumably your reverted sfile = new dnaSeqFile(seqName); has a similar effect as sfile->findSequence(seqId); inside the loop.

gf777 commented

this is @brianwalenz's magic :-) I can only confirm that the old version worked (after a long trial and error process by myself :-) ) and the new one didn't not. Not sure if it is also due to some changes in meryl-utility

It seems to be working fine now after that patch, and the unparallelised version would be very slow to be doing so many iterations, so I probably won't revert for now.
This can probably be closed as there is a stable revert now and Brian will probably still see this after being tagged.

gf777 commented

Cool, thanks I'll close the issue!