marbl/Mash

Mash screen's output format makes parsing slow at scale

Opened this issue · 3 comments

bede commented

Parsing mash screen output (in Pandas at least) is slowed considerably by the shared hashes column, which is written as fractions (numerator/denominator). These are initially parsed as strings and need time-consuming string wrangling/evaluation before they can be converted to numbers. Perhaps the shared hashes could optionally be reported in separate numerator and denominator columns, so that they can be parsed efficiently as integers to begin with?

If anyone knows a faster way to parse a column containing millions of such fractions, I would be interested to hear it. I'd prefer not to rewrite the raw files (e.g. with sed) prior to parsing if possible.

Thanks,
Bede

Maybe I'm being lame (probably), but I first split the shared hashes on "/" with pandas string split into separate numerator and denominator columns, then convert them to numeric from there. Not sure if I've misunderstood your question. My use case for screen generally only gives me hundreds of rows, though, so I don't have the performance problem that comes with millions!

bede commented

Hi Carmen,
Yes, my concern is with larger datasets comprising e.g. millions of queries. Using column.str.split() dramatically slows down parsing.
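
For what it's worth, one possible way to avoid both column.str.split() and editing the raw files is to split the fraction while streaming the lines, and hand the result to the parser as an in-memory buffer. The sketch below is only that: it assumes tab-separated output with the shared hashes in the second field, the helper names and sample rows are made up for illustration, and I have not benchmarked it against str.split().

```python
import io


def split_shared_hashes(lines):
    """Yield each row with the 'shared/total' field split into two tab-separated fields.

    Assumes tab-separated rows whose second field looks like '991/1000'.
    Only that field is touched, so a '/' in later fields is left alone.
    """
    for line in lines:
        # Split off the first two fields only; the remainder stays as-is.
        identity, shared, rest = line.split("\t", 2)
        numerator, _, denominator = shared.partition("/")
        yield f"{identity}\t{numerator}\t{denominator}\t{rest}"


def as_text_buffer(lines):
    """Collect the rewritten rows into an in-memory, file-like text buffer."""
    return io.StringIO("".join(split_shared_hashes(lines)))


# Hypothetical rows for illustration only (not real mash screen results).
sample = [
    "0.99957\t991/1000\t24\t0\tGCF_000001.1\tExample genome A\n",
    "0.98021\t657/1000\t3\t0\tGCF_000002.1\tExample genome B with a / in its comment\n",
]

buffer = as_text_buffer(sample)
rows = [row.split("\t") for row in buffer.getvalue().splitlines()]
shared = [int(r[1]) for r in rows]  # numerators
total = [int(r[2]) for r in rows]   # denominators
```

In real use, the lines would come from an open file handle rather than a list, and the buffer could be passed to a reader that accepts file-like objects, e.g. pandas.read_csv(buffer, sep="\t", header=None), which should then see two plain integer columns. Blank or malformed lines would raise a ValueError here, so some guarding may be needed.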

Yes, separating them would seem an obvious upgrade, but it would also require everyone to update their existing systems. That's not a big problem as long as people are aware, though. It would save me a couple of lines of code :)