marbl/Mash

Checking if two mash sketches are identical


Hi,

Is it possible to check whether two mash sketches are identical (i.e. `mash dist` would return a distance of 0) without running `mash dist`?

For example,

diff <(tail -n +2 file1.msh) <(tail -n +2 file2.msh)

works for some sketches but not all. For others, something like

`diff \
    <(xxd file1.msh | awk 'x==1 {$1="";print $0} /........ACGT..../ {x=1}') \
    <(xxd file2.msh | awk 'x==1 {$1="";print $0} /........ACGT..../ {x=1}')`

can work, but again, it's not universal. Is there a one-liner that reliably dumps just the k-mer hashes of a sketch to stdout?

This would be very useful if you have, say, one million mash sketches and want to remove redundancy without running all-vs-all comparisons. You could just keep a list of digests of the sketches, computed over the hashes only (excluding the filename, comment, and other metadata stored in each sketch).

Never mind, I did not realize that `mash info` now has options.
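
For anyone landing here later, a minimal sketch of the digest idea described above. It assumes the sketch has been dumped to JSON (I believe `mash info -d` does this, but check `mash info -h` on your version) and that the dump has a top-level `sketches` list whose entries carry `name` and `hashes`. The key names and the `sketch_digests` helper are assumptions for illustration, not something confirmed in this thread.

```python
import hashlib
import json


def sketch_digests(dump_text):
    """Map each sketch name to a SHA-256 digest of its parameters and hashes.

    Assumes a JSON dump with top-level sketch parameters and a "sketches"
    list whose entries have "name" and "hashes". Names, comments and
    sequence lengths are deliberately left out of the digest.
    """
    dump = json.loads(dump_text)
    # Parameters that affect the hash values; keys absent from the dump are skipped.
    param_keys = ("kmer", "alphabet", "hashType", "hashBits", "hashSeed")
    params = {key: dump[key] for key in param_keys if key in dump}
    digests = {}
    for sketch in dump["sketches"]:
        payload = json.dumps(
            {"params": params, "hashes": sketch["hashes"]}, sort_keys=True
        )
        digests[sketch["name"]] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return digests


# Example (assumed invocation, file names are placeholders):
#   mash info -d file1.msh > file1.json
#   print(sketch_digests(open("file1.json").read()))
```

Two sketches with the same digest have identical hash lists and parameters, so they can be deduplicated with a simple dictionary lookup instead of all-vs-all. The reverse is not guaranteed: sketches built with different sketch sizes may still give a distance of 0 from `mash dist` while producing different digests here.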