Unable to run msmc2 on a single sample

Question

Unable to run msmc2 on a single sample

Buffan3369 opened this issue 2 years ago · 4 comments

Hi Stephan,

First of all, thanks a lot for making such an amazing tool that detailed and user-friendly.
I'm currently facing a very similar issue as the one raised here 6 years ago, that is that I cannot run MSMC2 in a single-haplotype case. I implemented a command like this

~/msmc2-2.1.3/build/release/msmc2 -I 0,1 -r 5 -t 6 -p 1*2+25*1+1*2+1*3 -o $PATH_TO_OUTPUT/output.msmc2 $PATH_TO_INPUT/msmc_input.txt

Which triggered this very strange and obscure error message (note that adding or removing the compared haplotypes -I 0,1, as it seemed to work with the person that raised the point 6 years ago, didn't change anything in my case):

error in parsing command line: object.Exception@model/data.d(111): Enforcement failed

??:? pure @safe noreturn std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, scope const(char)[]) [0x55a7f9df5716]
??:? pure @safe bool std.exception.enforce!().enforce!(bool).enforce(bool, lazy const(char)[], immutable(char)[], ulong) [0x55a7f9df5683]
??:? model.data.SegSite_t[][] model.data.readSegSites(immutable(char)[], ulong[2][], bool) [0x55a7f9e55246]
??:? model.data.SegSite_t[][] msmc2.readDataFromFiles(immutable(char)[][], ulong[2][], bool) [0x55a7f9e79ac1]
??:? void msmc2.parseCommandLine(immutable(char)[][]) [0x55a7f9e7804d]
??:? _Dmain [0x55a7f9e77997]

I basically checked 100 times the format of my input file and, at least in appearance, everything looks fine. I was able to run bamCaller.py and generate_multihetsep.py to finally get an input like this (with 44560 finally retained SNPs)

WXVQ01000001.1 2006 1682 AG
WXVQ01000001.1 2177 171 AG
WXVQ01000001.1 5306 2892 GT
WXVQ01000001.1 7278 1972 CT
WXVQ01000001.1 7604 326 AT
WXVQ01000001.1 7781 177 CT
WXVQ01000001.1 8024 234 AG
WXVQ01000001.1 8325 301 AG
WXVQ01000001.1 8670 345 GA
WXVQ01000001.1 11868 2884 CT
WXVQ01000001.1 11906 38 CT
WXVQ01000001.1 12921 900 GA
WXVQ01000001.1 13110 189 AT
WXVQ01000001.1 13237 127 TC
WXVQ01000001.1 13360 123 GA
WXVQ01000001.1 13403 43 GC
WXVQ01000001.1 14354 951 AG
WXVQ01000001.1 14394 40 TC
WXVQ01000001.1 14690 296 CT
WXVQ01000001.1 14750 60 AG
WXVQ01000001.1 14843 93 GA
WXVQ01000001.1 15394 551 GT
WXVQ01000001.1 15396 2 TG
WXVQ01000001.1 17693 2276 TA
WXVQ01000001.1 17804 111 CT
WXVQ01000001.1 19772 1968 CT
WXVQ01000001.1 20337 377 AC
WXVQ01000001.1 20402 65 GA
WXVQ01000001.1 20419 17 GA
WXVQ01000001.1 20528 109 GC

I also had no difficulties to implement your tutorial, so I think my parameterisation of the software has no issue neither.
Do you have an idea about to what it might be related to?

I may add that when I compiled msmc2, I got an error message like: "Deprecation: usage of the body keyword is deprecated. Use do instead." in five D-scripts from the original repo. Not knowing anything about D programming, I therefore manually replaced all the body commands by a do one, and compilation further seemed to work.
Also, I am not properly working at the chromosome level, but at the contig one. Hence, as generate_multihetsep.py only works "for a single chromosome", I splitted my reference sequence and applied the python script to as many subsets as I had contigs, that is, 992. Afterwards, I re-assembled everything together to get the table of which I pasted the first lines. Maybe it might be the cause of my issue, but I don't see why. Here are my data preparation codes from the BAM file (note that I have a converage depth of 221X):

#generate, for each contig, vcf and individual mask file (the latter containing the regions at which the individual has enough data)
bcftools mpileup -q 20 -Q 20 -C 50 -f reference.fasta paired_end_alignment.sorted.bam | bcftools call -c -V indels
| bgzip > intermediate_var_call.vcf.gz
bcftools index $MAIN_PATH/intermediate_var_call.vcf.gz`

#loop over all contigs, of which the names all contain a "W", hence the argument
for TIG in $(grep "W" reference.fasta)
do
CONTIG=${TIG#*>}
bcftools view intermediate_var_call.vcf.gz --regions $CONTIG |
python3 ./bamCaller.py 221 msmc_input/individual_masks/$CONTIG.mask.bed.gz | gzip -c >
msmc_input/individual_vcfs/nitz4.$CONTIG.vcf.gz #mask and vcf files generation
python3 ./generate_multihetsep.py --mask msmc_input/individual_masks/$CONTIG.mask.bed.gz
msmc_input/individual_vcfs/$CONTIG.vcf.gz >
msmc_input/individual_msmc_inputs/$CONTIG.multihetsep.txt
done

#now assemble the 992 files into one by stacking them together
cat msmc_input/individual_msmc_inputs/*.multihetsep.txt >> msmc_input/full.multihetsep.txt

I hope my issue is solvable... Anyway, thanks in advance!
All the best,
Lucas

Answer 1 · 2022-11-02T12:58:52.000Z

Thanks for the detailed documentation on what you did. I'll try to find out what caused the error. Note that the deprecation warnings about the body keyword are already on my radar. I will have to fix some code for those to go away.

One thing you could try immediately is chancing your chromosome name from WXVQ01000001.1 to something that doesn't contain a dot? I can't remember now what I enforce the chromosome name to be, but it's something to check. I will get back to you once I had a chance to check the code.

Answer 2 · 2022-11-02T13:46:21.000Z

Hi Stephan,

Thanks for your indications. I indeed changed the chromosome names to something simpler (from 1 to 992 as I had 992 scaffolds), but it seems like nothing changed.

I was maybe thinking that it might have something to do with the fact that I producedthe MSMC input without mappability mask, as I didn't have one (though the makeMappabilityMask.py script exists, this one requires itself a mappabilty mask + a FASTA file to produce a .BED file). From what I understood, such a mask just indicates the "regions of the genome where we think the mapping is of high quality", and is therefore not compulsory as long as we have at least the VCF mask, pointing the homozygous regions between called sites. But maybe it has a role to play?

I also explored the eventuality of a format error. For that respect, I tried to build by hand a 4-columns textfile that would mimic the input of msmc2, the latter being fairly simple in appearance. However, I have always received the same error as the one I reported when I tried to use my "handmade" input... EXCEPT when I transferred the information from the input produced on human chromosome 1 via your tutorial (EUR_AFR.chr1.multihetsep.txt) towards this handmade input. It's really strange. Hence, I don't think it's a format error. But I might be misleading.

All in all, thanks for considering my issue. Really appreciate. If you feel it might be necessary, I can share my files to you.
Best,
Lucas

Answer 3 · 2022-11-09T11:41:30.000Z

OK, since I have little time to test things, I simply had a look into the code in data.d (line 111) as indicated in the error. It's this line:

enforce(nrCalledSites <= pos - lastPos);

This enforces that the number of called sites is lower or equal than the distance from the last site to the focal one. Each row in the input file corresponds to a (typically polymorphic) site, and in the third column also the number of homozygous reference calls from the last position to this position. Obviously, you can't have more homozygous reference calls than simply alls sites since the last position. So you must have generated your input file with an error in that respect.

Makes sense?

Answer 4 · 2022-11-09T12:57:40.000Z

Yes, it makes sense, thanks. I'm going to check the line that don't fulfill this condition. Thanks Stephan