How would be the commands to run HapCUT2 with a combination of data?
rcosentino opened this issue · 15 comments
Hi,
this is not really an issue, but I could not find a better place to ask this. In the description and in the paper says that it is possible to combine datasets from different technologies to do the haplotyping but I cannot find how it should be done. I would like to combine SMRT reads and Hi-C data for haplotyping. I would be grateful if somebody can help me.
Best,
Raúl
Hi rcosentino,
I have been working on adding pre-written pipelines for experimental scenarios such as this to the recipes directory. I did not have one prepared for HiC+PacBio but I did for HiC+10X. I tweaked the HiC+10X pipeline and the instructions for a generic long read scenario, and placed it here:
https://github.com/pjedge/hapcut2/tree/master/recipes/HiC_Longread
I don't have time right now to test it, but I did make sure the new snakemake workflow is valid. I'm confident it should work. Please let me know if you have any issues running it with your dataset.
Hi pjedge,
thank you very much for the fast answer and the recipe for combining HiC and PacBio data. I never used snakemake workflows before and it took me a while to understand how it goes and adapt it to my data (it seems like a really usefull tool, though). The workflow seems to be valid, but when i run it, it crashes in the last step (hapcut2). But I think that the problem is already before because the output files of extracthairs are empty (both, the ones generated from the hic data and the ones generated from the PacBio data). I can not figure out what could be the problem. I am sending attached the last lines of the standard output including the error message.
last_lines_stdoutput_hapcut2_error_rcosentino.txt
Do you have a small test-dataset (maybe a set of short reads, a small genome and a vcf file) that I could use to test?
Best regards,
Raúl
I could get something together if need be. A common cause for empty fragment output is that the chromosome names are not matching (e.g. vcf has '1' while bams have 'chr1'). Is this the case for you? Can I see the first few lines of the bam files, and the first few lines of the vcf?
Hi pjedge,
The chromosome that I am using for the test are named 'Chr8_core_Tb427v8' and 'Chr9_core_Tb427v8', but they are just a subset of all the chromosomes ( Could there be an issue with underscore symbols?)
I am sending attached a file with the first few lines of one of the vcf files (the one from 'Chr8_core_Tb427v8') and the Hi-C bam file (grepping 'Chr8_core_Tb427v8').
head_hapcut2_bam_vcf_file_rcosentino.txt
Thanks for your help,
Best,
Raúl
How many heterozygous variants are there in the VCF?
They should be all (or most of them) heterozygous, because to generate the vcf file I mapped the same reads that i used to generate the genome assembly (and that is what I see when I visualize the mapped reads in a genome browser). But looking at the VCF file, it does not seem to be properly annotated in the info... or I do not know where to look at. I generated the vcf files using freebayes and short-read error-corrected PacBio reads. Attached I am sending a longer section of one of the VCF files in case you want to look at it.
here is the attached file
head20_vcf_file_rcosentino.txt
I have the same problem as @rcosentino. But the data I want to combine are Hi-C and WGS. I hope WGS can increase the resolution while Hi-C can yield chromosome long haplotype. Are there any pipeline or sample commands available? Really appreciated.
@rcosentino:
Please check that the heterozygous positions are present and annotated correctly. Keep in mind that only heterozygous positions that are properly annotated in the VCF (1/0, 0/1, 0|1, or 1|0 in the GT field) matter to extractHairs. If there are no such variants or no reads that span two such variants, then the fragment file will be empty because there are no haplotype-informative reads, and this would cause your problem. I notice all the positions in your sample file are 0/0 so they will be thrown out. Change them to 0/1 if you want HapCUT2 to use them. Also, note that the heterozygous calls used in HapCUT2 should be relatively confidently called. So if you think the calls are a little dubious it could hurt the assembled haplotype quality.
@serein927
The Pipeline I linked above (copied here) should work fine for HiC and WGS, since there are no differences in the commands for WGS and PacBio reads:
https://github.com/pjedge/hapcut2/tree/master/recipes/HiC_Longread
just use your WGS bamfile in place of the longread bam file. As with @rcosentino, be sure that your input VCFs have confident, properly annotated heterozygous calls.
Hi pjedge,
Thank you for all the help, changing the 0/0 to 0/1 in the VCF files fixed the issue, but does not seem to be a proper solution in the long term... I do not know why they are not being properly tagged with freebayes. Which tool do you use(recommend) to generate VCF files?
Best regards,
Raúl
@rcosentino look at your AD
genotype field, they are all 0,34
telling me that your calls should be homozygous alternate (1/1
). It's very odd that freebayes produced the wrong genotype. Could you post an issue on the freebayes site?
Agreed. @rcosentino if it's true that you assembled these reads into a genome, and then re-aligned the same reads to that genome as reference and called variants, then the only two possibilities should be:
(1) homozygous allele: shows up as reference call when realigned to own assembly
(2) heterozygous allele: shows up as heterozygous variant call when realigned to own assembly
How are there no "reference" (present in assembly) alleles present at your variant calls? It seems impossible that these reads are aligned to their own genome assembly, outside of alignment issues. For each of these positions, shouldn't the allele present in the reference just be the "alternate" allele that freebayes calls?
Thanks for your answers @nh13 and @pjedge . I posted an issue on the freebayes site (freebayes/freebayes#374). About your questions @pjedge , I am not sure If I understood them ... the VCF file does has alleles for REF in every variant call ... Could it be that you looked at the wrong column (the previos column (ID column) is the one that has missing values, '.') or did I misunderstood the question?
If you look at the AD field (third field in the last column), I'm mostly seeing things like 0,34 -- this means 0 reference, 34 alternate.
Also, it's strange to me that despite those allele counts, the calls are 0/0. Although I suppose the qualities are very low.
Yes, something is wrong, probably my mistake, but I don't know what and how to fix it, let see if the people from freebayes give me some hints.
Out of topic, what do the values in the ".htrans_model" output file represent?