running proali from custom gff files
aleksicmil opened this issue · 2 comments
Hi,
I am trying to run anchorwave as an alignment step of the buckler lab PHG pipeline.
proali step is failing, and seems to me like the problem is that refCDS.fa (and all other files generated in the preparatory steps) are empty. AssemblyMAFFromAnchorWavePlugin from PHG runs the following command:
anchorwave gff2seq -r /path/to/MAIZE_B73.fa -i /path/to/ensembl_corn_v36-gene-model.noscaffolds.gff -o /path/to/refCDS.fa
My GFF file looks like this (just first gene is showed):
1 gramene gene 44289 49837 0 + . ID=Zm00001d027230;Name=Zm00001d027230;
1 gramene mRNA 44289 49837 0 + . ID=Zm00001d027230_T001;Parent=Zm00001d027230;Name=Zm00001d027230_T001;biotype=protein_coding;Note=Mitochondrial transcription termination factor family protein;
1 gramene polypeptide 44351 47995 0 + . ID=Zm00001d027230_P001;Derives_from=Zm00001d027230_T001;Name=Zm00001d027230_P001;
1 gramene exon 44289 44947 0 + . Name=Zm00001d027230_T001.e1;Parent=Zm00001d027230_T001;
1 gramene exon 45666 45803 0 + . Name=Zm00001d027230_T001.e2;Parent=Zm00001d027230_T001;
1 gramene exon 45888 46133 0 + . Name=Zm00001d027230_T001.e3;Parent=Zm00001d027230_T001;
1 gramene exon 46229 46342 0 + . Name=Zm00001d027230_T001.e4;Parent=Zm00001d027230_T001;
1 gramene exon 46451 46633 0 + . Name=Zm00001d027230_T001.e5;Parent=Zm00001d027230_T001;
1 gramene exon 47045 47262 0 + . Name=Zm00001d027230_T001.e6;Parent=Zm00001d027230_T001;
1 gramene exon 47650 48111 0 + . Name=Zm00001d027230_T001.e7;Parent=Zm00001d027230_T001;
1 gramene exon 48200 49247 0 + . Name=Zm00001d027230_T001.e8;Parent=Zm00001d027230_T001;
1 gramene exon 49330 49837 0 + . Name=Zm00001d027230_T001.e9;Parent=Zm00001d027230_T001;
Could you please tell me what could cause the issue with an empty output refCDS.fa file? Is there something in the GFF that is not correct? Thanks.
Best,
Milica
Followup:
I continued working on this today as I figured there must be something wrong with the gff file. Since I am working with corn, I tried using ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gff3/zea_mays/Zea_mays.AGPv4.34.gff3.gz
from https://github.com/baoxingsong/genomeAlignment/tree/main/alignb73againstmo17. The only change I did to the file was to remove the scaffolds.
So now, I am able to create refCDS.fa that is not empty and the SAM file, but when I run proali using the following command:
anchorwave proali -i Zea_mays.AGPv4.34.noscaffolds.gff3
-r MAIZE_B73.fa
-as refCDS.fa
-a MAIZE_DK105.sam
-ar MAIZE_B73.sam
-s MAIZE_DK105.fa
-n MAIZE_DK105_MAIZE_B73.anchorspro
-R 1 -Q 1 -t 4
-o MAIZE_DK105.maf
I am getting this error:
setupAnchorsWithanchorwave_avx2: /opt/conda/conda-bld/anchorwave_1676953230719/work/src/service/TransferGffWithNucmerResult.cpp:450: void readSam(std::vector<AlignmentMatch>&, std::ifstream&, std::map<std::__cxx11::basic_string<char>, Transcript>&, int&, const double&, double&, std::set<std::__cxx11::basic_string<char> >&, int32_t&, int32_t&, int32_t&, int32_t&, int&, bool&, int&): Assertion `transcriptHashMap[elems[0]].getPStart() == chromosomePosition' failed.
/opt/miniconda/bin/anchorwave: line 22: 184 Aborted (core dumped) "${EXE}_avx2" "$@"
By checking some other issues on this page (Issues #2 or #16), seems to me like the issue is still in the GFF file, which as a result creates SAM file that anchorwave can't process.
Could you please share some guidelines on how I can ensure that my GFF file is ok? Or, if there are any other things that I need to ensure before running proali?
Thank you very much in advance, your help is much appreciated.
Best,
Milica
- Which version of AnchorWave are you using, please?
- Where did not get the file "MAIZE_B73.fa", is that V4?
- Could you delete "MAIZE_DK105_MAIZE_B73.anchorspro ", rerun and then share all the commands you were using, please?