Chromosome path not found in graph or haplotypes index
Opened this issue · 9 comments
1. What were you trying to do?
I used vg rna
to build a spliced pangenome and a pantranscriptome with a GFA file from PGGB and a gff3 file. Similarly, I used vg autoindex
to build index for mpmap and rpvg.
2. What did you want to happen?
Get a spliced pangenome and a pantranscriptome from vg rna
. Get index files from vg autoindex
.
3. What actually happened?
I got a error ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2)
from vg rna
and vg autoindex
.
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt
, please copy-paste the contents of that file here:
## vg rna
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00261929 seconds, 0.0147743 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).
## vg autoindex
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w.gfa -x mh_chr1.gff3
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing a GBZ from GFA input.
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).
5. What data and command can the vg dev team use to make the problem happen?
This is a small graph from PGGB. Two haplotypes in the graph are named MHChr01#0#chr01 and ZSChr01#1#chr01.
sub.gfa.gz
Then I convert to GFA with W lines.
sub_w.gfa.gz
command: vg convert -f sub.gfa > sub_w.gfa
This is a gff3 file.
mh_chr1.gff3.gz
commands for vg rna
and vg autoindex
vg rna -p -t 64 -a -r -e -n mh_chr1.gff3 -s Parent -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_w.gfa > spliced.gfa
vg autoindex -p auto -w mpmap -w rpvg -g sub_w.gfa -x mh_chr1.gff3
The paths in sub_w.gfa
W MHChr01 0 chr01 0 26966 >1>2>4
W ZSChr01 1 chr01 0 45371 >1>3>4
The gff3 file
##gff-version 3
MHChr01 TBtools gene 7552 15388 . + . ID=OsMH_01G0000025
MHChr01 GSAman mRNA 7552 15388 . + . ID=OsMH_01T0000025.1;Parent=OsMH_01G0000025;Name=OsMH_01T0000025.1;ManInfo=EVD:ISO/EVD:RNA/REF:Nipp/UTR TBD
MHChr01 GSAman exon 7552 7831 . + . Parent=OsMH_01T0000025.1
MHChr01 TBtools five_prime_UTR 7552 7831 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 7917 8179 . + . Parent=OsMH_01T0000025.1
MHChr01 TBtools five_prime_UTR 7917 8011 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 8012 8179 . + 0 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 8920 9018 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 8920 9018 . + 0 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 10020 10123 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 10020 10123 . + 0 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 11699 12507 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 11699 12507 . + 1 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 12591 12713 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 12591 12713 . + 2 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 12795 12883 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 12795 12883 . + 2 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 12971 13171 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 12971 13171 . + 0 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 13773 14184 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 13773 14184 . + 0 Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 14667 14750 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman CDS 14667 14677 . + 2 Parent=OsMH_01T0000025.1
MHChr01 TBtools three_prime_UTR 14678 14750 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 14837 14993 . + . Parent=OsMH_01T0000025.1
MHChr01 TBtools three_prime_UTR 14837 14993 . + . Parent=OsMH_01T0000025.1
MHChr01 GSAman exon 15067 15388 . + . Parent=OsMH_01T0000025.1
MHChr01 TBtools three_prime_UTR 15067 15388 . + . Parent=OsMH_01T0000025.1
I don't know why it can't find chr paths. I tried to run vg autoindex
with gfa_with_w_lines.gfa and gfa_with_w_lines.gtf in vg test and it can work.
Then I tried to rename two haplotypes to MHChr01 and ZSChr01, so the GFA file can only be P lines.
P MHChr01 1+,2+,4+,5+
P ZSChr01 1+,3+,4+,6+
Now vg rna
can work.
$ vg1_59 rna -p -t 64 -a -r -e -n mh_chr1.gff3 -s Parent -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub.gfa > spliced.gfa
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00922727 seconds, 0.0147705 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
Parsed 4 transcripts
Constructed 4 reference transcript paths
Updated graph with reference transcript paths
[vg rna] Transcripts parsed and graph updated in 0.0158957 seconds, 0.023922 GB
[vg rna] Projecting transcripts to haplotypes ...
Parsed 4 transcripts
Projected 3 haplotype-specific transcript paths
[vg rna] Haplotype-specific transcripts constructed in 0.00692912 seconds, 0.0252037 GB
[vg rna] Topological sorting graph and compacting node ids ...
Sorted 550 nodes
[vg rna] Graph sorted and compacted in 0.000983271 seconds, 0.0252037 GB
[vg rna] Adding reference and projected transcripts as embedded paths in the graph ...
Embedded 7 paths in graph
[vg rna] Transcript paths added in 0.00014651 seconds, 0.0252037 GB
[vg rna] Writing pantranscriptome transcripts to file(s) ...
[vg rna] Writing splicing graph to stdout ...
[vg rna] Graph and pantranscriptome written in 0.119253 seconds, 0.281044 GB
But vg autoindex
still can't work.
$ vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub.gfa -x mh_chr1.gff3
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub.gfa -x mh_chr1.gff3
[IndexRegistry]: Checking for haplotype lines in GFA.
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Reference GFA
are insufficient to create target index Haplotype-Transcript GBWT
Could you help me see what the problem is? Thanks for your help.
6. What does running vg version
say?
vg version v1.59.0 "Casatico"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu
vg autoindex --workflow rpvg
and vg rna -b
both require haplotypes, so you need to communicate to interface that you have a source of haplotypes. In your case, that would be W lines in the GFA. However, it appears that what you actually have are reference annotations, and you are simply reinterpreting a reference sequence as a haplotype. When you do that, it confuses the interface, because its looking for a reference sequence to match the annotations, not a haplotype sequence.
I think the correct solution in your case would be to either 1) only convert the non-reference sequence into W lines, or 2) add the RS
tag to the GFA header to indicate that a certain sample corresponds to the reference sequence.
I should also note that the GFF for a given haplotype is expected to follow the Pan-SN naming specification, which has semantics sample#haplotye_number#contig
. In your W and P lines, it looks to me like you are mixing the contig name into the sample, which is not really what's intended.
I should also note that the GFF for a given haplotype is expected to follow the Pan-SN naming specification, which has semantics sample#haplotye_number#contig. In your W and P lines, it looks to me like you are mixing the contig name into the sample, which is not really what's intended.
I know that. I set it in this way solely to better represent my samples for now.
I think the correct solution in your case would be to either 1) only convert the non-reference sequence into W lines, or 2) add the RS tag to the GFA header to indicate that a certain sample corresponds to the reference sequence.
I tried these solutions.
- I convert the non-reference sequence into W lines, and rename reference sequence as MHChr01. The
vg autoindex
andvg rna
can work.
The paths in the GFA file.
P MHChr01 1+,2+,4+,5+,6+,8+
W ZSChr01 0 chr01 0 45371 >1>3>4>6>7>8
- I tried to only add the RS tag to the GFA header in
sub_w.gfa
. But bothvg rna
andvg autoindex
can't work.
It seems that the GFA file has strict format. It must include the reference path with P line and haplotype paths, and the name of reference path cannot follow the Pan-SN naming specification.
In fact, for this species, I have multiple genomes and corresponding annotation files. This means that each haplotype may be considered as the reference.
I tried to run with two annotation files(MHChr01 gff3 and ZSChr01 gff3) .
- If I want to run
vg autoindex
, I must manually set all samples as reference and haplotype in the GFA file at the same time.
The command isvg autoindex -p auto -w mpmap -w rpvg -g sub_chr01.gfa -x mh_chr1.gff3 -x zs_chr1.gff3 -a Parent
The paths in the GFA file.
P MHChr01 1+,2+,4+,5+,6+,8+
P ZSChr01 1+,3+,4+,6+,7+,8+
W MHChr01 0 chr01 0 26966 >1>2>4>5>6>8
W ZSChr01 0 chr01 0 45371 >1>3>4>6>7>8
I got the result from auto.txorigin.tsv
.
Name Length Transcripts Haplotypes
OsMH_01T0000100.2_R1 2507 OsMH_01T0000100.2 MHChr01,MHChr01#0#chr01
OsMH_01T0000100.2_H1 2507 OsMH_01T0000100.2 ZSChr01#0#chr01
OsZS_01T0000100.1_R1 363 OsZS_01T0000100.1 ZSChr01,ZSChr01#0#chr01
The I used the same GFA file to run vg rna
. The command is vgrna -p -t 64 -k 256 -a -r -e -n mh_chr1.gff3 -n zs_chr1.gff3 -s Parent -f sub_chr01_pantranscriptome.fa -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_chr01.gfa > spliced.gfa
I got the result from pantranscriptome.txt
.
Name Length Transcripts Haplotypes
OsMH_01T0000100.2_R1 2507 OsMH_01T0000100.2 MHChr01,MHChr01#0#chr01#0
OsMH_01T0000100.2_H1 2507 OsMH_01T0000100.2 ZSChr01,ZSChr01#0#chr01#0
OsZS_01T0000100.1_R1 363 OsZS_01T0000100.1 ZSChr01,ZSChr01#0#chr01#0
- When I want to run
vg rna
again, I found only set all samples as reference. However,vg autoindex
can't treat the new GFA as input.
The paths in the new GFA file.
P MHChr01 1+,2+,4+,5+,6+,8+
P ZSChr01 1+,3+,4+,6+,7+,8+
I got the result from pantranscriptome.txt
.
Name Length Transcripts Haplotypes
OsMH_01T0000100.2_R1 2507 OsMH_01T0000100.2 MHChr01
OsMH_01T0000100.2_H1 2507 OsMH_01T0000100.2 ZSChr01
OsZS_01T0000100.1_R1 363 OsZS_01T0000100.1 ZSChr01
It seems that the results are the same, but Haplotypes
column is different. Which is the best way to achieve it? I prefer to use the new GFA in the second way to build all the required files through vg autoindex
. But it's a pity that it can't work.
If you have annotations for the other haplotypes, then that's what the -j
option in vg rna
and -H
in vg autoindex
are intended for. Right now it seems like there's something getting mixed up that is both adding the annotations as reference transcripts and projecting onto other haplotypes (which already have the annotations).
Can you specify what the RS
tag and the W
lines looked like when you said it didn't work? Also what error did it give you? I expect that this will be the most direct route to getting the spliced pangenome graph that you want.
Can you specify what the RS tag and the W lines looked like when you said it didn't work?
This is the GFA file includes RS
tag and w
lines. The MHChr01 annotation file has been given previously.
H VN:Z:1.1 RS:Z:MHChr01
W MHChr01 1 chr01 0 26966 >1>2>4>5>6>8
W ZSChr01 2 chr01 0 45371 >1>3>4>6>7>8
vg rna
$ vg1_59 rna -p -t 64 -k 256 -a -r -e -n mh_chr1.gff3 -s Parent -f sub_chr01_pantranscriptome.fa -i sub_chr01_pantranscriptome.txt -b sub_chr01_pantranscriptome.gbwt sub_w_set_ref.gfa > spliced.gfa
[vg rna] Parsing graph file ...
[vg rna] Graph parsed in 0.00377116 seconds, 0.0202713 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
ERROR: Chromosome path "MHChr01" not found in graph or haplotypes index (line 2).
vg autoindex
$ vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gff3 -a Parent
[vg autoindex] Executing command: vg1_59 autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gff3 -a Parent
[IndexRegistry]: Checking for haplotype lines in GFA.
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Reference GFA
are insufficient to create target index Haplotype-Transcript GBWT
Maybe I confused something. Before I figure it out, the GFA file sub.gfa
I use now is like this.
P MHChr01 1+,2+,4+,5+,6+,8+
P ZSChr01 1+,3+,4+,6+,7+,8+
If you have annotations for the other haplotypes, then that's what the -j option in vg rna and -H in vg autoindex are intended for.
Based on this GFA files, I have no way to use vg autoindex
, which must check the presence of references and haplotypes.
I don't know how to use -j
option properly. What does the -j
option mean? The explanation is use haplotype paths in GBWT index as reference sequences (disables projection)
. My understanding is that the annotation file of each genome only traverses the corresponding genome for annotation, and does not project to other haplotypes. For example, MHChr01.gff3 only annotates MHChr01, and ZSChr01.gff3 only annotates ZSChr01.
Now I tried to use vg rna
. The command is vg1_59 rna -p -t 64 -j -l new_sub_chr01_pantranscriptome.gbwt -s Parent -n mh_chr1.gff3 -n zs_chr1.gff3 -a -r -b pan.gbwt -i pan.txt sub.gfa > spliced.gfa
. The new_sub_chr01_pantranscriptome.gbwt
file includes the index of the two haplotypes. When I tried to stop using -j
and -l
, it seemed like I got the same result. The command is vg1_59 rna -p -t 64 -s Parent -n mh_chr1.gff3 -n zs_chr1.gff3 -a -r -b pan.gbwt -i pan.txt sub.gfa > spliced.pg
Name Length Transcripts Haplotypes
OsMH_01T0000025.1_R1 2943 OsMH_01T0000025.1 MHChr01
OsMH_01T0000075.1_R1 2283 OsMH_01T0000075.1 MHChr01
OsMH_01T0000100.1_R1 2611 OsMH_01T0000100.1 MHChr01
OsMH_01T0000100.2_R1 2507 OsMH_01T0000100.2 MHChr01
OsZS_01T0000100.1_R1 363 OsZS_01T0000100.1 ZSChr01
OsZS_01T0000200.1_R1 795 OsZS_01T0000200.1 ZSChr01
However, it may be difficult to know haplotype specific transcripts clearly. When I read the article Haplotype-aware pantranscriptome analyses using spliced pangenome graphs
, I found that the construction of splicing pangenome is through VCF + reference FASTA, and then projected the reference annotation onto the haplotype to obtain haplotype-specific transcripts. For a transcript that can be projected onto a haplotype, it has the same length, the same start and stop sites on both the reference and haplotype. The difference between them on the reference and haplotype is only caused by the variation of some bases. I don't know if my understanding is correct?
When I have multiple genomes and their corresponding annotations, I don't know how I should build them correctly. Because I have multiple annotation files, the same gene may have different start and end positions and different lengths on two haplotypes. There are also some transcripts that appear on specific haplotypes. These are two types of haplotype specific transcripts in my ideal situation.
Don't worry, I'm just organizing my thoughts and ideas. I really look forward to your suggestions.
When I read the article Haplotype-aware pantranscriptome analyses using spliced pangenome graphs, I found that the construction of splicing pangenome is through VCF + reference FASTA, and then projected the reference annotation onto the haplotype to obtain haplotype-specific transcripts. For a transcript that can be projected onto a haplotype, it has the same length, the same start and stop sites on both the reference and haplotype. The difference between them on the reference and haplotype is only caused by the variation of some bases.
I'm sorry. I misunderstood that transcripts don’t all have the same length. But I have a question: when I project the reference annotation onto the haplotype, not all transcript annotations are projected onto the haplotype. In what situations would that happen?
I think to use the GFA as input for vg rna
, you would need to first convert it to a GBZ file with vg gbwt
and then provide the GBZ as input to vg rna
with the -z
option.
With the vg autoindex
command, it seems to be not detecting that your GFA has haplotypes in it, but it's not totally clear to me why. It should be enough to have a single W line that is not designated as a reference path with the RS
tag. It seems like you already have this boiled down to a minimal reproducing example, so if you want to provide the inputs to me, I can take a look.
Regarding the -j
option, you are correct that vg rna
will not do projection if you provide this option. For users that only have annotations for a reference genome, vg rna
can project the reference transcripts onto haplotypes by finding haplotypes that traverse each exon. However, there are other tools/pipelines that are more capable at comparative annotation, so if you have external annotations for the haplotypes, that's generally preferred. Accordingly, providing -j
to vg rna
or -H
to vg autoindex
automatically disables projection.
Hi,
I have the same issue here. I generated the graph using minigraph Cactus with different reference samples (see below).
<style> </style>seqFile | chm13,hg002.1, hg002.2 | chm13,hg002mat, hg002pat | hg002mat, hg002pat |
---|---|---|---|
reference(RS header in GFA) | chm13 | chm13 hg002mat hg002pat | hg002mat hg002pat |
vg path -L -x | chm13#0#chr22 hg002#1#chr22#0 hg002#1#chr22#1588801 hg002#1#chr22#8063369 hg002#1#chr22#8801348 hg002#1#chr22#9233182 hg002#1#chr22#9296730 hg002#1#chr22#9502398 hg002#1#chr22#17529314 hg002#1#chr22#21081973 hg002#1#chr22#21309690 hg002#2#chr22#276 hg002#2#chr22#1640972 hg002#2#chr22#2109789 hg002#2#chr22#4393926 hg002#2#chr22#5153161 hg002#2#chr22#13673342 hg002#2#chr22#17402067 | chm13#0#chr22 hg002mat#0#chr22 hg002mat#0#chr22[1588801] hg002mat#0#chr22[8063369] hg002mat#0#chr22[8801348] hg002mat#0#chr22[9233182] hg002mat#0#chr22[9296730] hg002mat#0#chr22[9502398] hg002mat#0#chr22[17529314] hg002mat#0#chr22[21081973] hg002mat#0#chr22[21309690] hg002pat#0#chr22[276] hg002pat#0#chr22[1640972] hg002pat#0#chr22[2109789] hg002pat#0#chr22[4393926] hg002pat#0#chr22[5153161] hg002pat#0#chr22[13673342] hg002pat#0#chr22[17402067] | hg002mat#0#chr22 hg002pat#0#chr22[7613] hg002pat#0#chr22[696114] hg002pat#0#chr22[1580247] hg002pat#0#chr22[2109789] hg002pat#0#chr22[3713744] hg002pat#0#chr22[13674874] hg002pat#0#chr22[17239804] |
I’ve tried to match the gene annotation file with the chromosome names, but I’m still encountering the same error message. How can I resolve this? Could you please provide me with an example of the gene annotation file?
My script looks like this:
vg rna -p --threads $SLURM_CPUS_PER_TASK \
-q --transcripts test.gtf --gbz-format \
--write-gbwt ${chr}.pantranscriptome.gbwt \
--use-hap-ref \
--write-info ${chr}.pantranscriptome.txt \
-f ${chr}.pantranscriptome.fa \
minigraph.chr22.gbz \
--gbwt-bidirectional > ${chr}.spliced_graph.pg
@jeizenga Sorry for the late response, I was on vacation for a week. It would be great if you could still take some time to take a look. Here are the GFA file and annotation file I used.
GFA file (W
lines with RS
tag):
sub_w_set_ref.gfa.gz
gbz file:
sub_chr01_pantranscriptome_set_ref.gbz.gz
two annotation files:
mh_chr1.gtf.gz
zs_chr1.gtf.gz
I can't work with the following commands with this GFA file.
# vg rna
vg rna -p -t 64 -k 256 -n mh_chr1.gtf -n zs_chr1.gtf -r -f chr01_pantranscriptome.fa -i chr01_pantranscriptome.txt -b chr01_pantranscriptome.gbwt sub_w_set_ref.gfa > spliced.gfa
# vg rna gbz
vg rna -p -t 64 -k 256 -n mh_chr1.gtf -n zs_chr1.gtf -r -f chr01_pantranscriptome.fa -i chr01_pantranscriptome.txt -b chr01_pantranscriptome.gbwt -z sub_chr01_pantranscriptome_set_ref.gbz > spliced.gfa
# vg autoindex
vg autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -x mh_chr1.gtf -x zs_chr1.gtf
vg autoindex -p auto -w mpmap -w rpvg -g sub_w_set_ref.gfa -H mh_chr1.gtf -H zs_chr1.gtf
Accordingly, providing -j to vg rna or -H to vg autoindex automatically disables projection.
Maybe I wasn’t clear before, but I find this a bit confusing. When I specify multiple annotation files without enabling projection, vg rna
can annotate on the corresponding haplotype in two ways: 1) Without the -e
option and other construction options, it will not do projection, and it will annotate on the corresponding haplotype. 2) -j
must be specified together with -l
, which is the GBWT index file of haplotypes. I also need to build an additional GBWT index file of haplotypes. Both methods yield the same results, but the second way seems more troublesome than the first way. I can't see the advantages of the second way.
@jeizenga I think I have figured out how vg rna
and vg autoindex
run correctly. Here are the tests I conducted and the required documents. Perhaps it will be clearer.
vg rna
# GFA: two haplotypes are written as W lines
# GTF: the name seems necessary to follow SAMPLE#HAPLOTYPE#LOCUS#PHASE_BLOCK(may be the old path metadata model of vg?)
vg rna -p -n mh_chr1.gtf -n zs_chr1.gtf sub_w.gfa > spliced.gfa
# projection
vg rna -p -a -r -e -n mh_chr1.gtf sub_w.gfa -i hap.txt > spliced.gfa
The GFA file and GTF file used in vg rna
.
# GFA
H VN:Z:1.1
W MHChr01 1 chr01 0 26966 >1>2>4>5>6>8
W ZSChr01 2 chr01 0 45371 >1>3>4>6>7>8
# GTF
MHChr01#1#chr01#0 GSAman transcript 7552 15388 . + . transcript_id "OsMH_01T0000025.1"; gene_id "OsMH_01G0000025"
ZSChr01#2#chr01#0 makerRMon transcript 11754 12116 . - . transcript_id "OsZS_01T0000100.1"; gene_id "OsZS_01G0000100"
vg autoindex
# vg autoindex for rpvg must need one reference(written as P line or W line with RS tag) and one haplotype in GFA file, and specified the reference annotation file with -x, --tx-gff option
# GFA: The reference can be written as P line or W line with RS tag. The haplotype is written as W line.
# GTF: The name of reference written as P line is the same as the name of P line. The name of haplotype or reference with W line need to follow PanSN(SAMPLE#HAPLOTYPE#LOCUS/contig_name) instead of SAMPLE#HAPLOTYPE#LOCUS#PHASE_BLOCK
vg autoindex -p auto -w rpvg -w mpmap -g sub_w.gfa -x mh_chr1.gtf -H zs_chr1.gtf
The GFA file and GTF file used in vg autoindex
.
# GFA
H VN:Z:1.1 RS:Z:MHChr01
W MHChr01 1 chr01 0 26966 >1>2>4>5>6>8
W ZSChr01 2 chr01 0 45371 >1>3>4>6>7>8
# GTF
MHChr01#1#chr01 GSAman transcript 7552 15388 . + . transcript_id "OsMH_01T0000025.1"; gene_id "OsMH_01G0000025"
ZSChr01#2#chr01 makerRMon transcript 11754 12116 . - . transcript_id "OsZS_01T0000100.1"; gene_id "OsZS_01G0000100"
Although vg autoindex
can work now, it seems unable to perform haplotype projection and can only annotate the corresponding haplotypes based on the annotation file. Perhaps I haven't found the correct approach, and I would like to know whether vg autoindex
can perform haplotype projection based on reference annotations.