Obtain path sequences as specified in P-lines
Closed this issue · 11 comments
Hi, thanks for providing this great toolkit!
I found this toolkit to be the only one I could find that is able to provide the fasta sequence for a path. The only thing is, I have to manually specify this. Would it be possible to supply the PathName
of a path (or by default obtain the sequence for all paths)? My current workaround for one path is:
pathname="G1S1" #name of the path
pathsegments=$(awk -vpathname=${pathname} 'BEGIN{FS = OFS = "\t";} $1=="P" && $2==pathname {print $3;}' test.gfa) #obtaining the order of segments for the path
gfatk path test.gfa "${pathsegments}" | awk -vpathname=${pathname} '/^>/{$1=">"pathname} {print;}' #use gfatk for obtaining the sequence and change fasta header back to name of the path
However, ideally I think it should be possible to say something like gfatk path test.gfa --all > all_paths.fa
, if P-lines are specified in the GFA format. (I have not found any tool that can do this but yours seems closest.) What are your thoughts on this?
Hello! Glad you find this tool useful :)
Yes, that is a bit of an annoying workaround I admit. I never got around to implementing P
segments in the toolkit, as I wasn't using them at the time. That's the only reason they are not in there.
I think you are right, this should be easy enough to implement it. I'll take a look and update the code. A nice API might be:
// for all paths like you say
gfatk path test.gfa --all > all_paths.fa
// for a specific named path
gfatk path test.gfa --name G1S1 > G1S1.fa
What do you think? Then a user can get a specific named path.
That looks perfect! It would be great to have it work like that :)
My specific usecase is that I would like to see whether GFA files constructed from specific sequences still contain the exact same sequences I used as input.
I've added an implementation of this (not the --name
bit yet). You'll have to compile until I get it updated on conda etc, hope that's okay!
Thanks a lot, this works for me!
Maybe another thing to consider is that the fasta header of large GFA files becomes really large as --all
now puts all segments and their orientation in the fasta header. I'd say it is acceptable to have the PathName
to be the only thing in the fasta header when obtaining all paths (they are likely very long paths, otherwise one could just simply write the segments by hand).
Great - can change that. Out of interest how long are your paths on average?
I am working with compressed de Bruijn graphs, so I recently created a bacterial cDBG with one genome for which I built a GFA file. The path for this bacterial sequence had >900,000 segments, so already using less
on the P-line in the GFA file was super slow. In the end, I want to create cDBGs for higher plants and animals and I expect the graphs to easily contain tens of millions of segments per sequence (in this case, per chromosome).
Woah, that's waaay(!!) bigger than I had thought. Do let me know what the performance is like, that would be interesting! I've really barely made any optimisations, so lots of room for improvement.
I already saw that in the code but I was surprised by the speed! Probably in a couple of weeks, I'll try it out on properly large sets such as the plant and animal sets, but this is the speeds on my two "small" test GFA files:
% cut -f 1 tmp.gfa | sort | uniq -c #to get some stats on the GFA file
1 H
161 L
4 P
121 S
% awk '$1=="P"{n=split($3,a,","); print $2,n;}' tmp.gfa #to get some stats on the number of segments per path
G1S1 172
G2S1 148
G3S1 160
G4S1 149
% time gfatk path -a tmp.gfa > tmp.gfa.fa #time the actual command
gfatk path -a tmp.gfa > tmp.gfa.fa 0.00s user 0.00s system 76% cpu 0.010 total
% cut -f 1 prokka.gfa | sort | uniq -c #to get some stats on the GFA file
1 H
877962 L
1 P
574074 S
% awk '$1=="P"{n=split($3,a,","); print $2,n;}' prokka.gfa #to get some stats on the number of segments per path
G1S1 916827
% time gfatk path -a prokka.gfa > prokka.gfa.fa #time the actual command
gfatk path -a prokka.gfa > prokka.gfa.fa 2.13s user 0.21s system 99% cpu 2.347 total
I'll update you later on the large sets!
Thanks for the info! Super helpful - faster than I thought too, so that's something!
I just ran it on an Arabidopsis thaliana cDBG built from 8 genomes. These are some stats:
$ cut -f 1 arabidopsis.gfa | sort | uniq -c #to get some stats on the GFA file
1 H
18763264 L
844 P
11726287 S
$ time gfatk path -a arabidopsis.gfa | fold -w60 > arabidopsis.fa #time the actual command
real 348m27.145s
user 332m5.310s
sys 16m27.271s
$ seqkit stats arabidopsis.fa #to get some stats on the resulting FASTA file
file format type num_seqs sum_len min_len avg_len max_len
arabidopsis.fa FASTA DNA 844 962,397,596 86 1,140,281.5 30,427,671
This is clearly starting to push the limits, but it's still surpassing my expectations!
That's a huge GFA! I'm glad to see it still has some utility on those. So it's actually not so slow on single paths (<30s) but you have over 800 there!
Thanks for these benchmarks, I'll probably put them in the README!