Confusion regarding arrow formatted files

Question

TBradley27 opened this issue 8 months ago · 6 comments

Hello,

The cramino documentation states that it can produce a file in arrow format, which can then be used with NanoPlot.

However, the NanoPlot documentation does not describe how this arrow file can be used.

Answer 1 · 2024-05-06T06:48:48.000Z

Hi,

Oh yes, I see that it is poorly described. Thanks for letting me know. The arrow files are, confusingly enough, the same as feather files, and I have now stated that in the documentation.
So you can use NanoPlot with --arrow to specify the arrow input files.
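
If you want to call this from a script, a minimal sketch could look like the one below. The `--arrow` flag is the one mentioned above; the helper names, the file name `reads.arrow` and the output directory are made up for illustration, and I am assuming NanoPlot's usual `--outdir` option is what you want for the output location.

```python
import shutil
import subprocess
from typing import List


def nanoplot_arrow_cmd(arrow_file: str, outdir: str) -> List[str]:
    """Build a NanoPlot command line that reads an arrow/feather file.

    Hypothetical helper: only the --arrow flag comes from this thread;
    --outdir is assumed to be the desired way to set the output folder.
    """
    return ["NanoPlot", "--arrow", arrow_file, "--outdir", outdir]


def run_nanoplot(arrow_file: str, outdir: str) -> None:
    """Run NanoPlot on the arrow file, failing early if it is not installed."""
    if shutil.which("NanoPlot") is None:
        raise RuntimeError("NanoPlot was not found on PATH")
    # check=True raises CalledProcessError if NanoPlot exits with an error.
    subprocess.run(nanoplot_arrow_cmd(arrow_file, outdir), check=True)
```

For example, `run_nanoplot("reads.arrow", "nanoplot_out")` would run NanoPlot on a (hypothetical) `reads.arrow` file.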

Best,
Wouter

Answer 2 · 2024-05-06T13:40:18.000Z

Many thanks for this!

That is very helpful.

Just one minor note: it would also be helpful if the table in the 'plots generated' section of the README had a column for arrow/feather formatted data.

Many thanks again!
Thomas

Answer 3 · 2024-05-06T18:55:21.000Z

Hmm, no, that wouldn't be accurate. An arrow file is essentially a dataframe of features, and which plots can be generated depends on how the file was created.
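
To see which features a given file actually carries, you can load it as a dataframe and look at its columns. The sketch below is only an illustration: the column names in `WANTED` are hypothetical placeholders (check the real names with `df.columns` on your own file), and it assumes pandas and pyarrow are installed for the file-reading part.

```python
from typing import Iterable, List

# Hypothetical feature names, used only for illustration. The real column
# names depend on the tool that wrote the file.
WANTED = ["lengths", "identities", "quals", "mapQ"]


def missing_features(columns: Iterable[str], wanted: Iterable[str] = WANTED) -> List[str]:
    """Return the wanted feature columns that are absent, in the order asked."""
    present = set(columns)
    return [name for name in wanted if name not in present]


def report(path: str) -> None:
    """Print the columns of an arrow/feather file and which wanted ones are missing."""
    # Imported here so missing_features() works without pandas installed.
    import pandas as pd  # read_feather also requires pyarrow

    df = pd.read_feather(path)
    print("columns:", list(df.columns))
    print("missing:", missing_features(df.columns))
```

A plot that depends on a column listed as missing cannot be drawn from that file, whichever tool reads it.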

Answer 4 · 2024-05-06T20:17:09.000Z

Thanks, that makes sense.

I generated an arrow formatted file from a sorted bam file. When I ran the arrow formatted file through NanoPlot using --feather, the report I got back didn't include plots relating to read quality scores or to mapping quality scores. This differs from the behaviour I see when I pass the sorted bam file directly to NanoPlot using --bam.

Answer 5 · 2024-05-06T20:24:34.000Z

Yes, that is as expected. In my opinion, read quality scores are less informative than sequence identity scores. Therefore, cramino doesn't extract/calculate them, and they're not in the arrow file. It is a matter of being efficient. If you care a lot about mapping quality, you could also use https://github.com/wdecoster/make_arrow

Answer 6 · 2024-05-06T20:39:38.000Z

Thanks for that, I will check it out. As the original issue has been fixed, I am happy for this issue to be closed.