sidora-tools/sidora.core

Analysis tab format is inconsistent with the other tabs

Opened this issue · 4 comments

This is more of a formatting change request than an issue. I wanted to use some information in the Analysis tab, but this information is presented differently from the other tabs. That made selecting/filtering for what I wanted more involved, because I kept losing samples and had to figure out why.

Instead of having each entry in Analysis as a column, it's all as rows under 2 columns (analysis.Title, analysis.Result). Is it necessary that the information is presented this way instead of making each entry an individual column?

The way it is currently means that if a sequenced library isn't run through this analysis, it won't have an entry, not even a placeholder NA. So when I filtered for "Initial reads", a bunch of samples I wanted to include were lost from my table (yes, these were blanks, and I absolutely still need them).
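A minimal sketch of one way to keep libraries that have no Analysis entry: filter the long table first, then left-join the result onto the full list of libraries so that missing entries show up as NA. The `sequencing_id` column, the `libraries` table, and the toy values are hypothetical stand-ins, not actual sidora.core names or data.

```r
library(dplyr)
library(tibble)

# Hypothetical full list of libraries, including a blank with no analysis run.
libraries <- tibble(sequencing_id = c("lib_A", "lib_B", "blank_1"))

# Toy long-format analysis table; only lib_A and lib_B have entries.
analysis_long <- tribble(
  ~sequencing_id, ~analysis.Title, ~analysis.Result,
  "lib_A",        "Initial reads", "1000",
  "lib_A",        "Failed reads",  "50",
  "lib_B",        "Initial reads", "2000",
  "lib_B",        "Failed reads",  "75"
)

# Filtering alone would drop blank_1, because it has no row to match.
initial_reads <- analysis_long %>%
  filter(analysis.Title == "Initial reads") %>%
  select(sequencing_id, initial_reads = analysis.Result)

# Joining back onto the full library list keeps blank_1 with an NA value.
kept <- left_join(libraries, initial_reads, by = "sequencing_id")
```

Here `kept` still has one row per library, and the blank carries an NA instead of disappearing.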

I did notice that the entries under analysis.Title are not consistent. That comes from Pandora itself, and it is a problem. For example, GRG003.B0101.SG1.1.Human_Shotgun has:

Initial reads (forward+reverse): 
Failed reads (fwd+rev): 
Failed reads (fwd+rev) in %: 
Merged reads: 
Merged reads in %: 
Mapped reads (fwd+rev+merged): 
Mapped reads (fwd+rev+merged) in %: 
Mapped fragments: 
Mapped fragments in %: 
Mapped fragments (L>=30): 
Mapped fragments (L>=30) in %:

while GRG004.A0101.SG1.1.Human_Shotgun has:

Initial reads: 
Failed reads: 
Failed reads in %: 
Mapped reads/fragments: 
Mapped reads/fragments in %: 
Mapped reads/fragments (L>=30): 
Mapped reads/fragments (L>=30) in %:

Is that difference b/c the human shotgun screening pipeline changed? Can it be normalized across Pandora, so that you can make the entries columns like for the other tabs?

Concerning your first question: Changing the structure from this long to a wide format is something you could usually do easily e.g. with tidyr::pivot_wider().
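As a rough sketch of that long-to-wide reshaping, assuming the titles are consistent across libraries: the `sequencing_id` column and the toy values below are hypothetical, and only `analysis.Title` and `analysis.Result` are taken from this thread.

```r
library(tibble)
library(tidyr)

# Toy long-format table with one row per (library, analysis entry).
analysis_long <- tribble(
  ~sequencing_id, ~analysis.Title, ~analysis.Result,
  "lib_A",        "Initial reads", "1000",
  "lib_A",        "Failed reads",  "50",
  "lib_B",        "Initial reads", "2000",
  "lib_B",        "Failed reads",  "75"
)

# Each distinct title becomes its own column; each library becomes one row.
analysis_wide <- pivot_wider(
  analysis_long,
  names_from  = analysis.Title,
  values_from = analysis.Result
)
```

The resulting columns are named after the titles (for example `Initial reads`), so they need backticks when used in `select()`.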

Unfortunately the second issue you raise -- the inconsistency of the analysis titles -- makes exactly this transformation way more tricky.
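To illustrate why, here is a small sketch using two of the title variants quoted above. Pivoting as-is yields one column per distinct title, mostly filled with NA. A user-maintained lookup could harmonize titles first, but `title_map`, `sequencing_id`, and the toy values are hypothetical, and whether two variants really count the same thing is an assumption the user would have to verify.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Toy data mixing the two naming variants seen in this issue.
analysis_long <- tribble(
  ~sequencing_id, ~analysis.Title,                   ~analysis.Result,
  "lib_paired",   "Initial reads (forward+reverse)", "1000",
  "lib_paired",   "Failed reads (fwd+rev)",          "50",
  "lib_single",   "Initial reads",                   "2000",
  "lib_single",   "Failed reads",                    "75"
)

# Pivoting directly: four title columns, each library NA in the other's columns.
wide_raw <- pivot_wider(
  analysis_long,
  names_from  = analysis.Title,
  values_from = analysis.Result
)

# Hypothetical lookup from variant titles to a common name (an assumption,
# not a Pandora convention).
title_map <- c(
  "Initial reads (forward+reverse)" = "Initial reads",
  "Failed reads (fwd+rev)"          = "Failed reads"
)

# Replace mapped titles, leave unmapped titles unchanged, then pivot.
wide_harmonized <- analysis_long %>%
  mutate(
    analysis.Title = coalesce(unname(title_map[analysis.Title]), analysis.Title)
  ) %>%
  pivot_wider(names_from = analysis.Title, values_from = analysis.Result)
```

Collapsing titles this way discards the information the longer names carry, so it is only reasonable when the merged values are meant to be compared.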

Maybe @jfy133 or even @kaypruefer has some input on how this information could be standardized?

I think the analysis titles are inconsistent because of different versions of the pipeline, which in principle should be indicated at the Analysis level (not in the Results String). That said, I don't think Kay has officially announced that these pipelines are stable yet, so maybe that is the problem.

The two Analysis entries differ because the runs differed. One is paired end, the other is single read. I've chosen different naming conventions to make clear how exactly the reads are counted.

There are no conventions on what Analysis entries can look like, deliberately so. You'll have to check what type of Analysis was run and then understand the fields based on that type. Documentation for this is completely absent at the moment. I hope to change that, eventually.

I talked w/ @jfy133 this morning, and if you're going to keep the rows instead of columns format, I strongly suggest the difference be documented in the readme. Speaking from the perspective of someone who isn't familiar w/ how you've structured the tabs (any new user), this is a completely unexpected change. Since every other tab is formatted the same way, there's no way to know or reason to suspect that this one is different.

So going along pulling out what I need, all of my select() commands work until suddenly they fail for every possible analysis.xyz column, b/c that information now exists in rows of 2 columns with names that don't exist in Pandora. The user would either have to know in advance what those column names are, or open and look through the table beforehand.
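One way to look through the table beforehand, sketched on toy data: tabulate the distinct titles before selecting or filtering anything. The `sequencing_id` column and the values are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy long-format analysis table.
analysis_long <- tribble(
  ~sequencing_id, ~analysis.Title, ~analysis.Result,
  "lib_A",        "Initial reads", "1000",
  "lib_A",        "Failed reads",  "50",
  "lib_B",        "Initial reads", "2000",
  "lib_B",        "Failed reads",  "75"
)

# One row per distinct title, with the number of rows that carry it.
title_counts <- count(analysis_long, analysis.Title, name = "n_rows")
```

Titles that appear in only a few rows are a hint that different Analysis types or naming variants are mixed in the table.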

I suggest adding something like "The Analysis tab is formatted differently from the other tabs in sidora. Instead of each entry existing as a column, there are 2 columns, analysis.Title and analysis.Result, where the entries (i.e. Initial reads, Failed reads, etc.) and their values are the rows of these 2 columns."