poseidon-framework/poseidon-framework.github.io

Column specifications and content


Concerning https://github.com/poseidon-framework/poseidon2-schema/blob/master/janno_columns.tsv

1. Publication Status

Publication_Status bibtex key (e.g. "@AuthorJournalYear") or "unpublished"

It seems that if the Publication_Status value actually starts with an @, either the bibtex validator doesn't accept a key containing that character, or the poseidon validator prints the following error:

! The .bib file does not contain the literature in the janno file or the bibtex keys are different
! This seems to be a valid package, but some things are fishy.

Removing the @ from the .janno file fixed the issue, but it seems like an update is needed, either to the validator (strip leading @) or to the content explanation for the field.
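For illustration, the "strip leading @" option could look roughly like the sketch below. This is not the actual validator code; the function names `normalize_bibtex_key` and `missing_keys` are hypothetical, and it assumes the Publication_Status values and the .bib keys are already available as plain lists of strings.

```python
def normalize_bibtex_key(value: str) -> str:
    """Strip surrounding whitespace and any leading '@' characters (hypothetical helper)."""
    return value.strip().lstrip("@")


def missing_keys(janno_values, bib_keys):
    """Return the normalized janno keys that have no match among the .bib keys.

    Assumption: 'unpublished' is the only non-key value allowed in the column.
    """
    known = {normalize_bibtex_key(k) for k in bib_keys}
    missing = []
    for value in janno_values:
        key = normalize_bibtex_key(value)
        if key == "unpublished" or key in known:
            continue
        missing.append(key)
    return missing


if __name__ == "__main__":
    janno = ["@AuthorJournalYear", "unpublished", "Other2020"]
    bib = ["AuthorJournalYear"]
    print(missing_keys(janno, bib))
```

With this kind of normalization, "@AuthorJournalYear" in the .janno file and "AuthorJournalYear" in the .bib file would be treated as the same key, so either spelling would pass.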

2. mtContamination error

mtContam_stderr Standard error of ContamMix/Schmutzi estimate

ContamMix doesn't actually return a stderr, but a 95% confidence interval instead, which makes the error around the mode asymmetric. In my own package I have reported the largest difference between the MAP estimate and the edges of the 95% confidence interval, but that can be somewhat misleading. It would be good to either allow people to specify the mtDNA contamination error as an interval or as two fields with the min and max of the CI (which can be done for both stderr and 95% CI), or give clear instructions on how one should report a 95% CI here.

@AyGhal, @stschiff, @wolfgangaroo: The second issue raised here by @TCLamnidis is unfortunately not the same as the problem we discussed in poseidon-framework/poseidon-schema#9 and solved with the code here. The Reich Lab .anno file does not report mtContamination, so this hasn't come up yet.

I wonder how this was solved by the people preparing data so far?

This is frustrating. We chose to use a standard error, but that turns out not to be ideal for either the nuclear contamination or the MT contamination. Put simply, there are two options:

  1. Change to confidence interval notation (lots of work for existing packages)
  2. Use the same hack as in the nuclear contamination case and simply report the larger "radius" from mean to confidence interval limit as the stderr.

I would lean towards the second solution.
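To make the second option concrete, here is a minimal sketch of the "larger radius" calculation. The function name `max_radius` and its parameter names are made up for this example, and the result is only a stand-in for the mtContam_stderr field, not a true standard error.

```python
def max_radius(estimate: float, ci_lower: float, ci_upper: float) -> float:
    """Return the larger distance from the point estimate to either CI limit.

    Hypothetical helper: `estimate` is the reported point estimate (e.g. the MAP),
    and `ci_lower` / `ci_upper` are the limits of the 95% confidence interval.
    """
    if not ci_lower <= estimate <= ci_upper:
        raise ValueError("estimate must lie within [ci_lower, ci_upper]")
    return max(estimate - ci_lower, ci_upper - estimate)


if __name__ == "__main__":
    # Asymmetric interval: the upper limit is further from the estimate.
    print(max_radius(0.02, 0.01, 0.05))
```

For an estimate of 0.02 with a 95% CI of 0.01 to 0.05, the two radii are 0.01 and 0.03, so roughly 0.03 would be reported.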

I'm leaning with you.

As both @stschiff and @wolfgangaroo are leaning towards a solution that involves no changes to the standard, but only to the documentation, I moved it here into the homepage repository.