Ethically_culturally_sensitive columns have incorrect/outdated entries
Opened this issue · 20 comments
All of the Ethically_Culturally_sensitive columns for my CMC data are wrong, where sample.Ethically_culturally_sensitive
is all NA, extract.Ethically_culturally_sensitive
is all FALSE, and library.Ethically_culturally_sensitive
is a combination of NA and FALSE, and all should be TRUE.
I just checked the Pandora website and they’re all correctly marked “Yes” there. Is sidora.core pulling old information?
When the 'Ethically/Culturally sensitive' option was added to Pandora, my samples were already in there and were automatically checked “No”, so at one point FALSE was a correct entry for all of them, but now they’re all “Yes” so everything should be TRUE. Since sample.Ethically_culturally_sensitive
is NA this was never correct, while extract.Ethically_culturally_sensitive
and library.Ethically_culturally_sensitive
as FALSE are outdated. I think the NAs in library.Ethically_culturally_sensitive
come from extracts that were never built into libraries so those are probably ok
Uhh that's odd. No, it shouldn't be out of date as it should always pull the data fresh, unless you are using a cache... When did you update the entries in Pandora?
library(magrittr)
con <- sidora.core::get_pandora_connection()
samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
sidora.core::join_pandora_tables()
samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::select(sample.Ethically_culturally_sensitive)
It's your type conversion, James. Look at this:
con <- sidora.core::get_pandora_connection()
samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Sample")), con = con) %>%
sidora.core::join_pandora_tables()
ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(sample.Id)
sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)
@ivelsko So @jfy133 and I probably found the problem: We have a function which transforms column types to represent them correctly in R: sidora.core:::enforce_types()
.
Apparently there seem to be two different ways of encoding logicals in Pandora. The column Ethically_culturally_sensitive
is encoded with Yes
/No
which can not be automatically converted with as.logical
.
I'll try to provide a quick fix now and then leave it to @jfy133 in our next hackhour to check the columns Robot
and Title
😄
I installed the latest version and checked again and it's partially fixed - the Ethically_culturally_sensitive
column for sample
now has the correct TRUE entries, but for extract
, library
, capture
, and sequencing
the entries are still incorrectly FALSE
So: The values are FALSE, not NA, ja? What value do you see in Pandora? I would be really surprised if it is "Yes".
Yep, they're FALSE. In Pandora it says "Yes" in each of those 4 tabs, but it's in a grayed-out text box that's automatically populated down from checking the "Yes" button on the Individual tab. I don't have any control on changing that entry on any tab except Individual
I'm confused now. Please take a look at the result of this query:
library(magrittr)
con <- sidora.core::get_pandora_connection()
samps <- sidora.core::get_df_list(sidora.core::make_complete_table_list(c("TAB_Site", "TAB_Extract")), con = con) %>%
sidora.core::join_pandora_tables()
ids <- samps %>% dplyr::filter(site.Site_Id == "CMC") %>% dplyr::pull(extract.Id)
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::filter(Id %in% ids) %>% dplyr::select(Ethically_culturally_sensitive)
DBI::dbDisconnect(con)
Two things:
-
The
Ethically_culturally_sensitive
column in theextract
table seems to be of type integer. So that's why the conversion to boolean fails. It's very bad that the same type of information is encoded in different ways in different tables and we should discuss this with the Pandora devs (who do we have to contact, @stschiff & @jfy133? Or could this still be caused by the R interface?) -
The query above yields
0
s for the extracts from site CMC. Why not1
s?
I just checked, and it is only Sample that is problematic. The others are all 0,1.
But indeed, I have no idea what's going on with CMC... the only thing I can thing of is that there is some wierd propagation thing for the web UI only, where if you say at e.g. sample library something is sensitive that that is 'applied' everything downstream for display (with the assumption people shouldn't look further down?).
We should write an email to Robert F. about it though, as I'm completely lost... I'll put a reminder tomorrow.
My code (after your con set up above):
sidora.core::get_con("TAB_Sample", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Library", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Capture", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Sequencing", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Ethic"))
sidora.core::get_con("TAB_Extract", con) %>% tibble::as_tibble() %>% dplyr::select(dplyr::contains("Id"), dplyr::contains("Ethic")) %>% dplyr::filter(Ethically_culturally_sensitive == 1) %>% print(n = 30)
Alright. 0
/1
can be converted automatically by as.logical
. So the column in TAB_Sample should be transformed, the others can stay as they are.
The CMC mismatch is really curious. Irina confirmed that the info both in the webinterface and in the csv export is correct. Only we're getting the wrong data.
Yeah that makes me very nervous. I'll try and ask Robert tomorrow
Ok, the Extract level 1
s were mistakes. This is now fixed.
Only Sample defines this. So I think @nevrome we need to have a different system of loading stuff now, where we internally find that information already and remove those at data loading. It's ugly but I'm not sure how we can deal with that unfortunately.
We had a little discussion with the sidora dev team. Having to refer to the Sample table every time we want to exclude sensitive samples is very computationally intensive.
We will ask Robert to make all those columns the same type, and then propagate this downstream. Then we set our functions to by default remove any row with a sensititve column is true, UNLESS someone requests it purposefully.
Any public version of the API will always remove these, even if published data.
This is almost ok, but I see an issue with it. For calculus/microbiome samples the only sensitive part is the human reads, and those shouldn't be analyzed by anyone. The rest of the data is fine for anyone to use, so the whole dataset shouldn't be removed for searching in this case.
It's really a problem with what the Ethically/culturally sensitive check box is supposed to indicate, which is that the data shouldn't be automatically screened/processed by the pipeline here. In some cases (ALC/FPA, for example) the whole sample and all downstream parts should be excluded, but for CMC only part of the data is off limits
The problem is that the final data that is recorded in the raw data tab is the entire sequencing data, not filtered. We can then not guaruntee that people won't use it anyway, or just skip checking the 'sample' tab entirely. Secondly, with that box we don't necessarily specify what is actually sensitive about it. To find it, they would have to speak to the person.
That's why there would be an option to not filter it out (this is what I mean, 'by default' - maybe I wasn't clear), but you would only turn that on if you knew what you were looking ofr, and you are also aware you will have to contact other pepole to find out if you can use it.
Quick update on this in February 2022:
sample.Ethically_culturally_sensitive
, so the iteration of theEthically_culturally_sensitive
column inTAB_Sample
is still the only one relevant. All other columns downstream are still only empty dummies used for the UI. Maybe we should explicitly remove them in the download process.- Automatic processing in the autorun pipeline excludes sensitive samples, afaik.
Do you mean explicitly removing all downstream entries stemming from samples marked as ethically/culturally sensitive? I'm for that.
If you mean removing the other columns that repeat ethically/culturally sensitive at lower levels (extract, library, etc) then I'd prefer those be left in b/c at least it's something to indicate they shouldn't be worked with. Particularly if someone only downloads a specific table and they won't see that column from the sample table.
Autorun does exclude the sensitive samples, as long as they actually were tagged that way. There were a bunch that should have been and never were that I caught, but possibly there are entries by other people, maybe some who've left, that also weren't marked and should be.
I mean removing the columns 🤔. They are not reliably filled with information, so they can not be used for any decisions. Only the TAB_Sample
one is, to my understanding.
Let's think about other ways to raise awareness of this issue. We could add a startup message to the package. Whenever somebody loads the library the first time in a session with library(sidora.core)
we could print a warning. Many packages do this.
For example:
Warning: Pandora features many samples that are ethically and culturally sensitive and are therefore not available for many analyses.
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application!
I'm sure you have a better phrasing in you, @ivelsko.
For downstream applications/scripts like the ones by @TCLamnidis, this message can be suppressed with suppressPackageStartupMessages(library(sidora.core))
. Not so sure about calls with ::
, though...
I like the idea of printing a warning. Can you print it with a big red ASCII-art yield sign? I tend to ignore these messages that come up at the start unless there's some image that catches my eye...
I think the text is pretty good, I'd suggest only some small changes
Warning: Pandora contains many samples that are ethically and/or culturally sensitive and are therefore off-limits for many analyses.
They are indicated in the column Ethically_culturally_sensitive of TAB_Sample. Make sure to exclude these samples depending on your application and permissions!