nfdi4plants/Swate-templates

[BUG] Found multiple iterations of the same tags in Swate

Freymaurer opened this issue · 5 comments

This image led me to open this issue:

image

In this image you can see 3 different PRIDE tags: one as Tag, two as ER_Tag. One of the ER_Tags has an ID, the other does not.

To track these down, I ran some very simple analytics (results below). It would be nice if someone could clean this up 😄

Found ambiguous tag growth in:

  • Bacterial growth conditions by (Viktoria Petrova)
  • Growth protocol for Study file MIAPPE by (Julie Jacquemin)

Found ambiguous tag Plant in:

  • MAdLand Sample information by (Fabian Haas)
  • Growth chamber by (Dominik Brilhaus)
  • Plant Source Material by (Dominik Brilhaus)

Found ambiguous tag study in:

  • Study minimal MPIMP Fernie by (Micha Wijesingha Ahchige)
  • MIAPPE metadata by (Hannah Dörpholz, Elisa Senger, Stella Eggels)

Found ambiguous tag Proteomics in:

  • RPTU - MBS, growth by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, growth TurboID by (Frederik Sommer, Martin Kuhl, Oliver Maus, David Zimmer)
  • RPTU - MBS, cell disruption by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, data processing by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, mass spectrometry by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, protein extraction by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, sample preparation by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, protein standard preparation by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • Protein extraction by (Oliver Maus, Dominik Brilhaus)
  • Proteomics MassSpec Assay by (Oliver Maus)
  • Proteomics Computational Analyses by (Oliver Maus)
  • Data Processing (PRIDE minimal) by (Oliver Maus)
  • Measurement (PRIDE minimal) by (Oliver Maus)
  • Sample Preparation (PRIDE minimal) by (Oliver Maus)

Found ambiguous tag PRIDE in:

  • RPTU - MBS, growth by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, growth TurboID by (Frederik Sommer, Martin Kuhl, Oliver Maus, David Zimmer)
  • RPTU - MBS, cell disruption by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, data processing by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, mass spectrometry by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, protein extraction by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, sample preparation by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • RPTU - MBS, protein standard preparation by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • Protein extraction by (Oliver Maus, Dominik Brilhaus)
  • Proteomics MassSpec Assay by (Oliver Maus)
  • Proteomics Computational Analyses by (Oliver Maus)
  • Data Processing (PRIDE minimal) by (Oliver Maus)
  • Measurement (PRIDE minimal) by (Oliver Maus)
  • Sample Preparation (PRIDE minimal) by (Oliver Maus)

Found ambiguous tag Transcriptomics in:

  • RNASeq Assay by (Hajira Jabeen, Dominik Brilhaus)
  • RNASeq Computational Analysis by (Hajira Jabeen, Dominik Brilhaus, Oliver Maus, Martin Kuhl)
  • GEO - Minimal information RNA assays by (Martin Kuhl)
  • GEO - Minimal information computational analysis by (Martin Kuhl)

Code

#r "nuget: ARCtrl, 1.0.7"
// Run as an F# script (.fsx file)

// Fetch the templates via ARCtrl
let templates = 
  ARCtrl.Template.Web.getTemplates None |> Async.RunSynchronously

// Collect every distinct tag (ontology annotation) used across the templates
let distinctTags = ARCtrl.Template.Templates.getDistinctOntologyAnnotations (templates)
distinctTags.Length // 110

// Group the distinct tags by their display name
let groupedByName = distinctTags |> Array.groupBy (fun oa -> oa.NameText)
groupedByName.Length // 104

// A name is ambiguous if more than one distinct tag shares it
let ambiguousTags = groupedByName |> Array.filter (fun (_, c) -> c.Length > 1)
ambiguousTags.Length // 6

// Print a markdown report of the templates that use each ambiguous tag
for (name,tags) in ambiguousTags do
  let temps = ARCtrl.Template.Templates.filterByOntologyAnnotation (tags) templates
  printfn "## Found ambiguous tag `%s` in:" name
  for template in temps do 
    // Join first name, middle initials and last name; missing parts become empty strings
    let authors = 
      template.Authors 
      |> Array.map (fun a -> 
        let names = [|a.FirstName; a.MidInitials; a.LastName|] |> Array.map (fun n -> Option.defaultValue "" n)
        sprintf "%s %s %s" names.[0] names.[1] names.[2]
      ) 
      |> String.concat ", "
    printfn "- **%s** by (*%s*)" (template.Name.Trim()) (authors.Trim())

So the solution is adding an accession number to every tag?
Another issue is tags that are not identical but similar. E.g. there is Plant, plant and Plants. I think adding accession numbers could also help here. I had been planning to discuss this in our upcoming meeting.

image

(For me the same tags are shown for ER and normal tags, but I guess that has been fixed already.)
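The Plant / plant / Plants case mentioned above could be approached in two steps. A minimal sketch of the first step, assuming plain tag names in an array (the `tagNames` values are illustrative, only the three plant variants come from this thread): group names that differ only by case or surrounding whitespace.

```fsharp
// Illustrative tag names; only Plant, plant and Plants are taken from this thread.
let tagNames = [| "Plant"; "plant"; "Plants"; "PRIDE" |]

// Group names that differ only by case or surrounding whitespace.
let caseVariants =
    tagNames
    |> Array.groupBy (fun name -> name.Trim().ToLowerInvariant())
    |> Array.filter (fun (_, variants) -> variants.Length > 1)

for (key, variants) in caseVariants do
    printfn "%s: %s" key (String.concat ", " variants)
// prints "plant: Plant, plant"
```

Note that Plants is not caught this way, so a plural or near-duplicate would still need a fuzzy comparison or an accession number to be recognised as the same tag.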

Another issue is tags that are not identical but similar. E.g. there is Plant, plant and Plants

I think this is a valid point. I am thinking about adding a quality control CI for pull requests which runs the code I used for my two issues today, plus a similarity test for similar words. Then, before merging any PR, one could see whether these points are handled somewhat correctly.

What do you think about this? It would add another test to this:

image
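A rough sketch of how such a check could report its result to CI, assuming a simplified `Tag` record as a stand-in for the real ontology annotation type (the record, the function names and the accession value are all hypothetical, not ARCtrl API):

```fsharp
// Hypothetical stand-in for an ontology annotation; not the ARCtrl type.
type Tag = { Name: string; Accession: string option }

// Names that occur with more than one distinct annotation.
let findAmbiguous (tags: Tag list) =
    tags
    |> List.distinct
    |> List.groupBy (fun t -> t.Name)
    |> List.filter (fun (_, group) -> group.Length > 1)
    |> List.map fst

// Exit code a CI step could propagate: 0 = clean, 1 = ambiguous tags found.
let exitCodeFor (tags: Tag list) =
    match findAmbiguous tags with
    | [] -> 0
    | names ->
        names |> List.iter (eprintfn "Ambiguous tag: %s")
        1

let sample =
    [ { Name = "PRIDE"; Accession = None }
      { Name = "PRIDE"; Accession = Some "EXAMPLE:0000001" } // made-up accession
      { Name = "Proteomics"; Accession = None } ]

printfn "exit code: %d" (exitCodeFor sample)
// prints "exit code: 1"
```

In an actual script the last line would be something like `exit (exitCodeFor tags)`, so that a non-zero exit code fails the pull request check.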

Sounds good to me

I will start adding tag term accession numbers to the ambiguous terms from your check.

The first iteration of fixes went through, so I am going to update the current state here. Please note that we now also test for similar tags. If you find a combination to be a true difference (which can be very likely), please notify me below, so I can either increase the similarity threshold or whitelist a specific combination. The current similarity threshold is 0.8.

Edit: I will try to improve the script so the output is less split.
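For readers wondering how a threshold plus whitelist could fit together, here is a minimal sketch. The metric is an assumption: the thread does not name it, but the scores listed here are consistent with a Sørensen-Dice coefficient over distinct, case-insensitive character bigrams (e.g. 2 · 13 / 32 = 0.8125 for growth protocol vs plant growth protocol). The `whitelist` entry and the function names are hypothetical.

```fsharp
// Distinct lower-cased character bigrams of a string.
let bigrams (s: string) =
    s.ToLowerInvariant()
    |> Seq.pairwise
    |> Set.ofSeq

// Sørensen-Dice coefficient over bigram sets (assumed metric, see lead-in).
let similarity (a: string) (b: string) =
    let x, y = bigrams a, bigrams b
    if x.IsEmpty && y.IsEmpty then
        // Strings shorter than two characters have no bigrams; compare them directly.
        if a.ToLowerInvariant() = b.ToLowerInvariant() then 1.0 else 0.0
    else
        2.0 * float (Set.intersect x y).Count / float (x.Count + y.Count)

let threshold = 0.8

// Hypothetical whitelist of pairs accepted as genuinely different tags.
let whitelist = set [ ("growth protocol", "plant growth protocol") ]

// Flag non-identical pairs at or above the threshold unless whitelisted in either order.
let isFlagged (a: string) (b: string) =
    a <> b
    && similarity a b >= threshold
    && not (Set.contains (a, b) whitelist || Set.contains (b, a) whitelist)

printfn "%f" (similarity "growth protocol" "plant growth protocol") // 0.812500
printfn "%f" (similarity "Extraction" "extraction")                 // 1.000000
printfn "%b" (isFlagged "extraction protocol" "RNA extraction protocol") // true
printfn "%b" (isFlagged "growth protocol" "plant growth protocol")       // false (whitelisted)
```

Raising `threshold` reduces false positives globally, while a whitelist entry silences one specific pair without hiding other near-duplicates.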

Found similiar tags for plant growth protocol in:

  • growth protocol [0.812500] Growth chamber by (Dominik Brilhaus)
  • growth protocol [0.812500] MIAPPE observation unit and sample by (Hannah Dörpholz, Elisa Senger, Stella Eggels)
  • growth protocol [0.812500] RPTU - MBS, growth by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • growth protocol [0.812500] RPTU - MBS, growth TurboID by (Frederik Sommer, Martin Kuhl, Oliver Maus, David Zimmer)
  • growth protocol [0.812500] GEO - Minimal information plant growth by (Martin Kuhl)

Found similiar tags for BioImageArchive in:

  • BioImageArchive_Imaging [0.848485] Imaging assay by (Christine Rempfer)

Found similiar tags for extraction in:

  • Extraction [1.000000] DNA extraction by (Angela Kranz, Dominik Brilhaus)
  • Extraction [1.000000] Imaging extraction by (Chistine Rempfer)
  • Extraction [1.000000] Imaging computation by (Chistine Rempfer)
  • Extraction [1.000000] GEO - Minimal information RNA extraction by (Martin Kuhl)

Found similiar tags for RNA extraction protocol in:

  • extraction protocol [0.900000] Metabolite Extraction by (Dominik Brilhaus, Martin Kuhl)
  • extraction protocol [0.900000] RPTU - MBS, cell disruption by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • extraction protocol [0.900000] RPTU - MBS, protein extraction by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • extraction protocol [0.900000] GEO - Minimal information RNA extraction by (Martin Kuhl)

Found similiar tags for extraction protocol in:

  • RNA extraction protocol [0.900000] RNA extraction by (Hajira Jabeen, Dominik Brilhaus)

Found similiar tags for Extraction in:

  • extraction [1.000000] RNA extraction by (Hajira Jabeen, Dominik Brilhaus)
  • extraction [1.000000] Protein extraction by (Oliver Maus, Dominik Brilhaus)

Found similiar tags for Assay in:

  • assay [1.000000] Phenotyping protocol for Assay file MIAPPE by (Julie Jacquemin)
  • assay [1.000000] Sampling protocol for Assay file MIAPPE by (Julie Jacquemin)

Found similiar tags for Mass Spectrometry in:

  • Mass spectrometry [1.000000] Proteomics MassSpec Assay by (Oliver Maus)
  • Mass spectrometry [1.000000] Data Processing (PRIDE minimal) by (Oliver Maus)
  • Mass spectrometry [1.000000] Measurement (PRIDE minimal) by (Oliver Maus)
  • Mass spectrometry [1.000000] Sample Preparation (PRIDE minimal) by (Oliver Maus)

Found similiar tags for observation unit in:

  • Observation Unit [1.000000] MIAPPE observation unit and sample by (Hannah Dörpholz, Elisa Senger, Stella Eggels)

Found similiar tags for Measurement in:

  • measurement [1.000000] MAdLand Nanodrop by (Fabian Haas)

Found similiar tags for data processing protocol in:

  • Data processing [0.848485] Proteomics Computational Analyses by (Oliver Maus)

Found similiar tags for study in:

  • study [0.888889] Aerial conditions protocol for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Characteristics for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Event protocol for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Growth protocol for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Nutrition protocol for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Rooting protocol for Study file MIAPPE by (Julie Jacquemin)
  • study [0.888889] Watering protocol for Study file MIAPPE by (Julie Jacquemin)

Found similiar tags for Data processing in:

  • data processing protocol [0.848485] Metabolomics Computational Analysis by (Dominik Brilhaus, Oliver Maus, Martin Kuhl)
  • data processing protocol [0.848485] RPTU - MBS, data processing by (Frederik Sommer, Martin Kuhl, Oliver Maus)
  • data processing protocol [0.848485] GEO - Minimal information computational analysis by (Martin Kuhl)

Found similiar tags for Mass spectrometry in:

  • Mass Spectrometry [1.000000] Metabolomics MassSpec Assay by (Dominik Brilhaus, Martin Kuhl)
  • Mass Spectrometry [1.000000] Metabolomics Computational Analysis by (Dominik Brilhaus, Oliver Maus, Martin Kuhl)
  • Mass Spectrometry [1.000000] MTH00029 by (Dominik Brilhaus)
  • Mass Spectrometry [1.000000] MPIMP - Fernie, mass spectrometry by (Micha Wijesingha Ahchige)

Found similiar tags for growth protocol in:

  • plant growth protocol [0.812500] Plant growth by (Hajira Jabeen, Dominik Brilhaus, Oliver Maus, Martin Kuhl, Xiaoran Zhou)
  • plant growth protocol [0.812500] Study minimal MPIMP Fernie by (Micha Wijesingha Ahchige)

Found similiar tags for BioImageArchive_Imaging in:

  • BioImageArchive [0.848485] Imaging extraction by (Chistine Rempfer)
  • BioImageArchive [0.848485] Imaging computation by (Chistine Rempfer)

Found similiar tags for Observation Unit in:

  • observation unit [1.000000] MIAPPE biological material by (Hannah Dörpholz, Elisa Senger, Stella Eggels)

Found similiar tags for measurement in:

  • Measurement [1.000000] Proteomics MassSpec Assay by (Oliver Maus)
  • Measurement [1.000000] Measurement (PRIDE minimal) by (Oliver Maus)

Found similiar tags for phenotyping in:

  • phenotyping [0.952381] Phenotyping protocol for Assay file MIAPPE by (Julie Jacquemin)

Found similiar tags for assay in:

  • Assay [1.000000] Proteomics MassSpec Assay by (Oliver Maus)
  • Assay [1.000000] Genomics Assay by (Angela Kranz, Dominik Brilhaus)
  • Assay [1.000000] Imaging assay by (Christine Rempfer)
  • Assay [1.000000] Genome assembly by (Angela Kranz, Dominik Brilhaus, Oliver Maus)

Found similiar tags for phenotyping in:

  • phenotyping [0.952381] Phenotyping protocol for Assay file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Sampling protocol for Assay file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Aerial conditions protocol for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Characteristics for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Event protocol for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Growth protocol for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Nutrition protocol for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Rooting protocol for Study file MIAPPE by (Julie Jacquemin)
  • phenotyping [0.952381] Watering protocol for Study file MIAPPE by (Julie Jacquemin)

Found similiar tags for study in:

  • study [0.888889] MIAPPE metadata by (Hannah Dörpholz, Elisa Senger, Stella Eggels)
  • study [0.888889] MIAPPE observation unit and sample by (Hannah Dörpholz, Elisa Senger, Stella Eggels)
  • study [0.888889] Study minimal MPIMP Fernie by (Micha Wijesingha Ahchige)