ebi-ait/ingest-graph-validator

Add new rules for cross-entity validation which ensure data is appropriately grouped for processing by the HCA DCP standard analysis pipelines

Closed this issue · 7 comments

Refer to this ticket for more information and a discussion of the rules.

Rules:

  • If a protocol is paired-end, expect at least 2 files per set
  • If the protocol is 10x, expect at least 2 files per set. (Based on what we've been told, we have to modify one of the already implemented rules, which currently expects 3-4 files for 10x.)
  • If the protocol is not 10x or SS2, expect the following information about reads:
    • Single-end, paired-end
    • Barcode/read offset (UMI and cell barcode)
    • Barcode/read length (UMI and cell barcode)
    • Barcode file, read file indicated
  • The same file is not linked to 2 different sets
  • If more than one R1/R2/I1/I2 in a set, ensure lane index is specified
  • A sequencing and library preparation protocol is linked to the files
  • Confirm that information from the donor is linked correctly (This one isn't too clear and we couldn't come up with a defined requirement in the meeting). See the comment below for an explanation.
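
For discussion, here is a minimal Python sketch of three of the file-set rules above (at least 2 files per paired-end set, no file shared between sets, lane index required when a read index repeats within a set). It is not the validator's Cypher code. The `file_sets` structure, the `read_index` and `lane_index` keys, and the function name are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def check_file_sets(file_sets, paired_end=True):
    """Return a list of rule violations as strings.

    file_sets: hypothetical mapping of set id -> list of file dicts with the
    keys 'name', 'read_index' (e.g. 'read1', 'index1') and, optionally,
    'lane_index'.
    """
    errors = []
    owners = defaultdict(set)  # file name -> ids of the sets it appears in
    for set_id, files in file_sets.items():
        # Rule: a paired-end protocol needs at least 2 files per set.
        if paired_end and len(files) < 2:
            errors.append(f"{set_id}: paired-end set has fewer than 2 files")
        counts = Counter(f["read_index"] for f in files)
        for f in files:
            owners[f["name"]].add(set_id)
            # Rule: a repeated R1/R2/I1/I2 in a set requires a lane index.
            if counts[f["read_index"]] > 1 and f.get("lane_index") is None:
                errors.append(f"{set_id}: {f['name']} needs a lane index")
    # Rule: the same file must not be linked to 2 different sets.
    for name, sets in owners.items():
        if len(sets) > 1:
            errors.append(f"{name}: linked to more than one set")
    return errors
```

An empty list means the sets pass these three checks; the remaining rules (barcode offsets and lengths, protocol links) would need the protocol metadata and are not covered here.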

@ESapenaVentura @rays22 on the last one, did they explain what donor information they needed?

They commented on species (e.g. if the fastq belongs to a human donor it should be human) but I don't think they meant that they needed specific info from that donor, just that it's linked appropriately.

Thanks. So confirming that the sequence is from the species the submitter claimed is a QC check that they would need to do themselves, if they wanted it.

What biomaterial checks do you already do on the graph?

@lauraclarke
Existing rules related to biomaterials:

  • Test: No biomaterials are disconnected
  • Test: Donors are not derived
  • Test: Donor to file path must have at least 5 nodes
  • Test: Sequence files link to cell suspension. The biomaterial linked to the process linked to a sequence file must be a cell suspension.
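
To make the "donor to file path" rule concrete, here is a rough Python sketch. It assumes the rule is measured on the shortest directed path, which this thread does not confirm, and it uses a plain edge list in place of the real graph database; all function and node names are hypothetical.

```python
from collections import deque


def shortest_path_nodes(edges, start, goal):
    """Count the nodes on the shortest directed path, or None if unreachable."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    queue = deque([(start, 1)])  # (node, nodes on the path so far)
    seen = {start}
    while queue:
        node, count = queue.popleft()
        if node == goal:
            return count
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, count + 1))
    return None


def donor_to_file_path_ok(edges, donor, sequence_file, min_nodes=5):
    """True if the donor reaches the file through at least min_nodes nodes."""
    count = shortest_path_nodes(edges, donor, sequence_file)
    return count is not None and count >= min_nodes
```

A donor linked to a file through a single process (3 nodes) would fail, while donor, process, specimen, process, cell suspension, process, file (7 nodes) would pass.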

I have pushed my part of the new test to branch iss1270 to be merged after review later.

Comments:

  • The rule

Confirm that information from the donor is linked correctly (This one isn't too clear and we couldn't come up with a defined requirement in the meeting)

does not seem to fit the graph validator well. As I understand the requirement, it applies when there is more than one donor organism species, or when the donor organism is a chimaera of cells from different organisms (e.g. mouse and human). All the tests in this set are PASS/FAIL tests, but this one should yield only a warning that prompts further checking of the sequence data files to identify the species of the cell source.
For these reasons, this rule was not implemented.

  • Added an extra test to ensure that orphan nodes do not exist in a dataset graph.
    There are already tests that check specific node types for missing connections, but this one is generic: it catches all orphan nodes. I believe orphan nodes must not exist in a dataset graph.
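
As an illustration of what the generic orphan test checks, here is a small Python sketch. It is not the Cypher query on the branch; the node and edge representation and the function name are assumptions, and self-loops are deliberately not counted as connections.

```python
def find_orphans(nodes, edges):
    """Return, sorted, every node that takes part in no edge to another node."""
    connected = set()
    for src, dst in edges:
        if src != dst:  # a self-loop does not attach a node to the graph
            connected.add(src)
            connected.add(dst)
    return sorted(n for n in nodes if n not in connected)
```

For example, with nodes `donor`, `specimen` and `stray_file` and a single edge from `donor` to `specimen`, only `stray_file` is reported.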

@ESapenaVentura

  • We still need to merge the different GitHub branches containing the new Cypher validation code into the master branch or an appropriate fork.

Complete.