ISA-tools/isa-api

Are Multiple "<entity> Name" Columns Allowed?

ptth222 opened this issue · 3 comments

I am trying to figure out how to create valid ISA-Tab/ISA-JSON with a more complex sample lineage than what the examples show. The examples are pretty much just source -> sample -> extract, but what about something like source -> sample1 -> sample2 -> extract? Is this allowed or do you have to reduce things down to 1 sample?

The last sentence here https://isa-specs.readthedocs.io/en/latest/isatab.html#study-table-file suggests to me that it should be possible:
"Node properties, such as Characteristics (for Material nodes), Parameter Value (for Process nodes) and additional Name columns for special cases of Process node to disambiguate Protocol REF entries of the same type, MUST follow the named node of context."
What does "additional Name columns" refer to if not additional Sample Name columns?

Additionally, I can modify a JSON example and convert it to Tab so that it produces a study file with an additional Sample Name column, but converting a modified Tab file with an additional Sample Name column back to JSON raises an error.

Code to modify the BII-I-1.json example and convert to Tab:

import json

from isatools.convert import json2isatab

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/json/BII-I-1/BII-I-1.json', 'r') as jsonFile:
    isa_example = json.load(jsonFile)
    
samples = []
for sample in isa_example["studies"][0]["materials"]["samples"]:
    samples.append({"@id":sample["@id"]})

growth_protocol2 = {
          "@id": "#protocol/growth_protocol_2",
          "components": [],
          "description": "1. Biomass samples (45 ml) were taken via the sample port of the Applikon fermenters. The cells were pelleted by centrifugation for 5 min at 5000 rpm. The supernatant was removed and the RNA pellet resuspended in the residual medium to form a slurry. This was added in a dropwise manner directly into a 5 ml Teflon flask (B. Braun Biotech, Germany) containing liquid nitrogen and a 7 mm-diameter tungsten carbide ball. After allowing evaporation of the liquid nitrogen the flask was reassembled and the cells disrupted by agitation at 1500 rpm for 2 min in a Microdismembranator U (B. Braun Biotech, Germany) 2. The frozen powder was then dissolved in 1 ml of TriZol reagent (Sigma-Aldrich, UK), vortexed for 1 min, and then kept at room temperature for a further 5min. 3. Chloroform extraction was performed by addition of 0.2 ml chloroform, shaking vigorously or 15 s, then 5min incubation at room temperature. 4. Following centrifugation at 12,000 rpm for 5 min, the RNA (contained in the aqueous phase) was precipitated with 0.5 vol of 2-propanol at room temperature for 15 min. 5. After further centrifugation (12,000 rpm for 10 min at 4 C) the RNA pellet was washed twice with 70 % (v/v) ethanol, briefly air-dried, and redissolved in 0.5 ml diethyl pyrocarbonate (DEPC)-treated water. 6. The single-stranded RNA was precipitated once more by addition of 0.5 ml of LiCl buffer (4 M LiCl, 20 mM Tris-HCl, pH 7.5, 10 mM EDTA), thus removing tRNA and DNA from the sample. 7. After precipitation (20 C for 1h) and centrifugation (12,000 rpm, 30 min, 4 C), the RNA was washed twice in 70 % (v/v) ethanol prior to being dissolved in a minimal volume of DEPC-treated water. 8. Total RNA quality was checked using the RNA 6000 Nano Assay, and analysed on an Agilent 2100 Bioanalyser (Agilent Technologies). RNA was quantified using the Nanodrop ultra low volume spectrophotometer (Nanodrop Technologies).",
          "name": "growth protocol 2",
          "parameters": [],
          "protocolType": {
            "annotationValue": "growth",
            "termAccession": "",
            "termSource": ""
          },
          "uri": "",
          "version": ""
        }
isa_example["studies"][0]["protocols"].append(growth_protocol2)

new_process = {
          "@id": "#process/growth_protocol_2_1",
          "comments": [],
          "date": "",
          "executesProtocol": {
            "@id": "#protocol/growth_protocol_2"
          },
          "inputs": [
            {
              "@id": "#sample/sample-E-0.07-aliquot1"
            }
          ],
          "outputs": [
            {
              "@id": "#sample/sample-E-0.07-aliquot1_1"
            }
          ],
          "parameterValues": [],
          "performer": "",
          "previousProcess": {"@id": "#process/growth_protocol13"}
        }
isa_example["studies"][0]["processSequence"].append(new_process)

new_sample = {
            "@id": "#sample/sample-E-0.07-aliquot1_1",
            "characteristics": [
              {
                "category": {
                  "@id": "#characteristic_category/Material_Type"
                },
                "value": {
                  "annotationValue": "internal",
                  "termAccession": "",
                  "termSource": ""
                }
              }
            ],
            "derivesFrom": [
            ],
            "factorValues": [
              {
                "category": {
                  "@id": "#factor/limiting_nutrient"
                },
                "value": {
                  "annotationValue": "ethanol",
                  "termAccession": "",
                  "termSource": ""
                }
              },
              {
                "category": {
                  "@id": "#factor/rate"
                },
                "unit": {
                  "@id": "#Unit/l/hour"
                },
                "value": 0.07
              }
            ],
            "name": "sample-E-0.07-aliquot1_1"
          }
isa_example["studies"][0]["materials"]["samples"].append(new_sample)

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json', 'w') as out_fp:
     json.dump(isa_example, out_fp, indent=2)

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json') as file_pointer:
    json2isatab.convert(file_pointer, 'C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing/', validate_first=False)

I modified the BII-S-1 tabular example in 2 different ways, one that adds a new sample before the existing one and another that adds a new sample after the existing one. I have attached them as s_BII-S-1_a.txt and s_BII-S-1_b.txt, respectively, and attached the investigation file with the added protocol as well.
s_BII-S-1_a.txt
s_BII-S-1_b.txt
i_investigation.txt

s_BII-S-1_a.txt will not convert because validation fails: the samples referenced in the assays are not found in the study file.
s_BII-S-1_b.txt fails with a traceback:

Traceback (most recent call last):

  File "C:\Users\Sparda\AppData\Local\Temp\ipykernel_5600\2959741035.py", line 1, in <cell line: 1>
    isa_json = isatab2json.convert('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/tab/BII-I-1_conversion_testing2', use_new_parser=True, validate_first=False)

  File "C:\Python310\lib\site-packages\isatools\convert\isatab2json.py", line 56, in convert
    ISA = isatab.load(fp)

  File "C:\Python310\lib\site-packages\isatools\isatab\load\core.py", line 283, in load
    ).create_from_df(study_tfile_df)

  File "C:\Python310\lib\site-packages\isatools\isatab\load\ProcessSequenceFactory.py", line 398, in create_from_df
    if source_node_context not in sample_node_context.derives_from:

AttributeError: 'NoneType' object has no attribute 'derives_from'

If I try to convert s_BII-S-1_a.txt with validate_first=False it will have the same traceback.

I investigated the traceback, and the issue seems to be that part of the ProcessSequenceFactory assumes there is exactly one Source Name column and one Sample Name column, while another part does not. Some parts of the code look for the literal column names 'Sample Name' and 'Source Name', but the line that actually raises the error uses object_label.startswith('Sample Name'). The use of startswith shows that some parts are aware there can be multiple columns with the same name, yet when the sources and samples are initially determined, only the literal 'Source Name' and 'Sample Name' columns are consulted; additional 'Source Name' and 'Sample Name' columns are never collected.
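The inconsistency can be illustrated with a minimal sketch (plain Python over a hypothetical header row; this is not the actual isatools code):

```python
# Hypothetical header row from a study file with two sample columns.
# pandas renames duplicate headers to "Sample Name.1", etc., which is
# why literal matching and prefix matching disagree.
columns = ["Source Name", "Protocol REF", "Sample Name",
           "Protocol REF.1", "Sample Name.1"]

# What some parts of the loader effectively do: a literal lookup that
# only ever sees the first sample column.
literal_sample_cols = [c for c in columns if c == "Sample Name"]

# What the failing part does: prefix matching, which sees both columns
# and then asks for samples the literal lookup never registered.
prefix_sample_cols = [c for c in columns if c.startswith("Sample Name")]

print(literal_sample_cols)  # ['Sample Name']
print(prefix_sample_cols)   # ['Sample Name', 'Sample Name.1']
```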

So, are multiple "<entity> Name" columns allowed?

Hi @ptth222 thanks for the detailed report.

The behaviour you observed is down to the fact that in the ISA-Tab format, the s_ (study) file can only have one Source Name column (the graph should start with a Source node) and one Sample Name node.
However, multiple Sample Name nodes are allowed in the a_ (assay) files to allow for aliquoting and fractions of a sample.
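To illustrate in terms of column headers (a sketch of the layouts this rule implies, not taken from real files):

```
# Study file: one Source Name and one Sample Name chain
Source Name    Protocol REF    Sample Name

# Assay file: additional Sample Name nodes allowed, e.g. for aliquots
Sample Name    Protocol REF    Sample Name    Protocol REF    Extract Name
```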

Can you open a PR so we can review and acknowledge your contribution to the code base?
many thanks and hi to Hunter.

I assume for the PR you are talking about issue #501. I have created a PR for that issue.

For this issue, I am not convinced that the study file can only have one Sample Name node while assay files can have several. If anything, it seems to be the opposite. At the very least, the documentation and the code are at odds with what you have said, and they should be reconciled.

First, I will reiterate what the documentation says:

The last sentence here https://isa-specs.readthedocs.io/en/latest/isatab.html#study-table-file suggests to me that it should be possible:
"Node properties, such as Characteristics (for Material nodes), Parameter Value (for Process nodes) and additional Name columns for special cases of Process node to disambiguate Protocol REF entries of the same type, MUST follow the named node of context."

This is specifically saying there can be additional Name columns in the study file.

Secondly, as I said previously, the JSON-to-Tab converter can generate a study file with multiple Sample Name columns. In the write_study_table_files function you can see that the code explicitly counts Sample Name nodes:

        sample_in_path_count = 0
        protocol_in_path_count = 0
        longest_path = _longest_path_and_attrs(paths, s_graph.indexes)
        
        for node_index in longest_path:
            node = s_graph.indexes[node_index]
            if isinstance(node, Source):
                olabel = "Source Name"
                columns.append(olabel)
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
            elif isinstance(node, Process):
                olabel = "Protocol REF.{}".format(protocol_in_path_count)
                columns.append(olabel)
                protocol_in_path_count += 1
                if node.executes_protocol.name not in protnames.keys():
                    protnames[node.executes_protocol.name] = protrefcount
                    protrefcount += 1
                columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                       node.parameter_values))
                if node.date is not None:
                    columns.append(olabel + ".Date")
                if node.performer is not None:
                    columns.append(olabel + ".Performer")
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))

            elif isinstance(node, Sample):
                olabel = "Sample Name.{}".format(sample_in_path_count)
                columns.append(olabel)
                sample_in_path_count += 1
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
                columns += flatten(map(lambda x: get_fv_columns(olabel, x),
                                       node.factor_values))
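Presumably the numeric suffixes ("Sample Name.0", "Sample Name.1", ...) are internal disambiguators that get stripped when the table is serialised, which would explain how the converter ends up writing a study file with duplicate "Sample Name" headers. A sketch of that assumption (illustrative, not the actual isatools serialisation code):

```python
import re

def strip_suffix(label):
    """Drop a trailing '.<number>' disambiguator from a column label."""
    return re.sub(r"\.\d+$", "", label)

# Internal labels as built by the column-counting loop above:
internal = ["Source Name", "Protocol REF.0", "Sample Name.0",
            "Protocol REF.1", "Sample Name.1"]
written = [strip_suffix(c) for c in internal]
print(written)
# ['Source Name', 'Protocol REF', 'Sample Name', 'Protocol REF', 'Sample Name']
```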

The write_assay_table_files function, however, does not count Sample Name nodes. You can see that at one point it did, but that code has been commented out:

    for study_obj in inv_obj.studies:
        for assay_obj in study_obj.assays:
            a_graph = assay_obj.graph
            if a_graph is None:
                break
            protrefcount = 0
            protnames = dict()

            def flatten(current_list):
                return [item for sublist in current_list for item in sublist]

            columns = []

            # start_nodes, end_nodes = _get_start_end_nodes(a_graph)
            paths = _all_end_to_end_paths(
                a_graph, [x for x in a_graph.nodes()
                          if isinstance(a_graph.indexes[x], Sample)])
            if len(paths) == 0:
                log.info("No paths found, skipping writing assay file")
                continue
            if _longest_path_and_attrs(paths, a_graph.indexes) is None:
                raise IOError(
                    "Could not find any valid end-to-end paths in assay graph")
            for node_index in _longest_path_and_attrs(paths, a_graph.indexes):
                node = a_graph.indexes[node_index]
                if isinstance(node, Sample):
                    olabel = "Sample Name"
                    # olabel = "Sample Name.{}".format(sample_in_path_count)
                    # sample_in_path_count += 1
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    if write_factor_values:
                        columns += flatten(
                            map(lambda x: get_fv_columns(olabel, x),
                                node.factor_values))

                elif isinstance(node, Process):
                    olabel = "Protocol REF.{}".format(
                        node.executes_protocol.name)
                    columns.append(olabel)
                    if node.executes_protocol.name not in protnames.keys():
                        protnames[node.executes_protocol.name] = protrefcount
                        protrefcount += 1
                    if node.date is not None:
                        columns.append(olabel + ".Date")
                    if node.performer is not None:
                        columns.append(olabel + ".Performer")
                    columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                           node.parameter_values))
                    if node.executes_protocol.protocol_type:
                        oname_label = get_column_header(
                            node.executes_protocol.protocol_type.term,
                            protocol_types_dict
                        )
                        if oname_label is not None:
                            columns.append(oname_label)
                        elif node.executes_protocol.protocol_type.term.lower() \
                                in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]:
                            columns.extend(
                                ["Hybridization Assay Name",
                                 "Array Design REF"])
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    for output in [x for x in node.outputs if
                                   isinstance(x, DataFile)]:
                        columns.append(output.label)
                        columns += flatten(
                            map(lambda x: get_comment_column(output.label, x),
                                output.comments))

                elif isinstance(node, Material):
                    olabel = node.type
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_characteristic_columns(olabel, x),
                            node.characteristics))
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))

                elif isinstance(node, DataFile):
                    pass  # handled in process

I also modified an example to have multiple Sample Name columns in an assay file, and it does not convert to JSON correctly. Specifically, no error is raised, but the samples from the second Sample Name column do not appear anywhere in the JSON.
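A quick way to check this is to recursively gather every sample name from the converted dict and test for membership (sketch; the fragment below stands in for the real converted file):

```python
def collect_sample_names(node):
    """Recursively gather names from every 'samples' list in
    ISA-JSON-like data."""
    names = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "samples" and isinstance(value, list):
                names.update(s.get("name", "") for s in value
                             if isinstance(s, dict))
            names |= collect_sample_names(value)
    elif isinstance(node, list):
        for item in node:
            names |= collect_sample_names(item)
    return names

# Toy ISA-JSON fragment standing in for the real converted file:
isa_json = {"studies": [{"materials": {"samples": [
    {"@id": "#sample/s1", "name": "sample-1"}]}}]}
print("sample-1" in collect_sample_names(isa_json))          # True
print("sample-1-aliquot" in collect_sample_names(isa_json))  # False
```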

It should also be noted that multiple Sample Name columns in the study or assay files do not produce any sort of validation error or warning.

I hope I have demonstrated how both the documentation and the code contradict what you said about Sample Name columns. If you are confident about how multiple Sample Name columns are supposed to work, then the documentation and the code, both validation and conversion, should be changed to reflect that. If what I have shown is correct, however, then the ProcessSequenceFactory code needs to be changed to look for more than one Sample Name column. It may need to be changed regardless, because it builds one set of samples from the study file and uses that as ground truth for the assays, so any new samples introduced in an assay (by additional Sample Name columns) will not be found. Either way, the code and its behavior do not agree with what you have said, and they need to be reconciled. I don't mind making the code changes myself, but I need to be sure about how things are supposed to function.
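A minimal sketch of what "look for more than one Sample Name column" might mean (illustrative only, not a patch against ProcessSequenceFactory):

```python
def all_samples(rows, columns):
    """Collect the union of sample names across every Sample Name
    column, so aliquots introduced by a second column are not missed."""
    sample_cols = [i for i, c in enumerate(columns)
                   if c == "Sample Name" or c.startswith("Sample Name.")]
    return {row[i] for row in rows for i in sample_cols if row[i]}

# Hypothetical study-file rows with two Sample Name columns:
columns = ["Source Name", "Protocol REF", "Sample Name",
           "Protocol REF.1", "Sample Name.1"]
rows = [["src1", "growth", "sampleA", "growth 2", "sampleA_1"]]
print(sorted(all_samples(rows, columns)))  # ['sampleA', 'sampleA_1']
```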

I worked on another project for a bit, but now I am moving back to the one that involves this package. I really need this to be resolved so that I can move forward. Please consider this a gentle reminder. If a meeting would be better, then I would be happy to meet.