Convoluted conversion from JSON to mwTab directly
ptth222 opened this issue · 15 comments
The whole package essentially assumes everything is starting from a file on disk, so if you already have a JSON dictionary constructed in memory it is not straightforward to create the mwTab format file. One method is to save the JSON out and then read it back in.
Excerpt from old MESSES convert code:
mwtab_json_fpath = "{}.json".format(splitext(results_path)[0])
mwtab_txt_fpath = "{}.txt".format(splitext(results_path)[0])
# save JSON
with open(mwtab_json_fpath, 'w', encoding="utf-8") as outfile:
json.dump(mwtabfile, outfile, indent=4)
# save mwTab (.txt)
with open(mwtab_txt_fpath, 'w', encoding="utf-8") as outfile:
mwfile = next(mwtab.read_files(mwtab_json_fpath))
mwfile.write(outfile, file_format="mwtab")
Another way is a bit hacky:
mwtabfile = mwtab.mwtab.MWTabFile("")
mwtabfile.update(output_json)
mwtabfile.header = " ".join(
["#METABOLOMICS WORKBENCH"]
+ [item[0] + ":" + item[1] for item in mwtabfile["METABOLOMICS WORKBENCH"].items() if item[0] not in ["VERSION", "CREATED_ON"]]
)
with open(args["<output-name>"], 'w', encoding='utf-8') as outfile:
mwtabfile.write(outfile, file_format="mwtab")
The MWTabFile class has a strange init. It requires a "source" input that is supposed to be the path to the file I think, but then self.source is never used by any of its methods. Similarly self.study_id and self.analysis_id are set in the read method and not used. self.header is also set in the read method and then used in the print_file method rather than just printing the header when it is needed like every other section. I feel like this should be retooled so it can accept an initial JSON in the init, and maybe clean up some of the unused stuff, but I don't know that they aren't used in other parts of the package.
The package may have been designed to handle URL, zip, etc., but not the MWTabFile class. Interpreting URL, filepath, or zip appears to be done by the GenericFilePath class, and then the read method of the MWTabFile class is used to read the contents.
The point is that the filename (source) in the init is pointless and is not used by the class at all regardless of what it was designed to do. The value the MWTabFile object is initialized with is irrelevant. It would make more sense to not require a source. Currently the init really doesn't do anything at all, but it could be expanded to accept an optional json input and initialize with that value to make this problem easier.
Example of how this might look:
mwfile = mwtab.mwtab.MWTabFile(mwtab_json)
with open(path_to_txt_open, 'w', encoding='utf-8') as outfile:
mwfile.write(outfile, file_format='mwtab')
This is an example of code that will work just fine even though the source is nonsense:
mwtabfile = mwtab.mwtab.MWTabFile("asdf")
mwtabfile.read("path_to_a_real_file")
I did get the tempfile suggestion working:
with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as tp:
tp.write(json.dumps(output_json))
tp.seek(0)
mwfile = mwtab.mwtab.MWTabFile("")
mwfile.read(tp)
with open(args["<output-name>"], 'w', encoding="utf-8") as outfile:
mwfile.write(outfile, file_format="mwtab")
Notice how it works just fine even though MWTabFile is initialized with an empty string.
I can't decide if the tempfile solution is more or less hacky than the one I already put in the first comment.
I only opened the issue to suggest improving the API of the package so that you can directly save an mwTab formatted JSON as its text version, rather than having to save the JSON to file first. I further looked at what might be the best way to do this and offered some comments on what I found. The actual particulars of how to make this improvement don't really matter to me. I wanted to open the issue and document the need for this because I need it for MESSES convert, but I have 2 ways around saving out and reading back in, so it isn't urgent. I only respond to the other details to try and clear up confusion.
It may make more sense to have the init take no parameters.
Then the read method could take an optional source parameter.
Now additional creation methods can be added as staticmethods or
classmethods.
The current source parameter is unused, so why would the read method need one? What would it do with it?
Also, the read method is using _is_json and the OrderedDict.update
inherited method to populate the MWTabFile object.
(_is_json method has an incorrect :rtype and the json_str variable is
misnamed, since it is a json data structure.)
If OrderDicts are used to create the data structure, then a simple
mwtabfile.update call should absorb the data structure.
Look at the second code block I put in the comment that started this issue. I also found this, but an update alone isn't enough due to the header being handled differently from the other sections for some reason. This is why I commented some about it.
I highly recommend reading these issues on GitHub itself and not in the email so it is easier to see quotes, code blocks, tables, etc. that I formatting to appear well when viewed on the issue page.
I see what you are saying about the header data member.
And the study_id and analysis_id data members need to be set.
But, it does not matter that no code is accessing self.source.
We have it in the MWTabFile object to keep up with it.
Not every data member needs to be programmatically used by the methods of the object.
After looking at the init, it makes more sense to have a default value for source of "".
Probably best to add a staticmethod/classmethod to absorb a data structure and return new MWTabFile object.
Call this creation method from_dict.
Also, from_string and from_file would be useful creation methods to add.
study_id and analysis_id are also not used.
I can understand that a class can have data members that are meant to just be convenience for access or something, but generally you take great care with these so their values are always accurate. For instance, with this source member you would have it be required when reading in a file so that it is always accurate with what the data source actually is, or set it to None if the data is read in without it. Members like study_id and analysis_id make less sense for a class like this since it is a glorified dictionary and those fields are already stored in it.
Those class methods could work. They are just roundabout initializers. As long as easy creation from a dict is achieved, whether it's through the init or a class method isn't all that different, in the end I end up with the same object I can use to save the JSON to text.
After meeting decided to:
- Make extra data members properties and put code in the read method try block there.
- Add from_dict class method that sets "source" as "internal dict at <item_id>".
I want to clarify something with making the data members properties. Do we want to make them read-only? To make them otherwise would require keeping up with whether they have been manually set or modifying the underlying dictionary the members pull from. Making them read-only would technically break backward compatibility, but none of them had any expectation of being modified or needing to be modified, so I think it would be acceptable for them to be read-only.
I don't think there is a way to make these properties without breaking some behavior. Currently, the values are initialized to empty strings and then the read method tries to set them to something else based on the information in the file being read in. If you change them to properties with a getter that tries to give you the underlying value from the dictionary then they will no longer return the empty string value from the current initialization.
# Current
temp = MWTabFile("source")
print(temp.study_id)
# prints the empty string
# Change to properties
temp = MWTabFile("source")
print(temp.study_id)
# raises a keyerror because the underlying dictionary is empty
Is this acceptable or what we want? The change in behavior has to happen to support manually creating your own dictionary I think. I don't think the class was made with that in mind. It was only considering reading from an already existing file and is not expecting changes to occur to the underlying dictionary.
You have to raise an error to get the current same behavior for the read method. Unless you want to have the read method copy what the getter already does. You would also lose part of the point of having a managed property, so that there isn't mismatch between the property and the underlying dictionary.
There are a few issues we have to balance here.
- The data members not matching the underlying dictionary. Currently, this is possible if you don't use the read method to populate the dictionary or if you change something after using the read method.
- Trying to keep the functionality of the data members the same.
- Raising an error in the read method if the data members can't be created due to the file missing "METABOLOMICS WORKBENCH".
- Letting the user know when the data members can't be created from the underlying dictionary. This is similar to 3. It's about being consistent between populating the dictionary manually or using the read method. If you populate using the read method then an error is raised if a data member can't be created, so if you populate manually shouldn't you get a similar error?
I think something has to give here.
I'm not sure why the read method raises an error. It doesn't actually stop the dictionary from being populated, the error is raised after the underlying dictionary is updated from the file contents. You could still use the writing methods after getting the error and it will just print with an empty string or not at all. If there is going to be error checking shouldn't it be on the write out as well? Maybe we could get rid of errors and leave that to validation. The data members just try to create themselves from the underlying dict and if they can't then return the empty string, and it would be up to the user to validate before writing.
I can't decide if it's better to throw an error, return the empty string, or return None if the data member can't be created. I don't like the empty string because a user could think that the underlying value is truly an empty string. I suppose they could get this impression with None as well though. From the stand point of trying to use this class to manually create an MWTab file I personally would rather have the errors, but I would also just use the validation.
The header property looks something like:
'#METABOLOMICS WORKBENCH STUDY_ID:ST000122 ANALYSIS_ID:AN000204'
The internal dictionary for the "METABOLOMICS WORKBENCH" key looks something like:
{'STUDY_ID': 'ST000122',
'ANALYSIS_ID': 'AN000204',
'VERSION': '1',
'CREATED_ON': '2016-09-17'}
The header is created by just pulling out all of the key value pairs in this dict that aren't "VERSION" or "CREATED_ON". It's not too difficult to parse them back out of the header and modify or add the pairs to the dict, but I'm wondering how strict to be when trying to set a header. Currently, if the header trying to be set doesn't match r"#METABOLOMICS WORKBENCH( [A-Z_]+:\w+)*" then I raise an error. I'm not sure if this is too strict or maybe not strict enough.
I'm also concerned about user's expectations regarding setting the header. If someone tries to set something like "#METABOLOMICS WORKBENCH KEY1:asdf KEY2:qwer", then this will add the keys to the dictionary, but if you access the header after that it would return '#METABOLOMICS WORKBENCH STUDY_ID:ST000122 ANALYSIS_ID:AN000204 KEY1:asdf KEY2:qwer', which is different from what was set. Should there be a check to see if the getter returns something different than what was set and either print a message or raise an error or something?