sbraz/pymediainfo

CLI-equivalent field names

eugenesvk opened this issue · 8 comments

I have a Python script that parses the output of a command-line mediainfo utility

import json
import subprocess

def getMediaInfo(mediafile):
  # Pass the arguments as a list so file names with quotes or spaces need no shell escaping
  cmd = ["mediainfo", "-f", "--Output=JSON", mediafile]
  proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
  stdout, stderr = proc.communicate()
  data = json.loads(stdout)  # Decode JSON: deserialize stdout to a Python object (object→dict)
  return data

I thought I'd convert it to use the MediaInfo library instead, as that seems like the proper way to do it.
However, I've noticed that using your script

from pymediainfo import MediaInfo
def getMediaInfo2(mediafile):
  MIJSON = MediaInfo.parse(mediafile).to_json()
  print(MIJSON)

gives me different field names, e.g. instead of
Encoded_Application or UniqueID I get
writing_application or unique_id

A cursory reading of your plugin suggests it gets the information directly as XML (I've tried the --Output=XML command-line option, but it gives the same field names as --Output=JSON), but I don't understand where these different field names are coming from
The library code also has only the Encoded_Application text, but nothing for writing_application
I've also tried to pass an extra option like so: MIJSON = MediaInfo.parse(mediafile, mediainfo_options={"Output": "XML"}).to_json(), but this gave me no output at all

Is there a way to get from the library the same field names as the ones I get when I invoke MediaInfo directly from the command line with the XML/JSON formatting options?
I understand that my issue may have nothing to do with your wrapper and there is just some fundamental difference between the command line utility and the library that I don't get, so apologies in advance

sbraz commented

Hi,
The library relies on the OLDXML output format:

xml_option = "OLDXML"

lib.MediaInfo_Option(handle, "Inform", "" if text else xml_option)

You will get more or less the same output as if you were running mediainfo -f <file>, except that field names are converted to lower-case, spaces are replaced with underscores and repeated fields are put in an other_<something> attribute. The code is here:

node_name = el.tag.lower().strip().strip('_')
if node_name == 'id':
    node_name = 'track_id'
node_value = el.text
other_node_name = "other_%s" % node_name
if getattr(self, node_name) is None:
    setattr(self, node_name, node_value)
else:
    if getattr(self, other_node_name) is None:
        setattr(self, other_node_name, [node_value, ])
    else:
        getattr(self, other_node_name).append(node_value)

This will explain the UniqueID → unique_id change.
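As a rough illustration of that conversion, here is a minimal standalone sketch. It is not the library's code: it stores fields in a plain dict instead of object attributes, and the sample XML (including the Unique_ID and Duration tags and their values) is a hand-written assumption, not real MediaInfo output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written track element; real OLDXML output has many more fields.
SAMPLE_TRACK = """
<track type="General">
  <Unique_ID>1234</Unique_ID>
  <Duration>5000</Duration>
  <Duration>5 s 0 ms</Duration>
  <ID>1</ID>
</track>
"""

def normalize_track(xml_text):
    """Collect child elements into a dict, mimicking the attribute logic quoted above."""
    fields = {}
    for el in ET.fromstring(xml_text):
        # Lower-case the tag and trim stray underscores
        name = el.tag.lower().strip().strip("_")
        if name == "id":
            name = "track_id"
        if name not in fields:
            # First occurrence becomes the main value
            fields[name] = el.text
        else:
            # Repeated fields are appended to an other_<name> list
            fields.setdefault("other_" + name, []).append(el.text)
    return fields

track = normalize_track(SAMPLE_TRACK)
print(track)
```

The dict membership test stands in for the `getattr(...) is None` check, so empty elements may be handled slightly differently than in the library.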

As for Encoded_Application becoming writing_application, it looks like the former is the internal attribute name whereas the latter is the human-readable name. See that file for details.

Apparently, Inform formats XML and JSON return the internal names and OLDXML and the default text format (empty value for Inform) return the human-readable names.

For your use case, I'd do something like that, using text=True to disable XML parsing:

json.loads(pymediainfo.MediaInfo.parse("tests/data/sample.mkv", text=True, mediainfo_options={"Inform": "JSON"}))

Maybe I should mention this in the documentation or add a better format option to the parse method that would directly pass the format to MediaInfo's Inform parameter. What do you think?

Thanks a lot for your prompt and detailed response!

For your use case, I'd do something like that, using text=True to disable XML parsing:

json.loads(pymediainfo.MediaInfo.parse("tests/data/sample.mkv", text=True, mediainfo_options={"Inform": "JSON"}))

This works exactly like I want it to, the output is identical to a command line command and it starts with the 'media': {'@ref': instead of a track, so I don't need to modify any of my parsings and just use it as a drop-in replacement!!!
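For anyone reading along, here is a small sketch of how a dict with this layout could be navigated. The sample string is a made-up, abbreviated stand-in for real output: only the 'media' and '@ref' keys come from this thread, while the "track" list, the "@type" key, the helper name find_track and all the values are assumptions for illustration.

```python
import json

# Hypothetical, abbreviated stand-in for JSON output; values are invented.
SAMPLE = """
{"media": {"@ref": "sample.mkv",
           "track": [
             {"@type": "General", "UniqueID": "1234",
              "Encoded_Application": "ExampleMuxer 1.0"},
             {"@type": "Video", "Format": "AVC"}
           ]}}
"""

def find_track(data, track_type):
    """Return the first track whose @type matches, or None if there is none."""
    for track in data["media"]["track"]:
        if track.get("@type") == track_type:
            return track
    return None

data = json.loads(SAMPLE)
general = find_track(data, "General")
print(general["Encoded_Application"])
```

Because the field names stay in their internal form (Encoded_Application, UniqueID), parsing code written against the command-line JSON should not need renaming.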

This will explain the UniqueID → unique_id change.

I figured out a bit later the source of the underscores, but I had no idea about the following, thanks for clarifying:

As for Encoded_Application becoming writing_application, it looks like the former is the internal attribute name whereas the latter is the human-readable name.
Apparently, Inform formats XML and JSON return the internal names and OLDXML and the default text format (empty value for Inform) return the human-readable names.

Documentation
What would've helped me is having a few examples of commands and the corresponding full output so I have the full view of the data structure. Then I would've just copy&pasted the command that corresponds to the data output I'd like to work with (in my case, identical to what I already have)

Extra options

  • I'd suggest to name this option as Output in addition to Inform as this adds familiarity to the command line users. This also seems to be the way of the library itself: this commit states that

--Output is synonym of --Inform option

  • It would also be great if this option automatically enabled the text bool, as there seems to be no case where you'd need to specify Output but leave text at its default (False), right?

sbraz commented

What would've helped me is having a few examples of commands and the corresponding full output so I have the full view of the data structure. Then I would've just copy&pasted the command that corresponds to the data output I'd like to work with (in my case, identical to what I already have)

To be honest, I had no idea there was a JSON output :)

I'd suggest to name this option as Output in addition to Inform

My idea is to have an option, maybe named output, that would deprecate the text option. Setting it to anything non-default would disable XML processing. What do you think? I would then add examples of parse's return value with different values for output.
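To make the proposal concrete, here is a hedged sketch of how such an option might map onto the arguments used earlier in this thread. The helper name build_parse_kwargs and the choice of None as the default are assumptions for illustration only, not the library's actual interface.

```python
def build_parse_kwargs(output=None):
    """Translate a hypothetical `output` value into keyword arguments for parse."""
    if output is None:
        # Default: keep the usual behaviour (OLDXML parsed into track objects)
        return {}
    # Any other value: skip XML processing and hand the format to Inform
    return {"text": True, "mediainfo_options": {"Inform": output}}

json_kwargs = build_parse_kwargs("JSON")
print(json_kwargs)

# Possible usage, assuming pymediainfo is installed:
# raw = MediaInfo.parse("tests/data/sample.mkv", **build_parse_kwargs("JSON"))
```

With this mapping, passing "JSON" reproduces the text=True plus Inform call from above, while omitting the argument leaves the existing behaviour untouched.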

Yes, your idea sounds great, the new output option does look like the brighter future for the text!

sbraz commented

@JeromeMartinez Hi Jérôme, is there a value for Output / Inform that corresponds to the default output format or should I simply ask users to set it to ""? I noticed that mediainfo --Inform=Text works too, but then again so does --Inform=randomtexthere :)

is there a value for Output / Inform that corresponds to the default output format or should I simply ask users to set it to ""?

"" is the more or less official way to say to reset to default.
Random text (including "Text") is discarded and the default is used too, but there could be an error message in the future.

sbraz commented

@eugenesvk can you please try the new output branch and let me know if the new documentation is fine as well?

@sbraz I've checked the output branch and the documentation and it seems to be working just fine. Thanks for the fix!