sbraz/pymediainfo

xml reading is outdated

srirams opened this issue · 12 comments

it seems the new xml format includes a namespace as well....

changing the above to:

        ns_uri = 'https://mediaarea.net/mediainfo'
        ns = f'{{{ns_uri}}}'
        if xml_dom.tag == "File":
            xpath = "track"
        elif xml_dom.tag == f"{ns}MediaInfo":
            # register_namespace() expects the bare URI, without the braces
            ET.register_namespace('', ns_uri)
            xpath = f"{ns}media/{ns}track"
        else:
            xpath = "File/track"

node_name = el.tag.lower().strip().strip('_')

adding:

        ns = '{https://mediaarea.net/mediainfo}'
        if node_name.startswith(ns):
            node_name = node_name[len(ns):]

but IMHO the better path may be to drop the xml output and use json instead. I ran into this problem because I was getting an error parsing a file (xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 29, column 11), because of some encoding issues, even though the xml seems to be fine. As a workaround I was reading the xml, converting it to ascii and feeding it back into MediaInfo.
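
A minimal sketch of the namespace handling proposed above, using only the standard library. The sample document, the `sample.wmv` reference and the `local_name` and `find_tracks` helpers are made up for illustration; this is not pymediainfo's actual code.

```python
import xml.etree.ElementTree as ET

NS_URI = "https://mediaarea.net/mediainfo"
NS = f"{{{NS_URI}}}"  # ElementTree spells qualified tags as "{uri}local"

# Hypothetical, trimmed-down document in the namespaced (--output=XML) layout.
SAMPLE = """<MediaInfo xmlns="https://mediaarea.net/mediainfo" version="2.0">
<media ref="sample.wmv">
<track type="General"><Format>Windows Media</Format></track>
<track type="Video"><Format>VC-1</Format></track>
</media>
</MediaInfo>"""


def local_name(tag):
    """Drop the MediaInfo namespace prefix from a tag, if present."""
    return tag[len(NS):] if tag.startswith(NS) else tag


def find_tracks(root):
    """Return the track elements for either the old or the new layout."""
    if root.tag == "File":
        xpath = "track"
    elif root.tag == f"{NS}MediaInfo":
        xpath = f"{NS}media/{NS}track"
    else:
        xpath = "File/track"
    return root.findall(xpath)


root = ET.fromstring(SAMPLE)
# One (track type, {field name: value}) pair per track.
tracks = [
    (track.get("type"), {local_name(child.tag).lower(): child.text for child in track})
    for track in find_tracks(root)
]
print(tracks)
```

Keeping the bare URI and the braced prefix as two separate constants avoids mixing them up: ElementTree uses the braced form in tag names, while functions such as `ET.register_namespace` take the bare URI.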

sbraz commented

Hi,

it seems the new xml format includes a namespace as well....

What new format? Is this something in MediaInfoLib's git repo? Version 20.03 works fine, can you show me a way to reproduce this?
If you mean --output=XML as opposed to --output=OLDXML, I am aware of it but I can't migrate to the new output without breaking the track structure. I don't plan on changing the way tracks are formatted until the library itself drops support for OLDXML. In the meantime, there might be ways to get the kind of output you want by passing extra parameters (see my next answer).

If you're using the MediaInfo.parse method, you should not even notice that the XML output method was renamed.

use json instead

The JSON output is also quite different so I can't use it without breaking everything either. You can get a JSON str if you set output="JSON". Please check out the documentation and let me know if it helps.

because of some encoding issues

There is an encoding_errors parameter for that. Can you upload a file that exhibits the issue? I remember someone requesting that parameter but I don't remember having a test file. Such a bug should be reported to MediaInfo's upstream and I can take care of that.
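
For illustration only: assuming encoding_errors ends up as the `errors` argument of a Python encode/decode call (the pymediainfo documentation is the authority on what it actually does), the standard error handlers behave as sketched below. The byte string is a made-up example, not data from the reported file.

```python
# A lone 0xA9 byte is not valid UTF-8 (it would be the copyright sign in Latin-1).
raw = b"<Copyright>\xa9 Ron Harris</Copyright>"

# The default "strict" handler raises on the first undecodable byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for each undecodable byte instead of raising.
replaced = raw.decode("utf-8", errors="replace")

# A non-character such as U+FFFE is a different case: it encodes and decodes
# without any error, so no error handler is ever triggered for it.
roundtrip = "\ufffe".encode("utf-8").decode("utf-8", errors="strict")

print(strict_failed, replaced, repr(roundtrip))
```

The last case matters for a report like this one: an error handler only helps with bytes that cannot be decoded at all, not with characters that decode cleanly but are rejected later by an XML parser.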

oops, I didn't realize there was an OLDXML. I was taking the --output=XML and trying to load it back in.

this is the file I'm having problems with:

<?xml version="1.0" encoding="UTF-8"?>
<MediaInfo
    xmlns="https://mediaarea.net/mediainfo"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://mediaarea.net/mediainfo https://mediaarea.net/mediainfo/mediainfo_2_0.xsd"
    version="2.0">
<creatingLibrary version="20.03" url="https://mediaarea.net/MediaInfo">MediaInfoLib</creatingLibrary>
<track type="General">
<VideoCount>1</VideoCount>
<AudioCount>1</AudioCount>
<FileExtension>wmv</FileExtension>
<Format>Windows Media</Format>
<FileSize>306608820</FileSize>
<Duration>473.307</Duration>
<OverallBitRate>5182409</OverallBitRate>
<OverallBitRate_Maximum>5140448</OverallBitRate_Maximum>
<FrameRate>29.970</FrameRate>
<FrameCount>14185</FrameCount>
<StreamSize>3219033</StreamSize>
<HeaderSize>1046</HeaderSize>
<DataSize>306604706</DataSize>
<Encoded_Date>UTC 2011-11-14 19:26:36.000</Encoded_Date>
<File_Created_Date>UTC 2019-03-24 01:27:12.464</File_Created_Date>
<File_Created_Date_Local>2019-03-23 20:27:12.464</File_Created_Date_Local>
<File_Modified_Date>UTC 2011-12-30 09:54:16.000</File_Modified_Date>
<File_Modified_Date_Local>2011-12-30 04:54:16.000</File_Modified_Date_Local>
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
</track>
<track type="Video">
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<Format>VC-1</Format>
<Format_Profile>Main</Format_Profile>
<CodecID>WMV3</CodecID>
<Duration>473.307</Duration>
<BitRate>5000000</BitRate>
<Width>1920</Width>
<Height>1080</Height>
<PixelAspectRatio>1.000</PixelAspectRatio>
<DisplayAspectRatio>1.778</DisplayAspectRatio>
<FrameRate>29.970</FrameRate>
<FrameCount>14185</FrameCount>
<ColorSpace>YUV</ColorSpace>
<ChromaSubsampling>4:2:0</ChromaSubsampling>
<BitDepth>8</BitDepth>
<ScanType>Progressive</ScanType>
<Compression_Mode>Lossy</Compression_Mode>
<StreamSize>295816875</StreamSize>
<extra>
<Duration_Source>General_Duration</Duration_Source>
</extra>
</track>
<track type="Audio">
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<Format>WMA</Format>
<Format_Version>2</Format_Version>
<CodecID>161</CodecID>
<Duration>473.307</Duration>
<BitRate>128000</BitRate>
<Channels>2</Channels>
<SamplingRate>44100</SamplingRate>
<SamplingCount>20872839</SamplingCount>
<BitDepth>16</BitDepth>
<StreamSize>7572912</StreamSize>
<StreamSize_Proportion>0.02470</StreamSize_Proportion>
<extra>
<Duration_Source>General_Duration</Duration_Source>
</extra>
</track>
</media>
</MediaInfo>

edit: I'm reading the file with:

xml = pymediainfo.MediaInfo.parse(file_path, encoding_errors="replace", output="OLDXML")

sbraz commented

I was taking the --output=XML and trying to load it back in.

Any reason why you were not using the built-in parse method? It's faster and more portable (no need for the mediainfo binary, you just need the library and it is bundled in the Windows/OSX wheels).

this is the file I'm having problems with:

Ah, I see. I need the file itself to create an issue though. Can you attach it please (just the first few KiB should be enough)?
I wonder if JSON output is broken as well.

Sorry, I should have been clearer. I'm using the built-in parse and storing the JSON from to_json. When I ran into the problem with this file, I tried to work around it by using output="XML" from parse, converting to ASCII and loading it back in. Unfortunately I didn't realize I should have used "OLDXML" instead :).

I've included a sample file below:

sample.zip

I think it's xml.etree.ElementTree that doesn't like the Unicode in the <Copyright> field, although it seems to be valid XML.

sbraz commented

Apparently it is invalid XML:

$ mediainfo --output=OLDXML sample.wmv  | xmllint --format -
-:12: parser error : Char 0xFFFE out of allowed range
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
           ^
-:12: parser error : PCDATA invalid Char value 65534
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
           ^

The first character is not a valid Unicode character according to Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

This can mean two things:

  • MediaInfo has a bug and misinterprets the field
  • The file was created with a broken copyright field
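
The xmllint failure above can be reproduced with Python's own parser. The sketch below is only an illustration: the `strip_invalid_xml_chars` helper is hypothetical, and the character class follows the `Char` production of XML 1.0, which excludes U+FFFE and U+FFFF. Stripping the character hides the symptom; it does not repair the underlying field.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that an XML 1.0 document may not contain."""
    return _INVALID_XML_CHARS.sub("", text)


# Made-up element whose text starts with the non-character U+FFFE.
bad_xml = "<Copyright>\ufffe\u00a9 Ron Harris</Copyright>"

try:
    ET.fromstring(bad_xml)
    rejected = False
except ET.ParseError:
    rejected = True

# After stripping the non-character, the same element parses.
cleaned = ET.fromstring(strip_invalid_xml_chars(bad_xml))
print(rejected, cleaned.text)
```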

Apparently other MediaInfo-based libraries have had similar issues in the past:
mhor/php-mediainfo#92

Do you know if there is any reason why this file's copyright contains Chinese characters?

I don't know if this is the same problem or not, but in my environment the XML output is missing some parameters such as the codec value, even though they are shown in the text format with the -f option:

~/Music$ mediainfo 1335588995-tennis_prog_pal_h264.ts -f
General
Count : 330
Count of stream of this kind : 1
Kind of stream : General
Kind of stream : General
Stream identifier : 0
ID : 1
ID : 1 (0x1)
Count of video streams : 1
Count of audio streams : 1
Video_Format_List : AVC
Video_Format_WithHint_List : AVC
Codecs Video : AVC
Audio_Format_List : MPEG Audio
Audio_Format_WithHint_List : MPEG Audio
Audio codecs : MPEG-1 Audio layer 2
Complete name : 1335588995-tennis_prog_pal_h264.ts
File name : 1335588995-tennis_prog_pal_h264
File extension : ts
Format : MPEG-TS
Format : MPEG-TS
Format/Extensions usually used : ts m2t m2s m4t m4s tmf ts tp trp ty
Commercial name : MPEG-TS
Internet media type : video/MP2T
Codec : MPEG-TS
Codec : MPEG-TS
Codec/Extensions usually used : ts m2t m2s m4t m4s tmf ts tp trp ty
File size : 9860224
File size : 9.40 MiB
File size : 9 MiB
File size : 9.4 MiB
File size : 9.40 MiB
File size : 9.403 MiB
Duration : 35766.031250
Duration : 35 s 766 ms
Duration : 35 s 766 ms
Duration : 35 s 766 ms
Duration : 00:00:35.766
Duration : 00:00:34:01
Duration : 00:00:35.766 (00:00:34:01)
Overall bit rate mode : CBR
Overall bit rate mode : Constant
Overall bit rate : 2204864
Overall bit rate : 2 205 kb/s
Frame rate : 25.000
Frame rate : 25.000 FPS
Frame count : 851
Stream size : 943626
Stream size : 922 KiB (10%)
Stream size : 922 KiB
Stream size : 922 KiB
Stream size : 922 KiB
Stream size : 921.5 KiB
Stream size : 922 KiB (10%)
Proportion of this stream : 0.09570
File last modification date : UTC 2014-10-07 12:25:25
File last modification date (local) : 2014-10-07 14:25:25
OverallBitRate_Precision_Min : 2204833
OverallBitRate_Precision_Max : 2204895

Video
Count : 342
Count of stream of this kind : 1
Kind of stream : Video
Kind of stream : Video
Stream identifier : 0
StreamOrder : 0-0
ID : 289
ID : 289 (0x121)
Menu ID : 1
Menu ID : 1 (0x1)
Format : AVC
Format/Info : Advanced Video Codec
Format/Url : http://developers.videolan.org/x264.html
Commercial name : AVC
Format profile : High@L3
Format settings : 1 Ref Frames
Format settings, CABAC : No
Format settings, CABAC : No
Format settings, ReFrames : 1
Format settings, ReFrames : 1 frame
Internet media type : video/H264
Codec ID : 27
Codec : AVC
Codec : AVC
Codec/Family : AVC
Codec/Info : Advanced Video Codec
Codec/Url : http://developers.videolan.org/x264.html
Codec profile : High@L3
Codec settings : 1 Ref Frames
Codec settings, CABAC : No
Codec_Settings_RefFrames : 1
Duration : 34040
Duration : 34 s 40 ms
Duration : 34 s 40 ms
Duration : 34 s 40 ms
Duration : 00:00:34.040
Duration : 00:00:34:01
Duration : 00:00:34.040 (00:00:34:01)
Bit rate mode : CBR
Bit rate mode : Constant
Bit rate : 2000000
Bit rate : 2 000 kb/s
Width : 720
Width : 720 pixels
Height : 576
Height : 576 pixels
Sampled_Width : 720
Sampled_Height : 576
Pixel aspect ratio : 1.067
Display aspect ratio : 1.333
Display aspect ratio : 4:3
Frame rate : 25.000
Frame rate : 25.000 FPS
Frame count : 851
Standard : PAL
Resolution : 8
Resolution : 8 bits
Colorimetry : 4:2:0
Color space : YUV
Chroma subsampling : 4:2:0
Chroma subsampling : 4:2:0
Bit depth : 8
Bit depth : 8 bits
Scan type : Progressive
Scan type : Progressive
Interlacement : PPF
Interlacement : Progressive
Bits/(Pixel*Frame) : 0.193
Delay : 2104.067
Delay : 2 s 104 ms
Delay : 2 s 104 ms
Delay : 2 s 104 ms
Delay : 00:00:02.104
Delay, origin : Container
Delay, origin : Container
Stream size : 8780662
Stream size : 8.37 MiB (89%)
Stream size : 8 MiB
Stream size : 8.4 MiB
Stream size : 8.37 MiB
Stream size : 8.374 MiB
Stream size : 8.37 MiB (89%)
Proportion of this stream : 0.89051
Buffer size : 4000768

Audio
Count : 275
Count of stream of this kind : 1
Kind of stream : Audio
Kind of stream : Audio
Stream identifier : 0
StreamOrder : 0-1
ID : 297
ID : 297 (0x129)
Menu ID : 1
Menu ID : 1 (0x1)
Format : MPEG Audio
Commercial name : MPEG Audio
Format version : Version 1
Format profile : Layer 2
Internet media type : audio/mpeg
Codec ID : 3
Codec : MPA1L2
Codec : MPEG-1 Audio layer 2
Duration : 33984
Duration : 33 s 984 ms
Duration : 33 s 984 ms
Duration : 33 s 984 ms
Duration : 00:00:33.984
Duration : 00:00:33:20
Duration : 00:00:33.984 (00:00:33:20)
Bit rate mode : CBR
Bit rate mode : Constant
Bit rate : 32000
Bit rate : 32.0 kb/s
Channel(s) : 1
Channel(s) : 1 channel
Samples per frame : 1152
Sampling rate : 32000
Sampling rate : 32.0 kHz
Samples count : 1087488
Frame rate : 27.778
Frame rate : 27.778 FPS (1152 SPF)
Frame count : 944
Compression mode : Lossy
Compression mode : Lossy
Delay : 2000.378
Delay : 2 s 0 ms
Delay : 2 s 0 ms
Delay : 2 s 0 ms
Delay : 00:00:02.000
Delay, origin : Container
Delay, origin : Container
Delay relative to video : -104
Delay relative to video : -104 ms
Delay relative to video : -104 ms
Delay relative to video : -104 ms
Delay relative to video : -00:00:00.104
Video0 delay : -104
Video0 delay : -104 ms
Video0 delay : -104 ms
Video0 delay : -104 ms
Video0 delay : -00:00:00.104
Stream size : 135936
Stream size : 133 KiB (1%)
Stream size : 133 KiB
Stream size : 133 KiB
Stream size : 133 KiB
Stream size : 132.8 KiB
Stream size : 133 KiB (1%)
Proportion of this stream : 0.01379


XML Output:

mediainfo --Output=XML 1335588995-tennis_prog_pal_h264.ts -f

Mediainfo (1).txt

Is this related to XML? #

sbraz commented

codec value

You must be running an old version of the library for your mediainfo CLI. Codec was replaced with Format a long time ago.

Also the attached MediaInfo XML output is not for 1335588995-tennis_prog_pal_h264.ts so I can't really compare it to the non-XML version. I doubt there is a bug here, but if there is, you need to report it to MediaInfo directly.

Apparently it is invalid XML [...] BOM [...]

I am fixing that, but the fix will only be in the newest version of the library.

sbraz commented

Hi @JeromeMartinez, thanks! Will this just remove the BOM or also change the endianness? I still don't know if those Chinese characters are valid or if they are some kind of glitch due to the wrong endianness being used.

Will this just remove the BOM or also change the endianness?

Well, I read too quickly and didn't catch the wrong byte order (UTF-16BE instead of the UTF-16LE expected in WM files).

I added a commit that reorders the bytes when this issue appears.
For the example, it makes more sense (Copyright becomes "© Ron Harris").
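
The effect of the wrong byte order can be reproduced in a few lines of Python. This is a sketch of the idea only, not the MediaInfoLib commit: the `swap_utf16_byte_order` helper is hypothetical, and it assumes the field holds a BOM followed by "© Ron Harris" in UTF-16LE that was read as UTF-16BE.

```python
BOM = "\ufeff"

# Assumed original content of the field: a BOM followed by the text, in UTF-16LE.
intended = BOM + "\u00a9 Ron Harris"
raw = intended.encode("utf-16-le")

# Reading the same bytes with the wrong endianness turns the BOM into the
# non-character U+FFFE and every Latin letter into an unrelated CJK code point.
garbled = raw.decode("utf-16-be")


def swap_utf16_byte_order(text):
    """Re-interpret text as if each 16-bit code unit had its bytes swapped."""
    return text.encode("utf-16-be").decode("utf-16-le")


repaired = swap_utf16_byte_order(garbled).lstrip(BOM)
print(repr(garbled[0]), repaired)
```

A leading U+FFFE is a convenient signal for this kind of repair, because a correctly decoded stream should never contain it.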

sbraz commented

For the example, it makes more sense (Copyright becomes "© Ron Harris").

Thanks, it looks more sensible indeed!

sbraz commented

Closing this since it is a MediaInfo issue.