Parsing MediaInfo fails on Chinese chars in XML
Fossil01 opened this issue · 12 comments
In the following XML there are some Chinese characters between the tags. SimpleXML doesn't seem to like those and crashes the process.
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
ErrorException : simplexml_load_string(): Entity: line 54: parser error : Char 0xFFFE out of allowed range
at /var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
14| if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
15| $xmlString = utf8_encode($xmlString);
16| }
17|
> 18| $xml = simplexml_load_string($xmlString);
19| $json = json_encode($xml);
20|
21| return json_decode($json, true);
22| }
Exception trace:
1 simplexml_load_string("<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="19.09">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Count_of_video_streams>1</Count_of_video_streams>
<Count_of_audio_streams>1</Count_of_audio_streams>
<Video_Format_List>VC-1</Video_Format_List>
<Video_Format_WithHint_List>VC-1 (WMV3)</Video_Format_WithHint_List>
<Codecs_Video>VC-1</Codecs_Video>
<Audio_Format_List>WMA</Audio_Format_List>
<Audio_Format_WithHint_List>WMA</Audio_Format_WithHint_List>
<Audio_codecs>WMA</Audio_codecs>
<Complete_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6/782247_ohrly-rh131aaso.wmv</Complete_name>
<Folder_name>/mnt/ramdisk/5/54c9f93b-8550-4100-8eeb-328841dc00d6</Folder_name>
<File_name_extension>782247_ohrly-rh131aaso.wmv</File_name_extension>
<File_name>782247_ohrly-rh131aaso</File_name>
<File_extension>wmv</File_extension>
<Format>Windows Media</Format>
<Format>Windows Media</Format>
<Format_Extensions_usually_used>asf dvr-ms wma wmv</Format_Extensions_usually_used>
<Commercial_name>Windows Media</Commercial_name>
<Internet_media_type>video/x-ms-wmv</Internet_media_type>
<File_size>760169</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742 KiB</File_size>
<File_size>742.4 KiB</File_size>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Overall_bit_rate>12435</Overall_bit_rate>
<Overall_bit_rate>12.4 kb/s</Overall_bit_rate>
<Maximum_Overall_bit_rate>5136894</Maximum_Overall_bit_rate>
<Maximum_Overall_bit_rate>5 137 kb/s</Maximum_Overall_bit_rate>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 FPS</Frame_rate>
<Frame_count>14657</Frame_count>
<HeaderSize>1046</HeaderSize>
<DataSize>759123</DataSize>
<Performer>Ron Harris</Performer>
<Encoded_date>UTC 2012-05-14 00:53:44.000</Encoded_date>
<File_last_modification_date>UTC 2019-12-17 17:20:55</File_last_modification_date>
<File_last_modification_date__local_>2019-12-17 18:20:55</File_last_modification_date__local_>
<Copyright>�꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
<Comment>HD Videos</Comment>
</track>
<track type="Video">
<Count>377</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Video</Kind_of_stream>
<Kind_of_stream>Video</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<ID>1</ID>
<Format>VC-1</Format>
<Format>VC-1</Format>
<Commercial_name>VC-1</Commercial_name>
<Format_profile>Main</Format_profile>
<Internet_media_type>video/vc1</Internet_media_type>
<Codec_ID>WMV3</Codec_ID>
<Codec_ID_Info>Windows Media Video 9</Codec_ID_Info>
<Codec_ID_Hint>WMV3</Codec_ID_Hint>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Video 9 - 2-pass VBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09;03</Duration>
<Duration>00:08:09.056 (00:08:09;03)</Duration>
<Bit_rate>5000000</Bit_rate>
<Bit_rate>5 000 kb/s</Bit_rate>
<Width>1920</Width>
<Width>1 920 pixels</Width>
<Height>1080</Height>
<Height>1 080 pixels</Height>
<Pixel_aspect_ratio>1.000</Pixel_aspect_ratio>
<Display_aspect_ratio>1.778</Display_aspect_ratio>
<Display_aspect_ratio>16:9</Display_aspect_ratio>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 (29970/1000) FPS</Frame_rate>
<FrameRate_Num>29970</FrameRate_Num>
<FrameRate_Den>1000</FrameRate_Den>
<Frame_count>14657</Frame_count>
<Color_space>YUV</Color_space>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Bit_depth>8</Bit_depth>
<Bit_depth>8 bits</Bit_depth>
<Scan_type>Progressive</Scan_type>
<Scan_type>Progressive</Scan_type>
<Compression_mode>Lossy</Compression_mode>
<Compression_mode>Lossy</Compression_mode>
<Bits__Pixel_Frame_>0.080</Bits__Pixel_Frame_>
<Stream_size>305660000</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>292 MiB</Stream_size>
<Stream_size>291.5 MiB</Stream_size>
</track>
<track type="Audio">
<Count>280</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Audio</Kind_of_stream>
<Kind_of_stream>Audio</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<ID>2</ID>
<Format>WMA</Format>
<Format>WMA</Format>
<Commercial_name>WMA</Commercial_name>
<Format_version>Version 2</Format_version>
<Codec_ID>161</Codec_ID>
<Codec_ID_Info>Windows Media Audio</Codec_ID_Info>
<Codec_ID_Url>http://www.microsoft.com/windows/windowsmedia/format/codecdownload.aspx</Codec_ID_Url>
<Description_of_the_codec>Windows Media Audio 9 - 128 kbps, 44 kHz, stereo CBR</Description_of_the_codec>
<Duration>489056</Duration>
<Duration>8 min 9 s</Duration>
<Duration>8 min 9 s 56 ms</Duration>
<Duration>8 min 9 s</Duration>
<Duration>00:08:09.056</Duration>
<Duration>00:08:09.056</Duration>
<Bit_rate>128000</Bit_rate>
<Bit_rate>128 kb/s</Bit_rate>
<Channel_s_>2</Channel_s_>
<Channel_s_>2 channels</Channel_s_>
<Sampling_rate>44100</Sampling_rate>
<Sampling_rate>44.1 kHz</Sampling_rate>
<Samples_count>21567370</Samples_count>
<Bit_depth>16</Bit_depth>
<Bit_depth>16 bits</Bit_depth>
<Stream_size>7824896</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7 MiB</Stream_size>
<Stream_size>7.5 MiB</Stream_size>
<Stream_size>7.46 MiB</Stream_size>
<Stream_size>7.462 MiB</Stream_size>
</track>
</File>
</Mediainfo>
")
/var/www/removed/vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
2 Mhor\MediaInfo\Parser\AbstractXmlOutputParser::transformXmlToArray("<?xml version="1.0" encoding="UTF-8"?>
[... remainder of the XML argument, identical to frame 1 above ...]
")
/var/www/removed/vendor/mhor/php-mediainfo/src/Parser/MediaInfoOutputParser.php:22
Please use the argument -v to see more details.
@Fossil01 Thanks for reporting this issue. An old pull request considered removing utf8_encode to solve a "bug".
Could you try removing this call and see if that solves the problem? If not, I will try to fix this issue this weekend.
@mhor nope same thing happens if I remove those 3 lines.
Thanks for your quick answer, so it's definitely related to the XML string returned by mediainfo.
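One possible reading of the garbled `<Copyright>` value, offered as a hypothesis rather than a confirmed diagnosis: the characters look like UTF-16 text whose code units were read with the wrong byte order, and U+FFFE would then be a byte-swapped BOM (U+FEFF). The sketch below swaps the byte order of a few of the reported characters; `swapUtf16ByteOrder` is a made-up helper name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: reinterpret text whose UTF-16 code units were
// read with the opposite byte order. Requires the mbstring extension.
function swapUtf16ByteOrder(string $utf8): string
{
    // Serialize as big-endian code units, then decode the same bytes as little-endian.
    $bytes = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

// U+A900 U+5200 U+6F00 U+6E00: four of the characters from the reported value.
$garbled = "\u{A900}\u{5200}\u{6F00}\u{6E00}";

echo swapUtf16ByteOrder($garbled), PHP_EOL;
```

If that reading is right, the tag in the media file is plain Latin text and the invalid character comes from how the tag bytes were decoded, not from the text itself.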
This looks like an acceptable solution to me. I will try to implement it as soon as possible, but if you want, feel free to open a pull request with your solution and I will be happy to review it.
I'll have a crack at it after Christmas. Cheers.
Completely forgot about this. It seems to work now.
Looks like I am still having this issue.
ErrorException
simplexml_load_string(): Entity: line 10: parser error : Char 0xFFFE out of allowed range
at vendor/mhor/php-mediainfo/src/Parser/AbstractXmlOutputParser.php:18
14| if (mb_detect_encoding($xmlString, 'UTF-8', true) === false) {
15| $xmlString = utf8_encode($xmlString);
16| }
17|
> 18| $xml = simplexml_load_string($xmlString);
19| $json = json_encode($xml);
20|
21| return json_decode($json, true);
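The failure above can probably be reproduced without MediaInfo at all. This is a minimal sketch, assuming the only problem is a UTF-8 encoded U+FFFE inside element content; the XML document here is made up for illustration and is not real MediaInfo output.

```php
<?php
// Collect libxml errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// "\xEF\xBF\xBE" is the UTF-8 encoding of U+FFFE, which XML 1.0 does not allow.
$xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
    . "<File><Copyright>\xEF\xBF\xBE Ron Harris</Copyright></File>";

$xml = simplexml_load_string($xmlString);

if ($xml === false) {
    // Print whatever libxml reported, with the line it was found on.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
    }
    libxml_clear_errors();
}
```

With internal errors enabled, `simplexml_load_string()` returns `false` instead of raising a warning, so a framework that converts warnings into an ErrorException no longer aborts the process.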
Maybe we can use a function like this to strip out invalid chars:
https://stackoverflow.com/a/3466049
Aha. It looks like aca1198 never made it into the master branch and thus into a release.
When I add these lines it seems to fix the issue too:
$xmlString = preg_replace(
'/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
"\xEF\xBF\xBD",
$xmlString
);
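For anyone who wants to try this outside the vendor directory, here is the same replacement wrapped in a function and followed by a parse. It is only a sketch: `sanitizeXmlString` is a name made up for this example, not part of php-mediainfo, and the sample XML is invented.

```php
<?php
// Hypothetical wrapper; the pattern and replacement are the ones quoted above.
function sanitizeXmlString(string $xmlString): string
{
    // Replace control characters that XML 1.0 forbids, UTF-8 encoded
    // surrogates, and U+FFFE/U+FFFF with U+FFFD (REPLACEMENT CHARACTER).
    $clean = preg_replace(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
        "\xEF\xBF\xBD",
        $xmlString
    );

    // preg_replace() returns null on failure; fall back to the untouched input.
    return $clean ?? $xmlString;
}

// Invented sample containing a UTF-8 encoded U+FFFE.
$xmlString = "<File><Copyright>\xEF\xBF\xBE Ron Harris</Copyright></File>";

$xml = simplexml_load_string(sanitizeXmlString($xmlString));
echo (string) $xml->Copyright, PHP_EOL;
```

The pattern is applied byte-wise (no `u` modifier), so it also works on input that is not valid UTF-8. Note that it hides the bad character rather than repairing the text around it.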
XML it fails on currently:
<?xml version="1.0" encoding="UTF-8"?>
<Mediainfo version="20.03">
<File>
<track type="General">
<Count>331</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>General</Kind_of_stream>
<Kind_of_stream>General</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<Complete_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151/160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</Complete_name>
<Folder_name>/mnt/ramdisk/1/15f4594a-c211-4acc-9f58-cae2b09c8151</Folder_name>
<File_name_extension>160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D].mkv</File_name_extension>
<File_name>160095_[� Kuro Ookami ] Pet Life [ISO DVD-RIP 1920x1080 x264 10bits AC-3] [69A9399D]</File_name>
<File_extension>mkv</File_extension>
<File_size>1048394</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 024 KiB</File_size>
<File_size>1 023.8 KiB</File_size>
<Stream_size>1048394</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 024 KiB</Stream_size>
<Stream_size>1 023.8 KiB</Stream_size>
<Stream_size>1 024 KiB (100%)</Stream_size>
<Proportion_of_this_stream>1.00000</Proportion_of_this_stream>
<File_last_modification_date>UTC 2021-10-13 08:47:31</File_last_modification_date>
<File_last_modification_date__local_>2021-10-13 10:47:31</File_last_modification_date__local_>
</track>
</File>
</Mediainfo>
I'll have a look this week, thanks. In the meantime I manually edited the file in the vendor dir and added that preg_replace I pasted here before as an ugly temp fix :-)
Closed for now, due to inactivity.