A recording that contains an invalid UTF-8 byte breaks the display of itself and of surrounding records
Closed this issue · 12 comments
Hello, as discussed in gitter:
The initial description was wrong because the root cause had not been identified yet. See the second update for a better description.
The root cause turned out to be different from what I initially believed.
It is not the number of records that causes the problem, but a special character in either Title/Description or filename.
In my case, a special character that came from the EPG caused a file to be called:
Mar 23 13:55:00 localhost tvheadend[27006]: dvr: /recordings/БНТ-1-И-в-Рая-има-Ад-(80-години-от-рождението-на-Атанас-Киряк<D0>-2019-03-23_14-00.mkv
from adapter: ...
By cleaning up the special character from the filename, the record in question and all the ones around it that had the problem were fixed.
Steps to reproduce:
Prereqs: You need a few recordings whose names contain non-ASCII characters. Any multi-byte UTF-8 text works, for example Cyrillic letters.
1.) Make a new recording, let it finish
2.) Rename the recording's file while tvheadend is running:
cd into the recording folder and run:
mv originalFile.mkv originalFileUpdated-`echo D0|xxd -r -p`.mkv
3.) Open the iOS app and observe that the issue is reproduced.
How is it expected to work:
If this breaks, it should break only the encoding of the item that has the special character, not the items around it (currently 20 items in total are affected). If possible, it should just fail to display the single character, in this case '�'.
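The expected behaviour can be illustrated outside the app. What follows is a minimal Python sketch, not the app's code: the titles and the payload layout are made up, and the stray byte is appended to one title only. Decoding as UTF-8 with replacement leaves the neighbouring titles intact and marks just the bad byte with U+FFFD.

```python
# Hypothetical payload: three record titles joined by newlines.
# The second title ends in a lone 0xD0 lead byte, like the renamed file above.
titles = [
    "Новини".encode("utf-8"),
    "Киряк".encode("utf-8") + b"\xd0",
    "Спорт".encode("utf-8"),
]
payload = b"\n".join(titles)

# errors="replace" substitutes U+FFFD for each undecodable byte
# instead of rejecting the whole payload.
decoded = payload.decode("utf-8", errors="replace").split("\n")

for title in decoded:
    print(title)
```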
Cheers,
-N
Can you send me a file with the data returned by the call to api/dvr/entry/grid_finished
that contains the D0 character?
That way I might be able to add this to the unit tests. If you can, also include the headers of the HTTP reply you received.
OK, after further investigation I understand one thing: 0xD0 is not the problematic byte, 0xB5 is.
0xB5 is the problematic byte because on its own it is not a valid UTF-8 sequence: it is a continuation byte, which is only valid after a lead byte.
The issue is simple: because of the stray B5 byte, this file is not valid UTF-8, so reading it as UTF-8 fails. Remove the incorrect B5 byte and the file is fully readable.
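For anyone who wants to find the offending byte in their own data, here is a rough Python sketch of one way to do it. The function name and the sample bytes are made up for illustration; the sample only imitates the situation described above, with 0xB5 appearing where a lead byte is expected.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset where strict decoding gave up.
        return err.start, data[err.start]
    return None


# Four two-byte Cyrillic letters, then a stray continuation byte.
sample = "Киря".encode("utf-8") + b"\xb5" + b".mkv"
print(first_invalid_utf8(sample))
```

The offset can then be used to inspect or remove the byte by hand, as was done with the file in this thread.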
I don't know how to fix this, because I can't simply strip bytes from UTF-8 encoded data when I have no idea whether the data is really supposed to be UTF-8. Let me try to explain: some users have badly misconfigured encoding settings (either through lack of knowledge or because their satellite provider encodes with some other charset and declares it as UTF-8). Because of this, I have a little hack at https://github.com/zipleen/tvheadend-ios-lib/blob/master/tvheadend-ios-lib/TVHJsonUTF8AutoCharsetResponseSerializer.m#L34 that attempts to read the response as UTF-8 and, if that fails, tries the remaining encodings in turn to see whether one of them can read it. This actually "automatically fixes" the issue for those users, because the encoding is eventually "found" and everything works out.
The issue with your example is that the UTF-8 decoding fails (because B5 is not valid UTF-8 there) and the next encoding (Latin-1) succeeds, which gives your awesome gibberish.
Unfortunately I can't go clean "invalid utf8 characters" from the source, because those could be valid ISO-8859-1 characters.
Therefore I don't know how to fix this. I also noticed that the "auto" conversion effectively only works for NSUTF8StringEncoding and NSISOLatin1StringEncoding, because as soon as Latin-1 is attempted, it normally succeeds (producing garbage).
Any help fixing this would be appreciated, but I would file this under "fix charset in tvheadend", or figure out a way to clean up the broken EPG data that's feeding in invalid characters. Thoughts?
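The fallback behaviour described here can be reproduced in a few lines. This is a Python sketch of the general technique, assuming a made-up list of candidate encodings, and is not a port of the Objective-C serializer. It shows why one stray byte turns a whole payload into Latin-1 gibberish, and why any encoding listed after Latin-1 is never reached.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1", "cp1251")):
    """Try each encoding in order; return (text, name of the one that worked)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


good = "Новини".encode("utf-8")
# Same valid title plus a second title that carries a stray 0xB5 byte.
bad = good + b"\n" + "Киряк".encode("utf-8") + b"\xb5"

print(decode_with_fallback(good))
print(decode_with_fallback(bad))

# Latin-1 maps every one of the 256 byte values to a character,
# so it accepts any input and the loop never gets to cp1251.
all_bytes = bytes(range(256)).decode("latin-1")
```

In the second call the valid title is mangled too, because the payload is decoded as a single unit.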
[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding]
This Objective-C code fails with your file as input (it contains the 0xB5 byte), and I have no idea how to clean up this data.
If you want something interesting, open your file in Safari: it will show up as a garbled Latin-1 mess. Open your file in Firefox and, magically, Firefox will ignore that single 0xB5 byte but still interpret the file as UTF-8.
This should mean that if you open tvheadend's web UI in Safari, it will display garbage data. Open it in Firefox and it should display only that one broken character (untested).
Also, this 0xB5 byte makes the JSON invalid, so in JavaScript land you shouldn't be able to create this byte. It is only possible because tvheadend is creating it incorrectly, which some browsers might ignore but others might completely break on.
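A strict JSON parser shows the same thing. The sketch below uses Python's standard `json` module on a made-up one-field document. It is only meant to illustrate that a parser expecting UTF-8 rejects the stray byte before it even looks at the JSON structure.

```python
import json

valid = '{"title": "Киряк"}'.encode("utf-8")
# Insert a stray 0xB5 just before the closing quote of the string value.
broken = valid[:-2] + b"\xb5" + valid[-2:]

print(json.loads(valid))

try:
    json.loads(broken)
    parsed = True
except UnicodeDecodeError:
    # json.loads decodes bytes input strictly before parsing it.
    parsed = False

print(parsed)
```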
Thanks for the investigation. I can create a ticket with tvheadend about this, though tvheadend allows it and also parses it alright (it knows it has to use UTF-8, as this is a configuration option). It even handles a file with such a character in its name on the local filesystem alright.
I have been using Firefox and Chrome on Linux to access the HTSP services and it does work alright there. I will check out Safari and report.
If you cannot fix this in the iOS app, can you at least avoid breaking the encoding of the whole batch of 19 nearby records? Right now, if such an invalid byte exists, the iOS app makes that record unreadable for the reason you explained, but it also treats the whole batch (19 additional records) as Latin-1 encoded, which is wrong: they are all UTF-8 encoded and contain no invalid UTF-8 bytes.
If the tvheadend API exposes its encoding setting somewhere, perhaps the app can read it and use the defined encoding instead of attempting UTF-8, then Latin-1, and so on? Would that work?
ie: :9981/api/dvr/config/grid
"charset":"UTF8" for the profile that was used ...
Checked on Safari: no issues there either. It just displays the special character as part of the filename, the same way bash does if you 'ls' the file. The overall encoding remains UTF-8.
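The earlier suggestion, reading the charset from api/dvr/config/grid instead of guessing, could look roughly like the Python sketch below. The response shape (an `entries` list with `name` and `charset` fields) and the profile name are assumptions made for illustration; only the `"charset":"UTF8"` field is taken from this thread. No request is made here, the config is a hand-written dictionary.

```python
import codecs


def codec_for_profile(config: dict, profile_name: str, default: str = "utf-8") -> str:
    """Return a Python codec name for the charset of the given DVR profile."""
    for entry in config.get("entries", []):
        if entry.get("name") == profile_name:
            charset = entry.get("charset", default)
            try:
                # Normalises spellings such as "UTF8" to a canonical codec name.
                return codecs.lookup(charset).name
            except LookupError:
                return default
    return default


# Hypothetical, hand-written stand-in for the parsed API response.
sample_config = {"entries": [{"name": "Default profile", "charset": "UTF8"}]}
print(codec_for_profile(sample_config, "Default profile"))
```

With the encoding known up front, a decoding failure could be handled as damaged UTF-8 rather than as a reason to try Latin-1.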
OK, in that case I need help on how to make this call succeed with invalid UTF-8 data:
[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding]
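One general approach is to repair the bytes first, so that the strict decode itself no longer needs to change. The Python sketch below shows that idea with a made-up helper name; it is not Objective-C and has not been tried against the serializer in question. As far as I know, Swift's `String(decoding:as:)` performs a comparable repair on Apple platforms, but whether it fits this code path is untested.

```python
def sanitize_utf8(data: bytes) -> bytes:
    """Return valid UTF-8 bytes, with each undecodable byte turned into U+FFFD."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


raw = "Киряк".encode("utf-8") + b"\xb5"
clean = sanitize_utf8(raw)

# A strict decode of the repaired bytes now succeeds.
text = clean.decode("utf-8")
print(text)
```

Input that is already valid passes through unchanged, so the repair step would only alter damaged payloads.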
Let me know how I can help here.
Cheers,
-N
Hi, it just happened again. An invalid Unicode character from the EPG made its way into a tvheadend filename, which then broke the iOS app list. Since the recording is in Upcoming, only that section is affected this time.
tvheadend log recording with invalid unicode byte in the filename
OK, I will filter out invalid Unicode bytes before uploading to hts, so I should avoid seeing this bug in the future.
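A filter of this kind can be very small. The Python sketch below is only a guess at what such a pre-upload step might look like; the function name is hypothetical and the actual EPG pipeline is not shown in this thread. It drops undecodable bytes instead of replacing them, so nothing odd ends up in the filename.

```python
def strip_invalid_utf8(raw: bytes) -> str:
    """Decode as UTF-8, silently dropping any bytes that are not valid."""
    return raw.decode("utf-8", errors="ignore")


# A title cut off in the middle of a two-byte character, as in the log above.
title = "Киряк".encode("utf-8") + b"\xd0"
print(strip_invalid_utf8(title))
```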
Feel free to close this if a solution is not likely to be easily implemented.
Thanks,
-N
I have added an upstream bug report for this at https://tvheadend.org/issues/5668
Fixed in 4.2.8 according to tvheadend issue