jeff-hughes/shellcaster

Feed content parsing truncation (podcast.hernancattaneo.com)

Closed this issue · 1 comments

$ pacman -Q shellcaster 
shellcaster 1.0.0-1

The feed http://podcast.hernancattaneo.com/feed.xml has embedded <br> elements and the like, which seem to confuse the parser occasionally. Sometimes it works and sometimes it doesn't, so it's probably very specific elements tripping it up. The latest podcast description gets cut off after the second track, but the next one looks OK, e.g.:

The "Download..." line visible in the last screenshot is usually the final line of every entry, so the first two screenshots show the truncated behaviour.

Screenshot at 2020-08-15 07-59-33

Screenshot at 2020-08-15 07-59-40

Screenshot at 2020-08-15 07-59-44

Thanks for flagging this. Taking a look at the XML file, I can immediately see what the issue is: the parser is choking on the ampersands (&). In HTML/XML, an ampersand can designate the start of a special entity (e.g., &ldquo; and &rdquo; give you curly double quotes). The problem is that, depending on what program you use to create a podcast, it could be inserting the special character directly, or using these HTML entity codes, or perhaps even some mix of the two. Browsers have gotten much more flexible with Unicode characters over the years, but that flexibility means HTML parsing ends up being a pain.

Anyway, shortly before releasing v1.0 I added a small library to parse these HTML entity codes and convert them to their Unicode equivalents, but I'll admit I didn't test that library as much as I should have. It looks like it's choking on plain ampersands that don't designate the start of an entity. A quick fix should resolve most of this behaviour, and I can likely get it out within the next week, so keep an eye out for a v1.0.1. I'll have to think about whether to add a more complete solution to catch more of the edge cases where you have a mix of direct Unicode and entity codes. There are a couple of other little bugs that people have pointed out, so I will likely round all of those up for the patch release.
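To illustrate the idea behind the quick fix, here is a minimal sketch of an entity decoder that leaves plain ampersands alone. This is not shellcaster's actual code or the library it uses: the function names `decode_entities` and `decode_name`, the 10-byte limit on entity length, and the tiny table of named entities are all assumptions made for the example.

```rust
/// Decode a handful of HTML entities in `input`.
/// Any `&` that does not start a recognised entity is copied through as-is.
/// (Hypothetical helper for illustration; not part of shellcaster.)
fn decode_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(amp) = rest.find('&') {
        // Copy everything before the ampersand unchanged.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];

        // Treat this as an entity only if a ';' follows within a short
        // distance (assumed limit: 10 bytes) and the name is recognised.
        let decoded = after
            .find(';')
            .filter(|&end| end > 0 && end <= 10)
            .and_then(|end| decode_name(&after[..end]).map(|ch| (ch, end)));

        match decoded {
            Some((ch, end)) => {
                out.push(ch);
                rest = &after[end + 1..];
            }
            None => {
                // Plain ampersand: keep it and continue scanning after it.
                out.push('&');
                rest = after;
            }
        }
    }

    out.push_str(rest);
    out
}

/// Map an entity name (the text between '&' and ';') to a character.
/// Handles numeric forms like `#233` and `#xE9` plus a few named entities.
fn decode_name(name: &str) -> Option<char> {
    if let Some(num) = name.strip_prefix('#') {
        let (digits, radix) = match num.strip_prefix('x').or_else(|| num.strip_prefix('X')) {
            Some(hex) => (hex, 16),
            None => (num, 10),
        };
        // Reject signs and other stray characters before parsing.
        if !digits.chars().all(|c| c.is_digit(radix)) {
            return None;
        }
        return u32::from_str_radix(digits, radix)
            .ok()
            .and_then(std::char::from_u32);
    }

    // Deliberately incomplete table; a real decoder would cover far more names.
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "ldquo" => Some('\u{201C}'),
        "rdquo" => Some('\u{201D}'),
        _ => None,
    }
}
```

With this approach, a description such as `A & B &amp; C` comes out as `A & B & C` instead of being cut off at the first bare ampersand, and unrecognised sequences like `&bogus;` are passed through verbatim rather than dropped.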