prof18/RSS-Parser

Illegal characters in RSS XML code crash the library

gregko opened this issue · 1 comment

Describe the bug
If the XML of an RSS feed contains illegal characters, for example raw < and > instead of &lt; and &gt;, the library crashes with an XmlPullParserException:

org.xmlpull.v1.XmlPullParserException: END_TAG expected (position:START_TAG <p data-pm-slice='1 1 []'>@20:31 in java.io.InputStreamReader@6a46b11)

I'm not sure how best to deal with this. Would something like HTML Tidy, which also has an XML mode, work? Is there any other good "tidy" parser for XML?

The link of the RSS Feed
https://www.uen.org/feeds/rss/news.xml.php

This RSS feed has explicit HTML tags inside <description>...</description>, for example:

    <description>
    <p data-pm-slice="1 1 []">UEN is proud to join the Governor's Native American Summit on August 6th at UVU. We will be promoting FNX and hosting two film screenings during the summit. Swing by our booth to learn more!</p> </description>

Hi,
First of all, sorry for the late response.
The library assumes that the provided feed is correct. The description content should be wrapped in a CDATA block to avoid this problem.

Unfortunately, I can't check for unescaped characters on the library side: this is a corner case, and running the check on every feed would hurt performance.

In your case, I would probably use a regex to find the description block and then use StringEscapeUtils from Apache Commons to escape the offending characters.
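
For illustration, here is a minimal sketch of that kind of pre-processing step, run on the raw feed text before it reaches the parser. It is not part of the library. To avoid the Apache Commons dependency, it swaps escaping for the CDATA wrapping mentioned above. The class name `FeedSanitizer` and the method `wrapDescriptions` are hypothetical, and the sketch assumes `<description>` tags carry no attributes and are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RSS-Parser.
final class FeedSanitizer {

    // Matches each <description>...</description> element; DOTALL lets the
    // body span several lines. Assumes the tag has no attributes.
    private static final Pattern DESCRIPTION =
            Pattern.compile("(<description>)(.*?)(</description>)", Pattern.DOTALL);

    // Wraps description bodies that contain raw markup in a CDATA section.
    static String wrapDescriptions(String xml) {
        Matcher m = DESCRIPTION.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(2);
            boolean alreadyCdata = body.trim().startsWith("<![CDATA[");
            boolean hasRawMarkup = body.indexOf('<') >= 0;
            String fixed = body;
            if (!alreadyCdata && hasRawMarkup) {
                // A literal "]]>" would end the CDATA section early, so split it
                // across two adjacent sections.
                fixed = "<![CDATA[" + body.replace("]]>", "]]]]><![CDATA[>") + "]]>";
            }
            // quoteReplacement keeps '$' and '\' in the feed text literal.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + fixed + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String feed = "<item><description><p>Hello</p> </description></item>";
        System.out.println(wrapDescriptions(feed));
    }
}
```

One caveat: a body that mixes raw tags with entities such as &amp; will keep those entities as literal text once it is inside CDATA, so escaping only the raw characters may suit such feeds better.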