InetSource throws error on apparently working URL
mfeblowitz opened this issue · 10 comments
Encountered when using InetSource to access an RSS URL - the URL appears to be correct but:
InetSourceStream,InetSource M[InetSourceStream.cpp:checkURI:704] - URL https://www.justice.gov/feeds/justice-news.xml?type=All&component[646]=646 is invalid, "The Uniform Resource Identifier string https://www.justice.gov/feeds/justice-news.xml?type=All&component[646]=646 specified in the URIList parameter contains a syntax error. The Processing Element will shut down now."
Thanks - we’ll give it a try. Not sure whether our code URL-encodes. Also not sure whether InetSource is crashing the PE or our code is. If it's the toolkit, would that be by design?
Looks like the URI checker is rejecting the URL from this regex:
https://github.com/IBMStreams/streamsx.inet/blob/master/com.ibm.streamsx.inet/com.ibm.streamsx.inet/InetSource/URIHelperCpp.cgt#L40
RFC 2396 says use of "[" and "]" is unwise.
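For reference, a minimal Python sketch that lists which characters in a URL fall into the set RFC 2396 (section 2.4.3) calls "unwise". This is not the toolkit's regex from URIHelperCpp.cgt; the name `unwise_chars` is made up for illustration.

```python
# Characters RFC 2396, section 2.4.3, classifies as "unwise" in a URI.
UNWISE = set('{}|\\^[]`')

def unwise_chars(uri):
    """Return the distinct 'unwise' characters found in uri, sorted."""
    return sorted({c for c in uri if c in UNWISE})

url = "https://www.justice.gov/feeds/justice-news.xml?type=All&component[646]=646"
print(unwise_chars(url))
```

For the feed URL above, this reports the two square brackets in the query string.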
Unwise, indeed. Dated 1998. Superseded anywhere, or just ignored?
The InetSource operator terminates the PE if a statically supplied URI fails the URI check.
An invalid URI that is supplied dynamically is logged and ignored; it does not terminate the PE.
> Superseded anywhere, or just ignored?
Not sure "ignored" is correct. It could just be that the browser is "correcting" the URI by encoding it; that's what Chrome did for me.
So, we'll try URL-encoding the RSS feed URL to see whether that makes the difference. If it does, our code should probably do the encoding itself as the URLs go into the source URL file.
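A rough sketch of what that pre-encoding step could look like in Python, assuming only the query string needs fixing and that any existing percent-escapes should be left alone. The helper name `encode_query` is hypothetical, and whether the encoded form passes the InetSource URI check still has to be verified.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_query(url):
    """Percent-encode characters such as '[' and ']' in the query string.

    '=', '&' and '%' are kept as-is so that parameter delimiters and
    already-encoded sequences are not escaped a second time.
    """
    parts = urlsplit(url)
    query = quote(parts.query, safe="=&%")
    return urlunsplit(parts._replace(query=query))

url = "https://www.justice.gov/feeds/justice-news.xml?type=All&component[646]=646"
print(encode_query(url))
```

Because "%" is treated as safe, running the helper on an already-encoded URL leaves it unchanged.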
I'm wondering whether there could be a flag added to the operator, to choose whether bad urls should be ignored or whether the application should terminate.
I'd be happy with an InetSource operator parameter of type 'boolean', maybe named 'terminateOnInvalidURL', with a default value of 'true'.
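To make the proposal concrete, here is a sketch of the intended behaviour in Python. It is not toolkit code: `handle_invalid_uri`, `InvalidURIError`, and the `terminate_on_invalid_url` argument are all hypothetical stand-ins for the suggested 'terminateOnInvalidURL' parameter.

```python
import logging

log = logging.getLogger("inet_source_sketch")

class InvalidURIError(Exception):
    """Stands in for the PE shutting down in this sketch."""

def handle_invalid_uri(uri, terminate_on_invalid_url=True):
    """Log an invalid URI, then either abort or tell the caller to skip it."""
    log.error("URI %s is invalid", uri)
    if terminate_on_invalid_url:
        # Default: same outcome as today for statically supplied URIs.
        raise InvalidURIError(uri)
    # Opt-out: the caller drops this URI and carries on with the rest.
    return False
```

With the default of 'true' existing applications would behave as before; setting it to 'false' would give static URIs the same log-and-ignore treatment that dynamic ones already get.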