How to handle HTML from the `Author` field?
chuanqisun opened this issue · 1 comments
Before submitting your issue, please make sure these boxes are checked. Thank you!
- Review the compressed example. I tried but the URL is broken.
- FeedParser@2.2.10
- Node@14.16.1
- Problem feed: https://alistapart.com/main/feed/
In the feed item, the author field contains HTML:
The parser strips the entire `<a>` tag from the `author` property in the output.
The `rss:author` property has some additional information, but I think it's difficult to write generalized extraction logic because the structure can differ from feed to feed.
I wonder if there is an easy way to just get the plain text within the Author field, `by Preston So`.
Thanks!
That feed is not valid: https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Falistapart.com%2Fmain%2Ffeed%2F This is a sad but common problem when parsing feeds. Feedparser doesn't have an opinion about how you should handle invalid feeds. Everyone needs to figure that out for themselves, given the goals of the project they're working on.
> I wonder if there is an easy way to just get the plaintext within the Author field by Preston So
For this specific workaround, the `#` property contains the plain text parts of the original feed item. So, you would need to recursively parse the `rss:author` property to pull out the `#` properties, then join them together with a space.
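The workaround described above could be sketched roughly as follows. This is only an illustration, not part of feedparser: the helper names `collectText` and `authorText` are made up here, and the shape of `sampleItem` is an assumed stand-in for what the parsed item might look like, with `@` holding attributes and `#` holding text. The real structure for a given feed may differ, so inspect your own parsed item first.

```javascript
// Recursively collect every '#' text value from a feedparser-style node.
// Attribute objects stored under '@' are skipped.
function collectText(node, parts = []) {
  if (Array.isArray(node)) {
    for (const child of node) collectText(child, parts);
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === '@') continue; // attributes, not text content
      if (key === '#') {
        // keep only non-empty text fragments
        if (typeof value === 'string' && value.trim()) parts.push(value.trim());
      } else {
        collectText(value, parts); // descend into child elements
      }
    }
  }
  return parts;
}

// Join the collected fragments with a space, as suggested in the thread.
function authorText(item) {
  return collectText(item['rss:author']).join(' ');
}

// Hypothetical item shape, for illustration only.
const sampleItem = {
  author: null,
  'rss:author': {
    '@': {},
    '#': 'by',
    a: { '@': { href: 'https://example.com/author/' }, '#': 'Preston So' },
  },
};

console.log(authorText(sampleItem)); // by Preston So
```

Fragments are joined in the order the keys appear on the parsed object, which may not match the order of text in the original markup when text and child elements are interleaved, so treat the result as a best effort.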