How to handle html from the `Author` field?

Question

chuanqisun opened this issue 4 years ago · 1 comment

chuanqisun commented 4 years ago

Before submitting your issue, please make sure these boxes are checked. Thank you!

Review the compressed example.
I tried but the URL is broken.
FeedParser@2.2.10
Node@14.16.1
Problem feed: https://alistapart.com/main/feed/

Problem feed meta:

In the feed item, the author field contains HTML:

The parser strips the entire `<a>` tag from the author property in the output.

The `rss:author` property has some additional information, but I think it's difficult to write generalized extraction logic, as the structure can differ from feed to feed.

I wonder if there is an easy way to get just the plain text within the author field: `by Preston So`.

Thanks!

Answer 1 · 2021-05-08T20:39:36.000Z

That feed is not valid (see https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Falistapart.com%2Fmain%2Ffeed%2F). This is a sad but common problem when parsing feeds. Feedparser doesn't have an opinion about how you should handle invalid feeds; everyone needs to figure that out for themselves, given the goals of the project they're working on.

> I wonder if there is an easy way to just get the plaintext within the Author field by Preston So

For this specific workaround, the `#` property contains the plain-text parts of the original feed item. So you would need to recursively walk the `rss:author` property, pull out the `#` values, and join them together with a space.
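
A minimal sketch of that recursive walk, in case it helps. The helper names `collectText` and `plainText` are made up for illustration, and the `sampleAuthor` object is a hypothetical shape, not the actual output for this feed. It assumes text lives under `#`, attributes under `@`, and child elements under their tag names; inspect the real `item['rss:author']` before relying on it.

```javascript
// Recursively collect every '#' (text) value found in a parsed node.
// Assumption: '@' holds attributes, which are skipped.
function collectText(node, parts = []) {
  if (Array.isArray(node)) {
    // Repeated child elements may show up as arrays.
    for (const child of node) collectText(child, parts);
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === '#') {
        if (typeof value === 'string' && value.trim() !== '') {
          parts.push(value.trim());
        }
      } else if (key !== '@') {
        collectText(value, parts);
      }
    }
  }
  return parts;
}

// Join the collected text parts with a single space.
function plainText(node) {
  return collectText(node).join(' ');
}

// Hypothetical stand-in for item['rss:author']; the real structure may differ.
const sampleAuthor = {
  '@': {},
  '#': 'by',
  a: { '@': { href: 'https://example.com/author/' }, '#': 'Preston So' },
};

console.log(plainText(sampleAuthor)); // prints "by Preston So"
```

Note that the parts come out in object key order, which is not guaranteed to match the order of the text in the original markup, so the result may need a sanity check for feeds with more complex author fields.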