danmactough/node-feedparser

How to handle html from the `Author` field?

chuanqisun opened this issue · 1 comments

Problem feed meta:
image

In the feed item, the author field contains HTML:
image

The parser strips the entire `<a>` tag from the `author` property in the output:
image

The `rss:author` property has some additional information, but I think it's difficult to write generalized extraction logic, as the structure can differ from feed to feed:
image

I wonder if there is an easy way to just get the plaintext within the Author field by Preston So.

Thanks!

That feed is not valid: https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Falistapart.com%2Fmain%2Ffeed%2F

This is a sad but common problem when parsing feeds. Feedparser doesn't have an opinion about how you should handle invalid feeds. Everyone kind of needs to figure that out for themselves, given the goals of the project they're working on.

> I wonder if there is an easy way to just get the plaintext within the Author field by Preston So

For this specific workaround, the `#` property contains the plain-text parts of the original feed item. So you would need to recursively walk the `rss:author` property to pull out the `#` properties, then join them together with a space.
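
A minimal sketch of that recursive walk might look like the following. The helper names `collectText` and `plainText` are hypothetical (they are not part of feedparser), and the shape of the sample `rssAuthor` object is an assumption based on the description above: attributes under `@`, text under `#`, and child elements under their tag names. Real feeds may nest differently, so treat this as a starting point rather than a definitive implementation.

```javascript
// Hypothetical helper (not part of feedparser): gather every "#" text value
// found in a parsed node and in all of its descendants.
function collectText(node) {
  if (node === null || typeof node !== 'object') return [];
  // Repeated child elements may show up as arrays; walk each entry.
  if (Array.isArray(node)) return node.flatMap(collectText);

  const parts = [];
  for (const [key, value] of Object.entries(node)) {
    if (key === '@') continue; // attributes, not text content
    if (key === '#') {
      // Keep only non-empty text.
      if (typeof value === 'string' && value.trim()) parts.push(value.trim());
    } else {
      // Any other key is assumed to be a child element.
      parts.push(...collectText(value));
    }
  }
  return parts;
}

// Hypothetical helper: join the collected text parts with a space.
function plainText(node) {
  return collectText(node).join(' ');
}

// Assumed shape of item['rss:author'] for an author field containing a link.
const rssAuthor = {
  '@': {},
  a: { '@': { href: 'https://example.com/author' }, '#': 'Preston So' },
};

console.log(plainText(rssAuthor)); // prints "Preston So"
```

In a real handler you would call something like `plainText(item['rss:author'])` and fall back to `item.author` when the result is empty. Note that text parts come out in property order, which may not match the original document order for mixed content.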