Update article scraper
Closed this issue · 4 comments
Bloomberg still flags the scraper as a robot
some other websites don't return an article title
Dragnet looks promising
https://github.com/dragnet-org/dragnet
consider layering scrapers, as they might have different strengths, and then falling back to an API
https://mercury.postlight.com/web-parser/
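A minimal sketch of the layering idea, assuming each scraper can be wrapped as a function that takes a URL and returns a dict (or None). The three extractors here are hypothetical stand-ins, not real library calls; in practice they would wrap something like goose3, Dragnet, or the API fallback.

```python
from typing import Callable, Optional, Sequence

# An extractor takes a URL and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(url: str, extractors: Sequence[Extractor]) -> Optional[dict]:
    """Try each extractor in order; return the first result that has a title."""
    for extractor in extractors:
        try:
            result = extractor(url)
        except Exception:
            # One scraper failing should not stop the chain.
            continue
        if result and result.get("title"):
            return result
    return None


# Hypothetical stand-ins that imitate the two failure modes in this issue.
def blocked_scraper(url: str) -> Optional[dict]:
    raise RuntimeError("blocked as robot")


def titleless_scraper(url: str) -> Optional[dict]:
    return {"title": "", "text": "body only"}


def api_fallback(url: str) -> Optional[dict]:
    return {"title": "Example headline", "text": "body"}


result = extract_with_fallback(
    "https://example.com/article",
    [blocked_scraper, titleless_scraper, api_fallback],
)
print(result)
```

Ordering the list from cheapest to most expensive keeps the API as a last resort.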
Leads on Bloomberg:
https://andrejgajdos.com/how-to-create-a-link-preview/
add to request headers: Content-Type, DNT, Origin, and Referer
https://www.scraperapi.com/blog/5-tips-for-web-scraping
contact Bloomberg directly
requests-html with user agent
https://www.pluralsight.com/guides/advanced-web-scraping-tactics-python-playbook
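A rough sketch of the header idea from the leads above. The function name, the header values, and the placeholder user agent string are all assumptions for illustration; whether this is enough to get past Bloomberg's bot detection is untested.

```python
from urllib.parse import urlsplit


def build_headers(url: str, user_agent: str) -> dict:
    """Build browser-like request headers for the given article URL."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return {
        "User-Agent": user_agent,
        "Content-Type": "text/html; charset=utf-8",
        "DNT": "1",
        "Origin": origin,
        # Assumption: pretend we navigated from the site's own front page.
        "Referer": origin + "/",
    }


headers = build_headers(
    "https://www.bloomberg.com/news/articles/example",
    "Mozilla/5.0 (placeholder user agent)",
)
print(headers)
```

The resulting dict can be passed as the `headers` argument of whichever HTTP client the scraper uses, for example `requests.get(url, headers=headers)`.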
Live Preview update:
https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254
oEmbed
Twitter Card/Facebook Open Graph
meta tags
goose3 title, image, etc.
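For the meta-tag route, a minimal sketch using only the standard library. It assumes the page HTML has already been fetched; the class and function names and the sample HTML are made up for illustration. It prefers Open Graph tags, then Twitter Card tags, then plain meta tags.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        # Keep the first occurrence of each key.
        if key and content and key not in self.meta:
            self.meta[key] = content


def preview_from_html(html: str) -> dict:
    """Build link-preview fields from Open Graph, Twitter Card, or plain meta tags."""
    parser = MetaTagParser()
    parser.feed(html)
    m = parser.meta
    return {
        "title": m.get("og:title") or m.get("twitter:title"),
        "image": m.get("og:image") or m.get("twitter:image"),
        "description": (
            m.get("og:description")
            or m.get("twitter:description")
            or m.get("description")
        ),
    }


SAMPLE = """<html><head>
<meta property="og:title" content="Example headline">
<meta name="twitter:image" content="https://example.com/a.png">
<meta name="description" content="Plain description">
</head><body></body></html>"""

preview = preview_from_html(SAMPLE)
print(preview)
```

Fields that come back as None are the ones a second extractor such as goose3 would need to fill in.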
Using cached data for now. Will keep looking at the other options.