zero-to-mastery/breads-server

Update article scraper

Closed this issue 4 years ago · 4 comments

aubundy commented 4 years ago

Bloomberg still flags the scraper as a robot
Some other websites don't return an article title

aubundy commented 4 years ago

Dragnet looks promising
https://github.com/dragnet-org/dragnet

Consider layering scrapers, since each may have different strengths, and then falling back to an API
https://mercury.postlight.com/web-parser/
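
A rough sketch of the layering idea, assuming each scraper is wrapped in a function that takes a URL and returns a dict (or None). The names `extract_with_fallback`, `is_complete`, and the three stand-in extractors are hypothetical; real wrappers around Dragnet, another scraper, or the API would replace them. A result counts as usable here only if it has a non-empty title, which is an assumption based on the missing-title problem above.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: any function that takes a URL and returns article data.
Extractor = Callable[[str], Optional[Dict[str, str]]]


def is_complete(result: Optional[Dict[str, str]]) -> bool:
    """Treat a result as usable only if it carries a non-empty title."""
    return bool(result and result.get("title"))


def extract_with_fallback(url: str, extractors: List[Extractor]) -> Optional[Dict[str, str]]:
    """Try each extractor in order and return the first usable result."""
    for extract in extractors:
        try:
            result = extract(url)
        except Exception:
            # A blocked or failing scraper should not stop the chain.
            continue
        if is_complete(result):
            return result
    return None


# Stand-ins for real scrapers, used only to show the control flow.
def flaky_scraper(url: str) -> Dict[str, str]:
    raise RuntimeError("blocked as robot")


def titleless_scraper(url: str) -> Dict[str, str]:
    return {"title": "", "text": "body only"}


def api_fallback(url: str) -> Dict[str, str]:
    return {"title": "Example title", "text": "body"}


result = extract_with_fallback(
    "https://example.com/article",
    [flaky_scraper, titleless_scraper, api_fallback],
)
```

Ordering the list from cheapest to most expensive keeps the API as a last resort.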

aubundy commented 4 years ago

Leads on Bloomberg:
https://andrejgajdos.com/how-to-create-a-link-preview/
Add to the request headers: Content-Type, DNT, Origin, and Referer
https://www.scraperapi.com/blog/5-tips-for-web-scraping
contact them
Use requests-html with a custom user agent
https://www.pluralsight.com/guides/advanced-web-scraping-tactics-python-playbook
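
A minimal sketch of the header idea using only the standard library, so nothing is sent over the network here. The function name `build_headers`, the header values, and the example user agent string are all assumptions for illustration; there is no guarantee this set is enough to get past Bloomberg's bot detection.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_headers(url: str, user_agent: str) -> dict:
    """Build browser-like headers for a page request (values are assumptions)."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return {
        "User-Agent": user_agent,
        "Content-Type": "text/html; charset=utf-8",  # assumed value
        "DNT": "1",
        "Origin": origin,
        "Referer": origin + "/",
    }


# Hypothetical user agent string; a real browser's value would go here.
EXAMPLE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

headers = build_headers("https://www.bloomberg.com/news/articles/example", EXAMPLE_UA)

# The same dict can be attached to a urllib Request (not sent in this sketch).
request = Request("https://www.bloomberg.com/news/articles/example", headers=headers)
```

The same dict should also work as the `headers` argument in requests-style clients.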

aubundy commented 4 years ago

Live Preview update:
https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254
oEmbed
Twitter Card/Facebook Open Graph
meta tags
goose3 for title, image, etc.

https://www.linkpreview.net/docs/
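
A sketch of the meta tag approach with the standard library's `html.parser`, assuming the page HTML has already been fetched. It prefers Open Graph tags, then Twitter Card tags, then the plain `<title>` and `description` meta tag. The class `MetaPreviewParser`, the function `build_preview`, and the sample HTML are made up for illustration; oEmbed and goose3 are not covered.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class MetaPreviewParser(HTMLParser):
    """Collect <meta> name/property pairs and the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses "property"; Twitter Cards and plain tags use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def build_preview(html: str) -> Dict[str, Optional[str]]:
    """Return title, image, and description using the fallback order above."""
    parser = MetaPreviewParser()
    parser.feed(html)
    m = parser.meta
    return {
        "title": m.get("og:title") or m.get("twitter:title")
        or parser.title_text.strip() or None,
        "image": m.get("og:image") or m.get("twitter:image"),
        "description": m.get("og:description") or m.get("twitter:description")
        or m.get("description"),
    }


SAMPLE = """<html><head><title>Fallback title</title>
<meta property="og:title" content="Example article">
<meta name="twitter:image" content="https://example.com/a.png">
<meta name="description" content="Plain description">
</head><body></body></html>"""

preview = build_preview(SAMPLE)
```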

aubundy commented 4 years ago

Using cached data for now. Will continue to look at other options.