meyt/linkpreview

Provide the User-Agent header by default

Closed this issue · 7 comments

This is the test script I'm using:

from linkpreview import link_preview

preview = link_preview("https://www.tiktok.com/@dtlawson16/video/6766598342848941318", parser="lxml")
print("title:", preview.title)
print("description:", preview.description)
print("image:", preview.image)
print("force_title:", preview.force_title)
print("absolute_image:", preview.absolute_image)
print("site_name:", preview.site_name)

When I run it, it sits there for about 20 seconds and then raises a TimeoutError. But if a User-Agent header such as Mozilla/5.0 were provided by default, running the script would output the following:

title: Pranking my grandma ๐Ÿ˜#foryou #fyp | TikTok
description: 1.7M Likes, 11.8K Comments. TikTok video from David Lawson (@dtlawson16): "Pranking my grandma ๐Ÿ˜#foryou #fyp".  original sound - David Lawson.
image: https://lf16-tiktok-web.tiktokcdn-us.com/obj/tiktok-web-tx/tiktok/webapp/main/webapp-desktop/045b2fc7c278b9a30dd0.png
force_title: Pranking my grandma ๐Ÿ˜#foryou #fyp | TikTok
absolute_image: https://lf16-tiktok-web.tiktokcdn-us.com/obj/tiktok-web-tx/tiktok/webapp/main/webapp-desktop/045b2fc7c278b9a30dd0.png
site_name: TikTok

I realize that I can do this myself by following the "Advanced" section of the README, but I think this would be a good thing to have built into linkpreview since there are a lot of sites set up to reject requests that have a User-Agent header with a value like python-requests/2.28.1.
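
For reference, the manual workaround looks roughly like this sketch: fetch the page yourself with requests and a browser-like User-Agent, then build the preview from the HTML with Link and LinkPreview (the exact User-Agent value here is just an illustration):

import requests
from linkpreview import Link, LinkPreview

url = "https://www.tiktok.com/@dtlawson16/video/6766598342848941318"
# Any browser-like value avoids the default python-requests/x.y.z User-Agent
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=20)
response.raise_for_status()

preview = LinkPreview(Link(url, response.text), parser="lxml")
print("title:", preview.title)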

meyt commented

@ataylor32 Thanks for the suggestion; default headers are now implemented, but your example still fails with a TimeoutError, because TikTok uses HTTP/2, which is not supported by Requests, unless you set the x-requested-with: XMLHttpRequest header.
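
Setting that header with plain requests looks roughly like this (a sketch; whether it is enough for TikTok in practice may vary):

import requests

url = "https://www.tiktok.com/@dtlawson16/video/6766598342848941318"
headers = {
    "user-agent": "Mozilla/5.0",
    # The header mentioned above; without it the request hangs until timeout.
    "x-requested-with": "XMLHttpRequest",
}
response = requests.get(url, headers=headers, timeout=20)
print(response.status_code)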

Thank you! I just ran my example script (the one with the TikTok URL) using linkpreview 0.6.0 and it worked. I'm not sure why you got a TimeoutError and I didn't.

Am I misunderstanding something here? I get …

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three

I switched to using …

from colorama import Fore
from linkpreview import Link, LinkGrabber, LinkPreview

# The URL that returns the 403 above
URL = "https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three"

grabber = LinkGrabber(
    initial_timeout=20,
    maxsize=1048576,
    receive_timeout=10,
    chunk_size=1024,
)
content, URL = grabber.get_content(URL)
link = Link(URL, content)
preview = LinkPreview(link, parser="lxml")
print("title:", Fore.GREEN + preview.title + Fore.WHITE)

I can see a header appears to be set by default in grabber.py.

Yet I still get the error. Am I missing something?

meyt commented

@Michael-Z-Freeman

Yet I still get the error. Am I missing something?

No, the 403 comes from Cloudflare.

Now v0.9.0 is released with better headers support. In your case, use this:

content, URL = grabber.get_content(URL, headers="imessagebot")
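
Put together with the grabber settings from your snippet, the whole flow is roughly this (a sketch; "imessagebot" is one of the bundled header presets):

from linkpreview import Link, LinkGrabber, LinkPreview

URL = "https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three"

grabber = LinkGrabber(
    initial_timeout=20,
    maxsize=1048576,
    receive_timeout=10,
    chunk_size=1024,
)
# "imessagebot" selects the bundled header preset mentioned above (v0.9.0+)
content, URL = grabber.get_content(URL, headers="imessagebot")
preview = LinkPreview(Link(URL, content), parser="lxml")
print("title:", preview.title)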

OK, thanks. However, as I found, headers alone do not solve 403s. I ended up using Microsoft Playwright to do the grabber part and it works great! See https://github.com/Michael-Z-Freeman/word-link-preview
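
The shape of that approach is roughly the following sketch (the real code lives in the repository above): render the page in a headless browser, then hand the resulting HTML to Link/LinkPreview.

from playwright.sync_api import sync_playwright
from linkpreview import Link, LinkPreview

url = "https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three"

with sync_playwright() as p:
    # A real browser engine speaks HTTP/2 and gets past checks that
    # block plain python-requests clients.
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    html = page.content()
    browser.close()

preview = LinkPreview(Link(url, html), parser="lxml")
print("title:", preview.title)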

Hello! Maybe this is another issue/question, but is it possible to parse something behind Cloudflare?

This is an example of such a URL:

$ curl -I https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdev.14129
HTTP/2 403
...
server: cloudflare

meyt commented

@pothitos Hi, maybe you should try something like https://github.com/FlareSolverr/FlareSolverr.
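
A rough sketch of how that could be wired up, assuming a FlareSolverr instance running locally on its default port and v1 endpoint, with the resulting HTML fed to Link/LinkPreview as before:

import requests
from linkpreview import Link, LinkPreview

url = "https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdev.14129"

# FlareSolverr runs as a separate service (default: http://localhost:8191/v1)
# and solves the Cloudflare challenge in a real browser on our behalf.
resp = requests.post(
    "http://localhost:8191/v1",
    json={"cmd": "request.get", "url": url, "maxTimeout": 60000},
    timeout=70,
)
resp.raise_for_status()
html = resp.json()["solution"]["response"]

preview = LinkPreview(Link(url, html), parser="lxml")
print("title:", preview.title)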