Provide the User-Agent header by default
Closed this issue · 7 comments
This is the test script I'm using:
from linkpreview import link_preview
preview = link_preview("https://www.tiktok.com/@dtlawson16/video/6766598342848941318", parser="lxml")
print("title:", preview.title)
print("description:", preview.description)
print("image:", preview.image)
print("force_title:", preview.force_title)
print("absolute_image:", preview.absolute_image)
print("site_name:", preview.site_name)
When I run it, it sits there for about 20 seconds and then raises a TimeoutError. But if a User-Agent header such as Mozilla/5.0 were provided by default, running the script would output the following:
title: Pranking my grandma #foryou #fyp | TikTok
description: 1.7M Likes, 11.8K Comments. TikTok video from David Lawson (@dtlawson16): "Pranking my grandma #foryou #fyp". original sound - David Lawson.
image: https://lf16-tiktok-web.tiktokcdn-us.com/obj/tiktok-web-tx/tiktok/webapp/main/webapp-desktop/045b2fc7c278b9a30dd0.png
force_title: Pranking my grandma #foryou #fyp | TikTok
absolute_image: https://lf16-tiktok-web.tiktokcdn-us.com/obj/tiktok-web-tx/tiktok/webapp/main/webapp-desktop/045b2fc7c278b9a30dd0.png
site_name: TikTok
I realize that I can do this myself by following the "Advanced" section of the README, but I think this would be a good thing to have built into linkpreview, since a lot of sites are set up to reject requests whose User-Agent header has a value like python-requests/2.28.1.
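For reference, the workaround described above boils down to supplying a browser-like User-Agent yourself. A minimal stand-alone sketch using only the standard library (this is a generic illustration, not linkpreview's API, and the User-Agent string is just an example value):

```python
import urllib.request

# Default client User-Agent values such as python-requests/2.28.1 are
# rejected by many sites; a browser-like value usually gets through.
url = "https://www.tiktok.com/@dtlawson16/video/6766598342848941318"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

# urllib normalizes header names to Capitalized-lowercase form.
print(req.get_header("User-agent"))
# → Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

The request is only prepared here, not sent; the point is just that the header override replaces the library default.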
@ataylor32 Thanks for the suggestion; default headers are now implemented. However, your example still raises a TimeoutError, because TikTok uses HTTP/2, which is not supported by Requests, unless you set the x-requested-with: XMLHttpRequest header.
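Based on the comment above, the workaround amounts to sending a header set like the following sketch (the exact defaults linkpreview ships live in grabber.py and are not reproduced here; the User-Agent value is an illustrative placeholder):

```python
import urllib.request

# Hypothetical header set reflecting the maintainer's comment: a
# browser-like User-Agent, plus x-requested-with, which makes TikTok
# respond over HTTP/1.x so clients without HTTP/2 support (like
# Requests) can fetch the page.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "x-requested-with": "XMLHttpRequest",
}

req = urllib.request.Request(
    "https://www.tiktok.com/@dtlawson16/video/6766598342848941318",
    headers=headers,
)
print(sorted(req.headers))
# → ['User-agent', 'X-requested-with']
```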
Thank you! I just ran my example script (the one with the TikTok URL) using linkpreview 0.6.0 and it worked. I'm not sure why you got a TimeoutError and I didn't.
Am I misunderstanding something here? I get …
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three
I switched to using …
from colorama import Fore
from linkpreview import Link, LinkGrabber, LinkPreview

grabber = LinkGrabber(
    initial_timeout=20,
    maxsize=1048576,
    receive_timeout=10,
    chunk_size=1024,
)
content, URL = grabber.get_content(URL)
link = Link(URL, content)
preview = LinkPreview(link, parser="lxml")
print("title:", Fore.GREEN + preview.title + Fore.WHITE)
I can see a header is being used by default (?) in grabber.py.
Yet I still get the error. Am I missing something?
Yet I still get the error. Am I missing something?
No, the 403 comes from Cloudflare.
v0.9.0 is now released with better headers support. In your case, use this:
content, URL = grabber.get_content(URL, headers="imessagebot")
OK thanks. However, as I found, headers alone do not solve 403s. I ended up using Microsoft Playwright for the grabbing part and it works great! See https://github.com/Michael-Z-Freeman/word-link-preview
Hello! This may be a separate issue/question, but is it possible to parse something behind Cloudflare? This is an example of such a URL:
$ curl -I https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdev.14129
HTTP/2 403
...
server: cloudflare
@pothitos Hi, maybe you should try something like https://github.com/FlareSolverr/FlareSolverr.