Few search engines crashing the app, taking too much time to parse, e.g. Parsijoo.
rupav opened this issue · 22 comments
I'm submitting a ...
- bug report
- feature request
Expected behavior:
The app should stop parsing if it is taking too much time.
Steps to reproduce:
Go to https://query-server.herokuapp.com/, search chelsea on google or parsijoo, and select the news option. The rest of the search engines work fine.
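One way to get the expected behavior above is to put a time budget around each scraper call. The following is only a minimal sketch using the standard library; `run_with_deadline`, `_POOL`, and the two stand-in scrapers are hypothetical names, not part of the project. For the HTTP request itself, `requests.get` also accepts a `timeout` argument, which may be the simpler fix.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical shared worker pool; the name and size are illustrative only.
_POOL = ThreadPoolExecutor(max_workers=4)


def run_with_deadline(scrape, query, seconds=5.0):
    """Return scrape(query), or [] if it fails or exceeds the deadline."""
    future = _POOL.submit(scrape, query)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        # The worker thread keeps running; cancel() only helps if it never started.
        future.cancel()
        return []
    except Exception:
        # Any scraper error is treated as "no results" rather than a crash.
        return []


def slow_scraper(query):
    """Stand-in for a scraper that hangs (hypothetical, for demonstration)."""
    time.sleep(0.5)
    return [query]


def fast_scraper(query):
    """Stand-in for a scraper that answers immediately (hypothetical)."""
    return [query]
```

Note that this only stops the caller from waiting; the slow worker still runs to completion in the background, so a request-level timeout would still be needed to free resources.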
I am working on it
Google News search has not been implemented yet.
For Parsijoo, go to: http://khabar.parsijoo.ir/search?q=chelsea There isn't any result corresponding to chelsea in News search.
@bhaveshAn so what do you suggest should be done in such cases? These are crashing the app.
@cclauss @realslimshanky @vaibhavsingh97 please add your suggestions too.
Yes, you can send a PR to deal with these undesirable cases.
@rupav also, for google news there should be another issue so please remove the mention of google news from the issue name to avoid confusion.
@realslimshanky updated.
Sometimes some search engine URLs redirect you to their captcha page.
This is why it keeps on loading forever.
Thanks @raju249 for your input, can you give some suggestion on where to look to solve this?
For example, looking at the logs, I found that a GET request to such a site (e.g. parsijoo with the query chelsea) produces a 404 error.
So I was trying to print and play around with requests.get(url,...).status_code to handle the 404 error case. https://stackoverflow.com/questions/15258728/requests-how-to-tell-if-youre-getting-a-404
But it's not getting printed, even in my cmd. I tried again, thinking that clearing the cache would help, but again nothing happened.
Any hack around @raju249 @shashank-sharma ?
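The status check described above could look roughly like this. It is a sketch only: `results_or_empty` is a hypothetical helper, and the `SimpleNamespace` objects stand in for what `requests.get(url, ...)` would return, so the example runs without network access.

```python
from types import SimpleNamespace


def results_or_empty(response, parse):
    """Parse only a 200 response; treat 404 and other statuses as no results."""
    if response.status_code != 200:
        return []
    return parse(response.text)


# Stand-ins for real responses (assumption: only .status_code and .text are used).
not_found = SimpleNamespace(status_code=404, text="Not Found")
ok = SimpleNamespace(status_code=200, text="chelsea wins")

print(results_or_empty(not_found, str.split))  # []
print(results_or_empty(ok, str.split))         # ['chelsea', 'wins']
```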
Not sure about Parsijoo.
But try printing the response content from google.
I think they may be redirecting to a captcha page, or they may have identified these as automated requests and blocked access to your IP.
Just try print response.text to see what content is printed. Check the body specifically to see if it's HTML.
Let me know what works.
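A rough way to automate that check is to look for tell-tale text in the response body. This is a heuristic sketch: `looks_like_captcha` is a hypothetical helper, and the marker strings are assumptions about what a block page might contain, not a documented contract of any search engine.

```python
# Assumed marker strings; adjust after inspecting a real blocked response.
CAPTCHA_MARKERS = ("captcha", "unusual traffic")


def looks_like_captcha(body):
    """Return True if the response body resembles a captcha or block page."""
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

If this returns True for `response.text`, the scraper could return an empty result set immediately instead of trying to parse the page.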
I think this keeps the server in an infinite request-response loop and finally the application crashes.
Please make this a priority @vaibhavsingh97 .
It's hampering the clients using this server.
OK, will try. Will it be fine if I send you the response by evening at the latest? I need to rush to college.
Agreed @raju249 , this is indeed a priority issue. With such a query, the heroku app crashes, and I am not able to access it for a few hours.
Yeah sure @rupav
Take your time.
I would investigate from my side as well.
Not claiming the issue, but would investigate.
I would suggest using the Google News API for news searches on google instead of scraping the web page.
This would be faster and also won't redirect to any captcha pages.
And I think we can stay inside the quota limit.
Let me know what you guys think.
@mariobehling @vaibhavsingh97 @anshumanv
@raju249 solved it. Give me time, I will send a PR by tomorrow evening at the latest (got ill today).
Actually I wasn't using python server.py --dev to show console output. So silly of me.
Also, the response.status_code check I used to verify is working fine!
Did you use the Google News API?
Or did you solve the existing issue with the 404?
using 404 validation
Limitations as in ?
Cost ?
@raju249 Yes, and how many free requests can we make?
So only 1000 requests per day are allowed, and we have to attribute News API too if you are in developer circle.
Ref: https://newsapi.org/pricing
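To stay inside a daily limit like the one mentioned above, the server could count its own outgoing calls. A minimal sketch, assuming a fixed per-day allowance that resets on date change; `DailyQuota` is a hypothetical class and keeps its count in memory only, so it would reset whenever the app restarts.

```python
import datetime


class DailyQuota:
    """Count requests per calendar day and refuse once the limit is reached."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and consume one request if quota remains for today."""
        today = today or datetime.date.today()
        if today != self.day:
            # New day: reset the counter.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

A caller would check `quota.try_acquire()` before each API request and fall back to another engine (or an empty result) when it returns False.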
How many hits do we get per day (in general, not just news)?
@vaibhavsingh97