fossasia/query-server

A few search engines crash the app by taking too long to parse, e.g. Parsijoo.

rupav opened this issue · 22 comments

rupav commented

I'm submitting a ...

  • bug report
  • feature request

Expected behavior:

The app should stop parsing if it is taking too much time.
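One way to picture this expected behavior is a deadline around the parsing call. The sketch below is only an illustration using the standard library; `run_with_deadline` and `slow_parse` are hypothetical names, not functions from query-server. If the scraper fetches pages with `requests`, passing `timeout=` to `requests.get` is a simpler first step for the network part.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_deadline(func, seconds, *args):
    """Run func(*args) in a worker thread and give up after `seconds`.

    Returns None on timeout. Python cannot kill the worker thread, so the
    slow call keeps running in the background; the caller just stops waiting.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None
    finally:
        # Do not block here waiting for the slow call to finish.
        pool.shutdown(wait=False)


def slow_parse(query):
    """Hypothetical stand-in for a scraper that takes too long."""
    time.sleep(0.5)
    return [query]
```

Here a caller would treat `None` as "no results in time" and show the user a message instead of hanging.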

Steps to reproduce:
Go to https://query-server.herokuapp.com/, search for chelsea on Google or Parsijoo, and select the news option. The rest of the search engines work fine.

[Screenshot: 2018-02-02 10 50 33]

I am working on it

Google News search has not been implemented yet.
For Parsijoo, go to: http://khabar.parsijoo.ir/search?q=chelsea There aren't any results for chelsea in News search.

rupav commented

@bhaveshAn so what do you suggest should be done in such cases? Because these are crashing the app.
@cclauss @realslimshanky @vaibhavsingh97 please add your suggestions too.

Yes, you can send a PR to deal with these undesirable cases.

@rupav Also, Google News should get its own issue, so please remove the mention of Google News from the issue title to avoid confusion.

rupav commented

Sometimes some search engine URLs redirect you to their captcha page.
This is why it keeps on loading forever.

rupav commented

Thanks @raju249 for your input. Can you suggest where to look to solve this?
For example, looking at the logs, I saw that the GET request to such a site (e.g. Parsijoo with the query chelsea) produces a 404 error.
So I was trying to print and play around with requests.get(url,...).status_code to handle the 404 case. https://stackoverflow.com/questions/15258728/requests-how-to-tell-if-youre-getting-a-404
But it's not getting printed, even in my cmd. I thought clearing the cache would help and tried again, but still nothing happened.
Any hack around this, @raju249 @shashank-sharma?
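The status-code check described here could look roughly like the sketch below. `fetch_html` and its `http_get` parameter are hypothetical names used for illustration; `http_get` is assumed to be a callable shaped like `requests.get`, so the check can be tried without network access.

```python
def fetch_html(url, http_get, timeout=10):
    """Return the page HTML, or None when the status is not 200.

    `http_get` is assumed to behave like requests.get: it takes a URL and a
    `timeout` keyword and returns an object with `status_code` and `text`.
    """
    response = http_get(url, timeout=timeout)
    if response.status_code != 200:
        # Log and skip instead of handing an error page to the parser.
        print("Skipping %s: HTTP %s" % (url, response.status_code))
        return None
    return response.text
```

With the real library this would be called as `fetch_html("http://khabar.parsijoo.ir/search?q=chelsea", requests.get)`, and the caller would return an empty result list when it gets `None`.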

Not sure about Parsijoo.
But try printing the response content from Google.
I think they may be redirecting to a captcha page, or they may have identified these as automated requests and blocked access to your IP.
Just try print response.text to see what content is printed. Specifically, check whether the body is HTML.
Let me know what works.
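A small helper can gather the things worth checking in one place. This is only a sketch: `summarize_response` is a hypothetical name, and the "captcha" substring test is an assumed heuristic, not something a search engine guarantees. It relies only on attributes a `requests` response exposes (`status_code`, `url`, `history`, `headers`, `text`).

```python
def summarize_response(response, preview_chars=200):
    """Collect the fields worth eyeballing when a scrape misbehaves."""
    body = response.text
    content_type = response.headers.get("Content-Type", "")
    return {
        "status": response.status_code,
        "final_url": response.url,
        # requests stores followed redirects in response.history
        "redirected": bool(response.history),
        "is_html": "html" in content_type.lower(),
        # Heuristic (assumption): captcha pages usually mention the word
        "mentions_captcha": "captcha" in body.lower()
        or "captcha" in response.url.lower(),
        "preview": body[:preview_chars],
    }
```

Printing `summarize_response(requests.get(url, timeout=10))` for the failing query would show whether the request was redirected and what kind of page came back.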

I think this keeps the server in an infinite request-response loop, and finally the application crashes.
Please make this a priority, @vaibhavsingh97.
It's hampering the clients using this server.

rupav commented

Ok, will try. Will it be fine if I send you the response by this evening at the latest? I need to rush to college.

rupav commented

Agreed @raju249, this is indeed a priority issue. With such a query the Heroku app crashes, and I am not able to access it for a few hours.

Yeah sure @rupav
Take your time.
I would investigate from my side as well.
Not claiming the issue, but would investigate.

I would suggest using the Google News API for news searches on Google instead of scraping the web page.
This would be faster and also won't redirect to any captcha pages.
And I think we can stay inside the quota limit.

Let me know what you guys think.
@mariobehling @vaibhavsingh97 @anshumanv

rupav commented

@raju249 solved it. Give me some time, I will send a PR by tomorrow evening at the latest (got ill today 😫).
Actually, I wasn't using python server.py --dev to show console output. So silly of me 😅.
Also, the response.status_code check I used to verify is working fine!

Did you use the Google News API?
Or did you solve the existing issue with the 404?

rupav commented

using 404 validation

@raju249 Google News API is a good idea 👍 but what are its limitations? @rupav Nice catch, thanks for working on the issue. Also, regarding captcha, we can track this too 👍, send the user some feedback, and prevent the app from crashing.

Limitations as in?
Cost?

@raju249 Yes, and how many free requests can we make?

So only 1000 requests per day are allowed, and we also have to attribute News API if we are on the developer tier.
Ref: https://newsapi.org/pricing

How many hits do we get per day (in general, not just news)?
@vaibhavsingh97