ContentMine/quickscrape

QS hangs indefinitely

Closed this issue · 6 comments

@petermr commented on Thu Jun 02 2016

This URL has been retried and fails for ca 10 mins...

http://dx.doi.org/10.4172/2157-7471.1000s4-003

@blahah commented on Thu Jun 02 2016

should be in the quickscrape repo?


@tarrow commented on Thu Jun 02 2016

Yes, I will move it :)

Here are some DOI roots that hang. It's possible that some of these are due to paywalls.

http://dx.doi.org/10.1017/s0030605316000028
http://dx.doi.org/10.5376/pgt.2015.06.0009
http://dx.doi.org/10.5586/asbp.2006.008
http://dx.doi.org/10.4172/2157-7471.1000s4-003
http://dx.doi.org/10.2903/j.efsa.2013.3069
http://dx.doi.org/10.11623/frj.2013.21.3.25
http://dx.doi.org/10.1094/pdis-02-11-0078-sr.testissue
http://dx.doi.org/10.1017/s0021859613000543
http://dx.doi.org/10.1007/s40858-015-0043-7

For the first URL we get a 520 error from Cloudflare. I'm not sure why we get this error page when we use curl instead of thresher. The error looks like this:

{ request: 
   { debugId: 1,
     uri: 'http://journals.cambridge.org//abstract_S0030605316000028',
     method: 'GET',
     headers: 
      { referer: 'http://journals.cambridge.org//abstract_S0030605316000028',
        host: 'journals.cambridge.org' } } }
{ response: 
   { debugId: 1,
     headers: 
      { date: 'Tue, 14 Jun 2016 08:48:17 GMT',
        'content-type': 'text/html; charset=UTF-8',
        'transfer-encoding': 'chunked',
        connection: 'close',
        'set-cookie': [Object],
        pragma: 'no-cache',
        'x-frame-options': 'SAMEORIGIN',
        server: 'cloudflare-nginx',
        'cf-ray': '2b2c85b94b9f0a6c-LHR' },
     statusCode: 520,

Unfortunately since we don't know the final origin there isn't much we can do. I'm now working on making thresher just move on from these errors to the next url.
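As a rough illustration of "move on to the next url", here is a minimal sketch in plain Node.js. It is not thresher's actual code: `withTimeout`, `scrapeAll` and the injected `fetchPage` function are hypothetical names, and the timeout value is whatever the caller chooses. Each URL gets a time limit, and any error or timeout is recorded rather than allowed to stall the run.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Process URLs one at a time; record failures and carry on with the next URL.
// `fetchPage` is a stand-in for whatever actually loads and scrapes a page.
async function scrapeAll(urls, fetchPage, timeoutMs) {
  const results = [];
  for (const url of urls) {
    try {
      const data = await withTimeout(fetchPage(url), timeoutMs);
      results.push({ url, ok: true, data });
    } catch (err) {
      // HTTP errors (such as a 520) and timeouts both end up here.
      results.push({ url, ok: false, error: err.message });
    }
  }
  return results;
}
```

The returned array keeps one entry per URL, so failures can be logged afterwards instead of being lost.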

Thanks - this is a useful first step.
We need enough output that we can log this and - perhaps - create blacklists. E.g. if a publisher fails consistently - say 100/100 - then it's a waste to continue. If they fail 50/100 it will depend on the cost of timeouts. If it's 1/100 then it's worthwhile.
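One possible shape for that logging, sketched in plain Node.js. Everything here is an assumption for illustration: the `FailureLog` class, the 0.9 failure-rate threshold and the 20-attempt minimum are not part of quickscrape or thresher. Counts are keyed on the host of the resolved URL (e.g. journals.cambridge.org), since every DOI link shares the dx.doi.org host.

```javascript
// Tally outcomes per publisher host and decide whether to keep trying it.
class FailureLog {
  constructor() {
    this.stats = new Map(); // host -> { attempts, failures }
  }

  // Record one attempt against the host of a (resolved) URL.
  record(url, ok) {
    const host = new URL(url).hostname;
    const s = this.stats.get(host) || { attempts: 0, failures: 0 };
    s.attempts += 1;
    if (!ok) s.failures += 1;
    this.stats.set(host, s);
  }

  // Fraction of attempts that failed; 0 for a host we have never seen.
  failureRate(host) {
    const s = this.stats.get(host);
    return s ? s.failures / s.attempts : 0;
  }

  // Skip a host only after `minAttempts` tries, so one bad URL does not condemn it.
  // Both default thresholds are illustrative guesses, not measured values.
  shouldSkip(host, maxFailureRate = 0.9, minAttempts = 20) {
    const s = this.stats.get(host);
    return Boolean(s) && s.attempts >= minAttempts &&
      this.failureRate(host) >= maxFailureRate;
  }
}
```

A host failing 100/100 would be skipped under these defaults, one failing 1/100 would not, and the 50/100 case depends on where `maxFailureRate` is set, which is where the cost of timeouts would come in.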

This is almost certainly a duplicate of #62