rivermont/spidy

String Index Error on perfectly normal URLs

rivermont opened this issue · 1 comment

Checklist

  • Same issue has not been opened before.

Expected Behavior

No errors.

Actual Behavior

Seemingly at random, crawling a URL will fail with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

  1. Run the crawler.
  2. Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range

Specifications

  • Crawler Version: 1.6.0
  • Platform: Linux (Ubuntu 16.04 LTS)
  • Dependency Versions: All latest
Hrily commented

This happens because some of the links crawled are empty.

I'll send a PR with empty link checking.
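A minimal sketch of the guard Hrily describes, assuming the crawler resolves relative links by checking `link[0] == '/'` as in the traceback above (the function and parameter names here are hypothetical, not spidy's actual API):

```python
def resolve_link(link, base_url):
    """Return an absolute URL for `link`, or None if the link is empty.

    Hypothetical helper illustrating the empty-link check; indexing
    link[0] on an empty string is what raised the IndexError.
    """
    if not link:  # guard: empty links come back from some pages
        return None
    if link[0] == '/':  # relative link, as checked in crawl_worker
        return base_url.rstrip('/') + link
    return link
```

With the guard in place, an empty link is skipped (`None`) instead of raising `IndexError: string index out of range`.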