rivermont/spidy

String Index Error on perfectly normal URLs

rivermont opened this issue · 1 comment

Checklist

  • Same issue has not been opened before.

Expected Behavior

No errors.

Actual Behavior

Seemingly at random, crawling a URL will fail with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

  1. Run the crawler.
  2. Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range

Specifications

  • Crawler Version: 1.6.0
  • Platform: Linux (Ubuntu 16.04 LTS)
  • Dependency Versions: All latest
Hrily commented

This happens because some of the links crawled are empty.

I'll send a PR with empty link checking.
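A minimal sketch of the guard Hrily describes, assuming the crawler resolves relative links by checking `link[0] == '/'` as in the traceback above (the function and parameter names here are hypothetical, not spidy's actual API):

```python
def resolve_link(link, base_url):
    """Return an absolute URL for `link`, or None if the link is empty.

    Hypothetical helper illustrating the empty-link check; indexing
    link[0] on an empty string is what raised the IndexError.
    """
    if not link:  # guard: empty links come back from some pages
        return None
    if link[0] == '/':  # relative link, as checked in crawl_worker
        return base_url.rstrip('/') + link
    return link
```

With the guard in place, an empty link is skipped (`None`) instead of raising `IndexError: string index out of range`.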