String Index Error on perfectly normal URLs
rivermont opened this issue · 1 comment
rivermont commented
Checklist
- Same issue has not been opened before.
Expected Behavior
No errors.
Actual Behavior
Seemingly at random, crawling a URL fails with a
string index out of range
error. There doesn't appear to be anything wrong with the URLs themselves:
http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft
Steps to Reproduce the Problem
- Run the crawler.
- Wait a few seconds.
What I've tried so far
Re-raising the error produced this traceback:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "crawler.py", line 260, in crawl_worker
if link[0] == '/':
IndexError: string index out of range
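The failure can be reproduced in isolation: indexing an empty string raises the same error, which suggests an empty link string is reaching the `link[0]` check (a hypothesis based on the traceback, not confirmed against the crawler's code):

```python
link = ''  # an empty href, as might be extracted from a page
try:
    link[0] == '/'
except IndexError as e:
    print(e)  # string index out of range
```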
Specifications
- Crawler Version: 1.6.0
- Platform: Linux (Ubuntu 16.04 LTS)
- Dependency Versions: All latest
Hrily commented
This happens because some of the crawled links are empty strings, so `link[0]` raises an IndexError.
I'll send a PR that adds an empty-link check.
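A minimal sketch of such a check, assuming a helper around the `link[0] == '/'` test from the traceback (the function name, `base_url` parameter, and resolution logic here are illustrative, not the actual PR):

```python
def resolve_link(link, base_url):
    """Return an absolute URL for a crawled link, or None if the link is empty."""
    if not link:  # guard: '' or None would raise IndexError on link[0]
        return None
    if link[0] == '/':  # relative link; resolve against the site's base URL
        return base_url.rstrip('/') + link
    return link
```

Checking `if not link` before indexing covers both empty strings and `None`, so the worker thread skips the link instead of crashing.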