MikeMeliz/TorCrawl.py

Error with urllib2

DocKali opened this issue · 6 comments

Hi,

I can't use the tool because I always get the same error. I tried several .onion addresses and tested with both python2 and python3, but either way I can't run the tool.

The error I get is:

Error: <class 'urllib2.HTTPError'>
## Not valid URL 
## Did you forget 'http://'?

I didn't forget http://, and the .onion sites are up when I test the tool.
Where does the error come from? Any idea?

Thanks for your help!

Hey DocKali, thanks for reporting an issue!

I'll try to reproduce it this evening and get back to you with my results. Probably something is wrong with my poor "URL canonicalization" and .onion links :) It's something I've wanted to change for a long time.

Cheers

I don't know if this helps, but when I try to run your tool with python3, I get a different error:
Traceback (most recent call last):
  File "torcrawl.py", line 48, in <module>
    from modules.crawler import crawler
  File "/opt/TorCrawl.py/modules/crawler.py", line 89
    except urllib2.HTTPError, e:
                            ^
SyntaxError: invalid syntax
From what I can see, it's because urllib2 and python3 don't work together, and I found a tip here: https://stackoverflow.com/questions/41528403/syntaxerror-invalid-syntax-except-urllib2-httperror-e

Does it help you solve this problem?
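
For reference, here is a minimal sketch of the fix described in that Stack Overflow answer, not the actual TorCrawl code. It assumes a hypothetical helper named `fetch` and adds an import fallback, because Python 3 has no `urllib2` module (its contents moved to `urllib.request` and `urllib.error`). It does not route anything through Tor.

```python
# Import fallback: urllib2 only exists on Python 2.
try:
    from urllib.request import urlopen                 # Python 3
    from urllib.error import HTTPError, URLError
except ImportError:
    from urllib2 import urlopen, HTTPError, URLError   # Python 2


def fetch(url, timeout=10):
    """Hypothetical helper: return the response body, or None on failure."""
    try:
        return urlopen(url, timeout=timeout).read()
    # "except X as e" parses on Python 2.6+ and on Python 3.
    # The old form "except X, e" is Python 2 only and causes the SyntaxError.
    except (HTTPError, URLError) as e:
        print("Error: (%s) %s" % (e, url))
        return None
```

With this form the module at least imports on both interpreters; the rest of the tool may still need other python3 changes.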

That's true, you can't run it with python3. I'll try to port it to python3 at some point, but it's not really at the top of the list :)

I couldn't find a problem on my side. Can you try this:

mike@mike-vm:/TorCrawl.py$ python2 torcrawl.py -v -u https://www.facebookcorewwwi.onion/
## TOR is ready!
## Your IP: x.x.x.x
## URL: https://www.facebookcorewwwi.onion/
## Folder created: www.facebookcorewwwi.onion
<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
...

If that doesn't work, can you give me some details about your environment (OS, python/tor/library versions, etc.)?

Hi Mike,
Thanks for your answer. It's strange: when I try the URL you sent, everything is OK and I get the website's HTML doctype.
But every time I replace Facebook's URL with another one (Dread, for example), I get the following:

$ sudo python2 torcrawl.py -v -u http://dreadditevelidot.onion      
## TOR is ready!
## Your IP: 185.248.160.231
## URL: http://dreadditevelidot.onion
## Folder created: http://dreadditevelidot.onion
Error: <class 'urllib2.HTTPError'>
## Not valid URL 
## Did you forget 'http://'?

Here are the details you requested:
OS: SMP Debian 4.9.168-1+deb9u4
Python: Python 2.7.13
Tor: Tor 8.5.4
Which libs exactly do you need the versions of?

I got the same result with Dread, caused by an HTTP Error 403: Forbidden.

Try replacing lines 69 to 71 of modules/extractor.py with the following code and see if this happens with your other links (I will include it in the next commit):

	except (urllib2.HTTPError, urllib2.URLError) as e:
		print("Error: (%s) %s" % (e, website))
		return None
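
The point of that change is to print the real error instead of the generic "Not valid URL" hint. As a rough sketch of going one step further, the helper below separates an HTTP status error (such as the 403 from Dread) from a connection-level failure. The name `describe_error` is hypothetical and not part of TorCrawl; the import fallback is only there so the snippet also runs on python3.

```python
try:
    from urllib.error import HTTPError, URLError   # Python 3
except ImportError:
    from urllib2 import HTTPError, URLError        # Python 2


def describe_error(err, website):
    """Hypothetical helper: build a message that names the actual failure."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(err, HTTPError):
        # The server answered, but with an error status (403, 404, ...).
        return "Error: HTTP %d %s (%s)" % (err.code, err.msg, website)
    if isinstance(err, URLError):
        # No usable answer at all: DNS failure, refused connection, timeout.
        return "Error: could not reach %s (%s)" % (website, err.reason)
    return "Error: %s (%s)" % (err, website)
```

Called from the `except` branch, this would make a 403 show up as an HTTP error rather than as a malformed URL.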

My first guess would be that Dread blocks requests without a browser-like User-Agent header.

Hey @DocKali, I'll close this issue, as the problem doesn't seem to be related to the tool.
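
To test that guess, a request can carry an explicit User-Agent. By default urllib2 identifies itself as `Python-urllib/<version>`, which some sites reject. The sketch below is an assumption about how this could be tried, not what TorCrawl does: `build_request` and the `USER_AGENT` value are hypothetical, and whether Dread accepts it is unverified.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

# Hypothetical value: any browser-like string would do for this experiment.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"


def build_request(url, user_agent=USER_AGENT):
    """Hypothetical helper: a Request that sends an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})
```

The returned object can be passed to `urlopen` in place of the plain URL string.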

Feel free to re-open it if you have another issue with urllib2.

Cheers!