MikeMeliz/TorCrawl.py

Byte-String Error

zyndykyt opened this issue · 5 comments

When trying to run the script with python3, I come across this error. Any advice on how to resolve it?

$ python3 torcrawl.py -u http://RaNdoMOnioN.onion
Traceback (most recent call last):
  File "torcrawl.py", line 193, in <module>
    main()
  File "torcrawl.py", line 161, in main
    checktor(args.verbose)
  File "/home/zurs/TorCrawl.py/modules/checker.py", line 50, in checktor
    if findwholeword('tor')(checkfortor):
TypeError: cannot use a string pattern on a bytes-like object

Additionally, when I attempt to pull from a .txt file using (-i) as shown in the ReadMe, I get the error that a -u/--URL is required. Where could I alter the script to allow me to use -i as my sole input?

Thanks.

Hey @jtank13 thanks for the feedback!

I've confirmed both bugs on my side too. You can find some solutions below; if they also work for you, I'll push them with the next commit.

When trying to run python3, I come across this error. Any advice on how to resolve.
...
TypeError: cannot use a string pattern on a bytes-like object

A simple workaround is to convert it to a string. Could you try changing line 50 of checker.py as follows: if findwholeword('tor')(str(checkfortor)):
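To make the cause concrete: on Python 3, output captured from a subprocess is bytes, and a pattern compiled from a str cannot search it. The sketch below reproduces the error with a made-up ps-style sample and shows two ways around it. The findwholeword helper here is only a guess at what the one in checker.py does (a whole-word regex search), not a copy of it, and sample_output is invented for illustration.

```python
import re

def findwholeword(word):
    # Hypothetical stand-in for the helper in checker.py:
    # returns a search function that matches `word` as a whole word.
    return re.compile(r'\b({0})\b'.format(word), flags=re.IGNORECASE).search

# Invented sample; on Python 3, subprocess output arrives as bytes like this.
sample_output = b"  PID TTY          TIME CMD\n  812 ?        00:00:03 tor\n"

try:
    findwholeword('tor')(sample_output)
    raised = False
except TypeError:
    # "cannot use a string pattern on a bytes-like object"
    raised = True

# Workaround from the thread: str() turns the bytes into their repr text.
found_with_str = findwholeword('tor')(str(sample_output)) is not None

# Alternative: decode the bytes into real text before searching.
decoded = sample_output.decode('utf-8', errors='replace')
found_with_decode = findwholeword('tor')(decoded) is not None
```

Both variants find the word here. One caveat with str(): it keeps escape sequences such as \n as literal characters, so a word sitting directly after a newline would have an "n" glued to its front and no longer match as a whole word. Decoding avoids that.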

Additionally, when I attempt to pull from a .txt file using (-i) as shown in the ReadMe, I get the error that a -u/--URL is required. Where could I alter the script to allow me to use -i as my sole input?

Line 90 of torcrawl.py isn't necessary and should be deleted.
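In general terms, the fix for the -i case is to stop treating -u as mandatory and instead check that at least one input source was given. The following is only a sketch of that idea with argparse. The short options mirror the ones mentioned in this thread, but the long option names, the parser layout and the build_parser/parse_input helpers are hypothetical and not taken from torcrawl.py.

```python
import argparse

def build_parser():
    # Hypothetical parser; only the -u and -i options from the thread are modelled.
    parser = argparse.ArgumentParser(prog="torcrawl.py")
    parser.add_argument("-u", "--url", help="URL of the site to crawl")
    parser.add_argument("-i", "--input", help="text file with one URL per line")
    return parser

def parse_input(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Neither option is required on its own, but one of them must be present.
    if not args.url and not args.input:
        parser.error("either -u/--url or -i/--input is required")
    return args
```

With this shape, parse_input(["-i", "urls.txt"]) is accepted on its own, while an empty argument list still exits with a usage error.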

Hi Mike

Not sure if OP's issue was sorted but I was having basically the exact same issue as the first one mentioned (TypeError).

I implemented both of your suggestions and it seems to have fixed it all up just fine. Just thought I'd let you know.

As a side note (and probably beyond the scope of this app) it might be worth noting that I'm running this within WSL 2 and it doesn't seem to notice I have Tor running.
I usually start the service in my default distro and it works across all the other distros (other apps, including Windows, kinda "share" the service), but Torcrawl will only work inside the distro I specifically started the Tor service from.
Again, probably well beyond the scope of this app but just thought I'd mention it.

Thanks for your work. This has been super handy and is much appreciated!

Hey @0m3rta13 ,

Thanks for taking the time to confirm that those solutions worked for you; I'll do my best to commit those changes ASAP. The community's feedback is greatly appreciated at this point: after migrating the project to python3, several issues have come up, and I'm trying, the hard way, to hunt them down.

Regarding WSL2 and the Tor service, the script does a simple check with ps -e (checker.py) to see whether tor.service is actually running. Would you mind running this command directly in the WSL terminal? I've made a note to check it on my side too, and to develop a more robust way to check the service's status.

Thank you again for your effort and your kind words, I'll let you know with my results!

Hi @MikeMeliz

No problem regarding the feedback. Probably the least that I can do considering how useful this script has been for me.

As far as the tor check goes your way of checking probably makes the most sense. It doesn't show up when checked with that command which makes sense as the service is owned by the other distro and the services aren't shared, only the sockets (I think that's how it works with WSL 2 at least).
I don't think it would make much sense for you to change how the script works to accommodate this, given the way WSL works and the fact that it's still under very active development: changes are made often and things tend to break. You might end up having to fix it every time some small change lands, which would probably just get annoying, at least until WSL 2 reaches a more stable state. I dunno. That's just my opinion.

For what it's worth though the way I check to see if tor is working or active across distros is by running

curl --socks5 localhost:9050 --socks5-hostname localhost:9050 -s https://check.torproject.org/ | cat | grep -m 1 Congratulations | xargs

Can't imagine that will be too efficient to implement within the script though.

I've also found that the delay doesn't seem to work. No matter what I set for the -p flag it's always set to 2 seconds. I think I know why this is, so I will have another look at the script when I get a chance later.

Lastly, something I think would be very useful to add would be setting a user agent. This would remove the limitations when crawling some sites that sometimes seem to block crawl attempts. Haven't tried to implement this before myself but will have a go at it and if it works out I'll throw up a PR

Thanks again for all your work!

EDIT: Oh I forgot. Also the -o flag doesn't seem to do anything either. Granted, it still creates a folder with the site's name and creates the files inside it, so it works fine in terms of saving output, which is still perfect, but the -o flag is seemingly non-functional. At least for me.

Hi @0m3rta13,

Thank you again for your thoughtful feedback and your patience with my late responses; it's greatly appreciated!

As far as the tor check goes your way of checking probably makes the most sense. It doesn't show up when checked with that command which makes sense as the service is owned by the other distro and the services aren't shared, only the sockets (I think that's how it works with WSL 2 at least).
...
For what it's worth though the way I check to see if tor is working or active across distros is by running
curl --socks5 localhost:9050 --socks5-hostname localhost:9050 -s https://check.torproject.org/ | cat | grep -m 1 Congratulations | xargs

I can understand the frustration with WSL2, as I'm using it myself and have found various issues here and there. You're definitely right about keeping up with the changes in WSL2, though. Personally, I start tor.service directly from WSL2 and the script seems to detect that the service is running, but your way of checking the Tor service via a curl request is actually a pretty good idea! I'll soon change the Tor check from ps -e to a request to https://check.torproject.org/; hopefully I'll find a suitable way to do this, and it'll fix similar issues with shared services. Until then, a cheap way to bypass the ps -e check would be to change the whole checktor function to just return True.
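Since the point raised in this thread is that WSL distros share the sockets rather than the process list, one lighter option than a full request through Tor would be to test whether anything accepts connections on the SOCKS port. This is a different and weaker check than the curl command quoted above: it only shows that a listener exists on the port, not that it is Tor or that the circuit works. The function name socks_port_open is hypothetical, and port 9050 is simply the one used in the curl example.

```python
import socket

def socks_port_open(host="localhost", port=9050, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check like this does not depend on which distro owns the process, which is what makes the ps -e approach miss a Tor service started elsewhere.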

I've also found that the delay doesn't seem to work. No matter what I set for the -p flag it's always set to 2 seconds. I think I know why this is, so I will have another look at the script when I get a chance later.

Regarding -p, there was a typo in torcrawl.py; thanks for reporting it, I've fixed it. Note that the delay applies when the script moves on to the next "depth" of links, though. Let me know if you'd prefer a delay between every request instead.
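To illustrate the difference between the two placements of the delay, here is a rough sketch of a breadth-first crawl loop. It is not the actual crawler code: crawl, get_links and per_request are hypothetical names, and the sleep function is passed in only so the behaviour can be observed without really waiting.

```python
import time

def crawl(start, get_links, depth, pause, per_request=False, sleep=time.sleep):
    """Breadth-first walk that pauses either once per depth level or per request."""
    seen = {start}
    frontier = [start]
    for _level in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
            if per_request:
                sleep(pause)  # pause after every fetched page
        if not per_request:
            sleep(pause)      # pause once before moving a level deeper
        frontier = next_frontier
    return seen
```

With the per-depth placement, the total waiting time grows with the depth only; with the per-request placement, it grows with the number of pages fetched, which is gentler on the target site but slower overall.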

Lastly, something I think would be very useful to add would be setting a user agent. This would remove the limitations when crawling some sites that sometimes seem to block crawl attempts. Haven't tried to implement this before myself but will have a go at it and if it works out I'll throw up a PR

That's also a great idea, and something that has bugged me for a long time, as it's a feature I've been wanting myself. I'll start working on it, since it's a simple addition to the script, and I'll also try to find a way to offer a --random-agent feature.
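As a rough idea of what such a feature could look like, the sketch below builds a request with either a fixed or a randomly picked User-Agent header using the standard library. It assumes the crawler fetches pages through urllib.request, which may not match the real code; build_request, its parameters and the USER_AGENTS list are hypothetical, and the strings in the list are only illustrative.

```python
import random
import urllib.request

# Illustrative strings only; a real list would be maintained separately.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0",
]

def build_request(url, user_agent=None, random_agent=False):
    """Create a urllib Request carrying a custom or randomly chosen User-Agent."""
    if random_agent:
        user_agent = random.choice(USER_AGENTS)
    # Without a custom value, urllib falls back to its default agent on open.
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to urllib.request.urlopen in place of a plain URL string, so the change stays local to wherever requests are created.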

EDIT: Oh I forgot. Also the -o flag doesn't seem to do anything either. Granted, it still creates a folder with the site's name and creates the files inside it, so it works fine in terms of saving output, which is still perfect, but the -o flag is seemingly non-functional. At least for me.

Well, I added the -o argument early in development to export a single page to a file. E.g. python torcrawl.py -v -w -u http://www.github.com/ -o git.htm seems to work. I know it's not the most useful parameter right now, but it's a quick alternative to python torcrawl.py -v -w -u http://www.github.com/ >> path/git.htm.

So, thank you again; I'll push the changes shortly! And I'll keep you updated when the new features arrive!