MikeMeliz/TorCrawl.py

How can I extract images with this crawler?

Omid-r opened this issue · 14 comments

Hello guys,
I want to extract images with TorCrawl, but I can't edit extractor.py properly because I don't know how the functions in that file work.
Please help me.

Hello Omid-r!

Thanks for posting an issue for my script 👍 I'd be glad to help you with that!

As you can see in modules/crawler.py, I made a function excludes (line 36) to exclude some types of links from further crawling. For some useful links (external links, telephone numbers, emails), though, I write the URLs to a .txt file for other uses.
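
To give you an idea, that filter works roughly like the simplified sketch below (this is not the exact code from the repo, so check crawler.py for the real signature and rules):

# Simplified sketch of an excludes()-style filter: return True when a link
# should be skipped by the crawler. Names and rules here are illustrative only.
def excludes(link, website):
    if link is None:
        return True
    # Skip anchors, mail and telephone links
    if link.startswith('#') or link.startswith('mailto:') or link.startswith('tel:'):
        return True
    # Skip external links (different domain than the crawled website)
    if link.startswith('http') and website not in link:
        return True
    # Skip links to static files such as images and documents
    for extension in ('.png', '.jpg', '.jpeg', '.gif', '.pdf', '.zip'):
        if link.lower().endswith(extension):
            return True
    return False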

Would a .txt with the images' URLs work for your case, or are you looking to save the images into a folder?

Thanks a lot, Mike 👍
Yes, I want to save the images into a folder, but I don't know how to do that. Please guide me.
How can I use line 36 in the main program (torcrawl.py), or in extractor.py?

I found a quick workaround/starting point, but I'm really not sure about the results:

Replace lines 105-112 of modules/crawler.py with this:

# For each <img src="">, write the canonical image URL to images.txt
outputfile = open(outpath + '/images.txt', 'w+')
for img in soup.findAll('img'):
    imglink = img.get('src')
    if imglink is not None:
        canonicalimage = canonical(imglink, website)
        outputfile.write(canonicalimage + '\n')
outputfile.close()

That part will create a list of the image URLs inside the website's output folder.

So a combination of TorCrawl and wget will probably work for you:
$ python torcrawl.py -w -v -u github.com -c && wget -i www.github.com/images.txt
or
$ python torcrawl.py -v -u github.com -c && torsocks wget -i www.github.com/images.txt

Please let me know if you figure out a better way.

[screenshot]

๐Ÿ‘ : I 'm geting a syntax error when I using that ..

[screenshot of the syntax error]

Oops, of course.. 🤕 We're going to need a ':' on that loop!

I changed my previous comment; please check if it works better now.

By the way, I'll try to find a better way to extract the image links.

Hello Mike, thanks for your help, my friend!
I fixed that and it's running, but I get this error:
[screenshot of the error]

Hey Omid-r!

That's actually weird, because in crawler.py we import the module the proper way (reference).

You can probably try the other way of importing the module and change the following lines to:
modules/crawler.py(Line:8): import BeautifulSoup
modules/crawler.py(Line:93): soup = BeautifulSoup.BeautifulSoup(html_page)

Also, please check that BeautifulSoup is already installed, from the Python terminal:
>>> import BeautifulSoup
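
If your installation ships the newer package name (bs4) instead of the old BeautifulSoup module, a tolerant import like this might save you the back-and-forth. It's just a sketch, so adjust it to whatever you actually have installed:

# Sketch: try BeautifulSoup 4 (package name bs4) first, then fall back to
# the legacy BeautifulSoup 3 module. With this import style you call
# BeautifulSoup(html_page) directly instead of BeautifulSoup.BeautifulSoup(...).
try:
    from bs4 import BeautifulSoup            # pip install beautifulsoup4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3

soup = BeautifulSoup(html_page)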

I fixed that, thanks! How can I fix this one?
[screenshot of the error]

I'm glad to hear it worked well for you!

As I can see from your screenshot, you're on a MacBook, so it's normal that the wget command wasn't found.

There is a similar way on Stack Overflow (here) to download the list of images:
for i in `cat images.txt` ; do curl -O "$i" ; done

EDIT: This answer (here) with xargs seems better:
cat images.txt | xargs -n 1 curl -O
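
If you'd rather stay in Python instead of shelling out, a few lines can do the same job. This is only a rough sketch: it assumes Python 3 and that images.txt contains one absolute URL per line, and it won't route through Tor by itself:

# Sketch: download every URL listed in images.txt into the current folder.
# Assumes absolute URLs, one per line.
import os
import urllib.request

with open('images.txt') as urlfile:
    for line in urlfile:
        url = line.strip()
        if not url:
            continue
        # Derive a local filename from the URL path (fallback name if empty)
        filename = os.path.basename(url.split('?')[0]) or 'image'
        urllib.request.urlretrieve(url, filename)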

Hello Mike,
Please guide me. I handled the errors, but now this is weird:

[screenshot]

images.txt was created, but it's empty, and I get "No URLs found in www.github.com/images.txt."
I think something is wrong...

Hey @Omid-r! Sorry, but I was busy last week.
For the case of GitHub (which uses a sub-domain to host its images), you can try this:

/modules/crawler.py@105:

# For each <img src="">, write the raw src value to images.txt
outputfile = open(outpath + '/images.txt', 'w+')
for img in soup.findAll('img'):
    imglink = img.get('src')
    if imglink is not None:  # Some <img> tags have no src; can't figure out why this occurs
        outputfile.write(imglink + '\n')
outputfile.close()

Please keep in mind that this is a small workaround and not a final solution; edit the code for your specific scenario.
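
If the src values come back relative or protocol-relative (GitHub likes //avatars... style links), something along these lines could help to make them absolute before writing them out. Again, just a sketch: base_url is an assumption here, meaning the full URL of the page you crawled:

# Sketch: resolve relative and protocol-relative <img src=""> values against the page URL.
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def absolute_image_links(soup, base_url):
    links = []
    for img in soup.findAll('img'):
        src = img.get('src')
        if src:
            # urljoin handles 'images/logo.png', '/logo.png' and '//cdn.example.com/logo.png'
            links.append(urljoin(base_url, src))
    return links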

I'll close this issue.

  • If you need any further help, leave me a comment here.
  • If you find a successful way to crawl image links, make a pull request and I'll be happy to review and merge it :)