MikeMeliz/TorCrawl.py

How can I extract images with this crawler?

Omid-r opened this issue · 14 comments

Hello guys,
I want to extract images with TorCrawl, but I can't edit extractor.py properly because I don't know how the functions in that file work.
Please help me.

Hello Omid-r!

Thanks for posting an issue for my script 👍 I'd be glad to help you with that!

As you can see in modules/crawler.py, I made a function excludes (line 36) to exclude some types of links from further crawling. For some useful links (external links, telephone numbers, emails), though, I write the URLs to a .txt file for other uses.
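
To give you an idea, that filter works roughly like the simplified sketch below (this is not the exact code from the repo, so check crawler.py for the real signature and rules):

# Simplified sketch of an excludes()-style filter: return True when a link
# should be skipped by the crawler. Names and rules here are illustrative only.
def excludes(link, website):
    if link is None:
        return True
    # Skip anchors, mail and telephone links
    if link.startswith('#') or link.startswith('mailto:') or link.startswith('tel:'):
        return True
    # Skip external links (different domain than the crawled website)
    if link.startswith('http') and website not in link:
        return True
    # Skip links to static files such as images and documents
    for extension in ('.png', '.jpg', '.jpeg', '.gif', '.pdf', '.zip'):
        if link.lower().endswith(extension):
            return True
    return False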

Would a .txt with the images' URLs work for your case, or are you looking to save the images into a folder?

Thanks a lot, Mike 👍
Yes, I want to save the images into a folder, but I don't know how to do that. Please guide me.
How can I use line 36 in the main program (torcrawl.py), or in extractor.py?

I found a quick workaround/starting point, but I'm really not sure about the results:

Replace lines 105-112 of modules/crawler.py with this:

# For each <img src="">, write the canonical image URL to images.txt
outputfile = open(outpath + '/images.txt', 'w+')
for img in soup.findAll('img'):
    imglink = img.get('src')
    if imglink is not None:
        canonicalimage = canonical(imglink, website)
        outputfile.write(canonicalimage + '\n')
outputfile.close()

That part will create a list of the image URLs inside the website's output folder.

So a combination of TorCrawl and wget will probably work for you:
$ python torcrawl.py -w -v -u github.com -c && wget -i www.github.com/images.txt
or
$ python torcrawl.py -v -u github.com -c && torsocks wget -i www.github.com/images.txt

Please let me know if you figure out a better way.

[screenshot]

๐Ÿ‘ : I 'm geting a syntax error when I using that ..

[screenshot of the syntax error]

Oops, of course.. 🤕 We're going to need a ':' on that loop!

I changed my previous comment; please check if it works better now.

By the way, I'll try to find a better way to extract the image links.

Hello Mike, thanks for your help, my friend!
I fixed that and it's running, but I get this error:
[screenshot of the error]

Hey Omid-r!

That's actually weird, because in crawler.py we import the module the proper way (reference).

You can probably try the other way of importing the module and change the following lines to:
modules/crawler.py(Line:8): import BeautifulSoup
modules/crawler.py(Line:93): soup = BeautifulSoup.BeautifulSoup(html_page)

Also, please check that BeautifulSoup is already installed, from the Python terminal:
>>> import BeautifulSoup
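
If your installation ships the newer package name (bs4) instead of the old BeautifulSoup module, a tolerant import like this might save you the back-and-forth. It's just a sketch, so adjust it to whatever you actually have installed:

# Sketch: try BeautifulSoup 4 (package name bs4) first, then fall back to
# the legacy BeautifulSoup 3 module. With this import style you call
# BeautifulSoup(html_page) directly instead of BeautifulSoup.BeautifulSoup(...).
try:
    from bs4 import BeautifulSoup            # pip install beautifulsoup4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3

soup = BeautifulSoup(html_page)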

I fixed that, thanks! How can I fix this one?
[screenshot of the error]

I'm glad to hear it worked well for you!

As I can see from your screenshot, you're on a MacBook, so it's normal that the wget command wasn't found.

There is a similar way on Stack Overflow (here) to download the list of images:
for i in `cat images.txt` ; do curl -O "$i" ; done

EDIT: This answer (here) with xargs seems better:
cat images.txt | xargs -n 1 curl -O
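
If you'd rather stay in Python instead of shelling out, a few lines can do the same job. This is only a rough sketch: it assumes Python 3 and that images.txt contains one absolute URL per line, and it won't route through Tor by itself:

# Sketch: download every URL listed in images.txt into the current folder.
# Assumes absolute URLs, one per line.
import os
import urllib.request

with open('images.txt') as urlfile:
    for line in urlfile:
        url = line.strip()
        if not url:
            continue
        # Derive a local filename from the URL path (fallback name if empty)
        filename = os.path.basename(url.split('?')[0]) or 'image'
        urllib.request.urlretrieve(url, filename)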

Hello Mike,
Please guide me. I handled the errors, but now this is weird:

[screenshot]

images.txt was created, but it's empty, and I get "No URLs found in www.github.com/images.txt."
I think something is wrong...

Hey @Omid-r! Sorry, but I was busy last week.
For the case of GitHub (which uses a sub-domain to host its images), you can try this:

/modules/crawler.py@105:

# For each <img src="">, write the raw src value to images.txt
outputfile = open(outpath + '/images.txt', 'w+')
for img in soup.findAll('img'):
    imglink = img.get('src')
    if imglink is not None:  # Some <img> tags have no src; can't figure out why this occurs
        outputfile.write(imglink + '\n')
outputfile.close()

Please keep in mind that this is a small workaround and not a final solution; edit the code for your specific scenario.
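
If the src values come back relative or protocol-relative (GitHub likes //avatars... style links), something along these lines could help to make them absolute before writing them out. Again, just a sketch: base_url is an assumption here, meaning the full URL of the page you crawled:

# Sketch: resolve relative and protocol-relative <img src=""> values against the page URL.
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def absolute_image_links(soup, base_url):
    links = []
    for img in soup.findAll('img'):
        src = img.get('src')
        if src:
            # urljoin handles 'images/logo.png', '/logo.png' and '//cdn.example.com/logo.png'
            links.append(urljoin(base_url, src))
    return links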

I'll close this issue.

  • If you need any further help, leave me a comment here.
  • If you find a successful way to crawl image links, make a pull request and I'll be happy to review and merge it :)