/image_downloader

Lib for parsing web pages to find images in specified locations and downloading them simultaneously or sequentially. Picture downloader and grabber for images, photos, pictures (.jpg, .jpeg, .png, .gif, .ico, .svg, .bmp)...

Primary LanguageRubyMIT LicenseMIT

Image Downloader

Quite often there is a need to collect pictures from one or another page on the Internet. This plugin solves this particular task.

Installation

sudo gem install image_downloader

Requirements

  • ruby 1.8 or 1.9

  • gem nokogiri

Description

Image Downloader is a rather simple library which does the following:

  • get web page (with Net::HTTP)

  • parse html page (use regexp or nokogiri)

  • download images (in one or multi-threads)

Example usage

After installation, you can use the following code as an example:

require 'rubygems'
require 'image_downloader'

page_url = 'www.test.com'
target_path = 'img_dir/'
downloader = ImageDownloader::Process.new(page_url,target_path)

#####
# download all images on page in any place (by regexp, all that look like url with image)
downloader.parse(:any_looks_like_image => true)

##### or
# download images from all elements where usually images placed (<img...>, <a...>, ...)
downloader.parse()

##### or
# download image from exect places in page
downloader.parse(:collect => {:link_icon => true})

##### or
# download images by regexp
downloader.parse(:regexp => /[^'"]+\.jpg/i)

downloader.download()

For “parse” method available following options

# find all url which contain image extansion
:any_looks_like_image => true

# find images in specified location
:collect => {
  :all => true, # all image places
  :(img_src|a_href|style_url|link_icon) => true # specified location
}

# find by regexp
:regexp => /['"]([^'"]+\.jpg)[^'"]*['"]/i) # for ruby 1.8 (in 1.9 not allowed () for scan method)
:regexp => /[^'"]+\.jpg/i # the same, but shorter
:regexp => /[^'"]+\.css/  # other files can also be downloaded

# ignore URLs with images according to given parameters
:ignore_without => {:(extension|image_extension) => true}

# setting the favorite User-Agent (vary important for exclude 403, 404... responses from server)
:user_agent => "ruby" # Mozilla/5.0 by default

Detailed location description

  • img_src - tag: img, attribute: src=“url”

  • a_href - tag: a, attribute: href=“url”

  • style_url - tag: any, attribute: style=“(background|background-image): url(‘url’)”

  • link_icon - tag: link, attribute: rel=“shortcut icon” href=“url”

For “download” method you can use following directives

:parallel => true # for multi thread downloading (this is default if no options)
:consequentially => true, # for sequential downloading into a single stream
:user_agent => "ruby" # Mozilla/5.0 by default

Executables

You can simply use the executed shell commands:

For any looks like image download

download_any_images url dir/

For download favicon only

download_icon url dir/

For download all, that is located in the places for pictures

download_images url dir/

For download by regexp

download_by_regexp url dir/ "[^'\"]+\\.js"

Debugging

“-d”, “–debug”

To monitor the process of downloading, use the -d flag in the parameters. Perhaps there is an error URI::InvalidURIError in some cases.

download_images url dir/ -d

Copyright © 2011 Malykh Oleg. See LICENSE.txt for further details.

License

The MIT License

Authors

Personal blog author: Malykh Oleg - blog in russian