Logo Extraction

This script extracts logo image from given website. Following assumptions are considered:

Logo image name contains 'logo' string
Logo image may contain one ore more words of the page's title
If an image does not contain 'logo' string but has some ancestors in a DOM tree which have id, class or data_original_src which contain 'logo' string, it can be a logo image.
The most relevant image is the one whose name contains the least number of symbols around 'logo' string
If an image navigates to any other website on click, it's not considered as logo
If screen size is 400x800, then logo image may not occupy more than 60000px area on screen

Results

This script finds 40 correct results out of 45 given samples. But some sites take too long for navigation(up to 10 minutes). So I put 40 sedonds timeout for navigation. This makes some sites fail with timeout and the total number of correct results is 37-38 out of 45.

Prerequisites

ruby 2.3.4
Google Chrome 64. (64-bit)
selenium-webdriver
rspec to run tests

Installation

Go to https://rvm.io/rvm/install and install rvm
rvm install 2.3.4 to install ruby
rvm gemset create b12 create separate gemset for the script
rvm gemset use b12
gem install bundler
Extract this b12.zip package and go to the directory
bundle --deployment install dependencies

Run

You need to have logo-extraction.txt files with the list of urls in the same format as in the sample file given. (With leading headers row just like in sample)

ruby ./scrape.rb to run the script.
rspec --format documentation to run the tests.

esen/logo_extraction

Logo Extraction

Results

Prerequisites

Installation

Run