Outrider provides tools and structure for managing multiple web scraping and data analysis tasks. It provides a command-line interface to a powerful API, useful for collecting, cleaning, storing and analysing data from the web.
Outrider lets you create multiple projects, each relating to a specific domain. Once set up, you use simple commands to run specific tasks for specific projects.
- Download Outrider using the command line
git clone https://github.com/deadlysyntax/outrider.git
and change into the new directory in your terminal. - Install dependencies
bundle install
. - Setup the database by importing ./config/schema.sql into MySQL -
mysql -u username -p database_name < ./config/schema.sql
- Create the database config file in your home directory
~/.outrider/config/database.yml
for storing environment-based database configuration.
host: localhost
username: root
password: root
database: outrider
adapter: mysql2
- Create a new project
rake project:build['project_name','http://domain.com']
- giving it a name and the url of the domain you want to work with. This sets up the project in the database and generates a new folder in ./projects. It creates a file called auxiliary.rb within the new project folder. - In each project's auxiliary.rb file is where we define our functionality specific to that new project (for example, to define what content to pull from each page on a domain).
- Run
./lib/ignite.rb command -p project_name
to initiate each task.
###Auxiliary file
As mentioned above, when you create a new project, Outrider creates a new auxiliary.rb file within the project's folder. This file is where you create the functions that map to the command line interface. If you type ./lib/ignite.rb crawl -p project_name
, Outrider will look for a crawl
method inside auxiliary.rb and run that function. Within those auxiliary functions, you have access to the full Outrider API. The API provides a range of common functionality.
# creates a new project file
rake project:build['nz_herald','http://nzherald.com']
# /projects/nz_herald/auxiliary.rb
# crawl method maps to the first argument passed to the command line
def crawl options
# This method is part of the Outrider API, which you have access to within your auxiliary.rb file
# @config passes our specific project data to the method
# the second argument is a callback which recieves the Nokogiri page object for every age that it visits. http://www.nokogiri.org/tutorials/searching_a_xml_html_document.html
OutriderTools::Crawl::site( @config, ->(page, uri){
# Select elements from the page using css selectors
unless( page.css('.articleTitle').text.strip.empty? )
# Do any data processing from the page
clean_date = DateTime.strptime(page.css('.storyDate').text.strip, '%a %b %d %H:%M:%S %Z %Y').to_s #Tue Mar 03 08:27:23 UTC 2015
# Set the fields that get stored in the database for each page visited
return {
:title_raw => page.css('.articleTitle').text.strip,
:author => page.css('.authorName a').text.strip,
:content_raw => page.css('#articleBody p').map{ |paragraph| paragraph.text.strip }.to_json,
:date_published_raw => page.css('.storyDate').text.strip,
:date_published_timestamp => clean_date,
:status => 'scraped'
}
else
return {
:status => 'rejected'
}
end
})
end
Outrider's goal is to provide a conventional way to create data scraping projects and run tasks for each project, and to make it so that once a task is created, it's easy to run regularly. Your interface for running tasks is the command line. Here I will describe the anatomy of an Outrider command.
A command is made of several parts:
./lib/ignite.rb crawl -p project_name
- Script:
./lib/ignite.rb
is the core file which processes our commands. This is the script we run. - Command:
crawl
is the task to run. This part of the command is mapped to a function you define within auxiliary.rb. - Project:
-p project_name
specifies which project to run. Outrider will look to the functions within the specified project's auxiliary.rb file.
Any functions that you'd like to run via the CLI must be defined in both commandify.rb
and your project's auxiliary.rb
. Outrider comes with some predefined commands set up, which you can write functions for in auxiliary.rb, but you can also extend Outrider and define your own
# in ./lib/outrider/commandify.rb
module Commandify
def self.process
# Place custom command options here. See instructions at http://manageiq.github.io/trollop/
sub_commands << %w()
# Set these to accept arguments through the command line and pass them to your auxiliary methods
command_opts = Trollop::options do
# REQUIRED. Do not mess with the default options.
# Do not duplicate arguments or their short form.
# Run tests after modifying
# opt :domain, "The domain", :short => "-d", :type => String, :default => ''
############################################
# CUSTOM. Place custom command options here
#
end
end
end
Any commands listed in the sub_commands
variable can then be defined within any of your projects' auxiliary.rb
file and run by specifying it in the CLI call.
So if you were to add a sentiment
command to the sub_commands
variable in ./lib/commandify.rb
, then add a sentiment
method to your projects auxiliary file, you could then run that function by calling ./lib/ignite.rb sentiment -p project_name
.
# in ./lib/project/:project_name/auxiliary.rb
def my_own_command( options )
# Here you have access to the OutriderTools module
# Also, options is a hash of the commands passed in through CLI
end
Commands are handled by a gem called Trollop
IMPORTANT!
- Do not modify the existing Trollop configuration
- Read the Trollop documentation at http://manageiq.github.io/trollop/
- Put your own configuration in the specified places - see below.
- Always run tests
rspec spec
after modifying this - TODO - move this functionality into an setup where they don't have to touch commandify.rb
./lib/ignite.rb my_own_command -p project_name -your_argument_key value
Once set up in commandify.rb, calling your new method in auxiliary.rb will pass in a hash of the options specified in the command line call. This means your auxiliary methods need to accept the options hash.
def auxiliary_method( options )
# options contains a hash such as { :project => 'project_name' }
end
A call to ./ignite.rb crawl -p test_project
will
- Check the validity of the command against the API definition in ./lib/outrider/commandify.rb,
- In this case crawl is a legitimate API method, and since that passes it will then
- Look in
./lib/project/test_project/auxiliary.rb
for a public method called crawl. - If it doesn't find it there, it will look in the global
Project
object - It will call crawl and pass in all the options specified in the command line (as long as they're set up in commandify)
Within the auxiliary function you define, you have access to a powerful API that provides functionality and tools that are common to the purposes of Outriders main goals - such as crawling and scraping.
OutriderTools::Crawl::site( project, each_page_callback )
Argument | Expected Value | Description |
---|---|---|
project | { :id => 0, :title => '', :domain => '' } | A hash of project config values |
each_page_callback | ->( page, uri ){} | A callback function to run, which gets passed the Nokogiri object and URI for each page |
Recursively looks to the ProjectData in the database for the first status: 'unscraped'
data record for the specified project. While at each page, it will run the callback and pass it the Nokogiri::HTML object and the current URI. It builds a list of sanitized urls and adds them as ProjectData rows in the database. Thus, recursively filtering through an entire domain and acting on each page
http://www.rubydoc.info/github/sparklemotion/nokogiri
OutriderTools::Scrape::page( url, operate )
Argument | Expected Value | Description |
---|---|---|
url | "http://domain.com" | A url to scrape |
operate | ->( page, uri ){} | A callback function to run, which gets passed the Nokogiri object and URI for each page |
Will go to the URL and run the callback and pass it the Nokogiri::HTML object and the current URI.
http://www.rubydoc.info/github/sparklemotion/nokogiri
When working with Outrider, you first create and then work within a project. You can create as many of these as you want. These let us create custom functionality to handle different jobs uniquely.
Projects can be created using the command line
# Creates ./lib/projects/:project_name/auxiliary.rb
#and adds a record to the projects table in the database,
# including a seed entry to the raw_data table
./lib/ignite.rb create_project -p project_name
# Just adds a record to the projects table of the database
#and the seed entry in the raw_data table (doesn't create an auxialiary file)
./lib/ignite.rb create_project_db_row -p project_name -d http://domain.com
# Deletes the project folder and row in the project table of the database
./lib/ignite.rb delete_project -p project_name
Outrider is assumed to run on two machines - a dev and a production server. When you create a new project you do so on the dev server and it gets added to git, however in order to make a project runnable in production, there must be some entries in the database for that. Outrider has a rake command that is run from the dev server which creates the files and db entry on the dev server and also creates the necessary database entries on the production server.
rake project:build['project_name','project_domain']
# in project root ./
rspec spec