Milliped is a flexible framework that helps you browse websites and extract
information from pages automatically. It offers a customizable Browser class
to carry out three main activities:
- browsing: navigating from page to page and building a sequence of pages that contain information we want to extract.
- harvesting: downloading pages and storing their HTML code.
- extracting: parsing the HTML code of harvested pages, extracting information from the parsed data, then storing the results.
Separating browsing, harvesting, and extracting into distinct activities has several benefits: it is easier to enforce data immutability (data is not overwritten once produced), reproducibility (the outputs of browsing, harvesting, and extraction are stored separately), and distribution of activities across processes or machines.
Milliped lets you customize its behavior at every step. For example, out-of-the-box functionality includes downloading pages using basic HTTP requests, using Tor, or using Selenium to harvest data from dynamic pages.
Using pip is the easiest way to install
Milliped. If you are not sure you have pip installed, check by running:
pip3 -V
Verify the output shows a version of Python greater than or equal to 3.6.
If necessary, install pip by following the instructions from pip's
documentation.
Download the Milliped code repository, then navigate to the root folder of the repository and finally run:
pip3 install .
Milliped is a flexible framework with many features and options. To simplify
its use for first-timers, Milliped provides a SimpleBrowser class with
sensible defaults.
To get started with SimpleBrowser, you just need two things:
- a base URL: URL of a website you would like to crawl.
- a soup parser: function that extracts information from a
BeautifulSoup object.
Choose a website that allows crawling; you can search online for the words
"scraping sandbox" if you need inspiration. Make sure you respect the rules
described in the website's terms of use and in its robots.txt file.
In this example, we are going to crawl the website books.toscrape.com and extract book titles and prices.
To extract information from web pages, we will use the function soup_parser
from the module examples.
def soup_parser(soup):
    try:
        result = {
            "title": soup.find("h1").text,
            # remove Sterling pound sign
            "price": float(soup.find("p").text.replace("\u00a3", "")),
        }
        return result
    except Exception:
        pass
    return {}
soup_parser tries to parse book title and price from a BeautifulSoup
object and returns a dictionary with keys named "title" and "price".
Open your favorite Python interpreter and run:
from milliped import examples
browser = examples.SimpleBrowser(
    url="https://books.toscrape.com",
    soup_parser=examples.soup_parser,
)
You are now ready to go. You can check log messages emitted during crawling by
watching the file browser.log in your current folder.
To start crawling the website, run:
browser.browse()
The browse method will traverse the website in a breadth-first search
fashion, by maintaining a queue of URLs parsed from each web page. In the
current SimpleBrowser configuration, every page queued for browsing is
also queued for harvesting, so these pages can be downloaded for actual
data extraction using the following command:
browser.harvest()
SimpleBrowser is configured to download the HTML code of every page
previously queued for harvesting and store it in compressed archives that
follow the naming pattern harvest_x.bz2 where x is the file number.
Archive size is capped at 100 MB, so once a file reaches this size a new
file is produced with an incremented file number.
Finally, we run the data extraction step:
browser.extract()
Extraction results are saved in the file extract.jsonl, which follows the
JSON Lines specification.
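Because each line of extract.jsonl is a standalone JSON document, the results are easy to load back with the Python standard library. A minimal sketch, assuming the extraction above wrote extract.jsonl to the current folder:

import json

results = []
with open("extract.jsonl") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))

print(len(results), "records extracted")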
The Browser class is defined in the browser.py module. Browser organizes
the three main activities that make up extracting information from a website:
- browsing: navigating from page to page and building a sequence of pages that contain information we want to extract.
- harvesting: downloading pages and storing their HTML code.
- extracting: parsing the HTML code of harvested pages, extracting information from the parsed data, then storing the results.
The behavior of the Browser class is defined at runtime by passing arguments
to its parameters. To customize Browser behavior, users do not have to
subclass Browser, but can simply use one of the functions or classes
available in the Milliped library (or write their own) and pass it as an
argument to Browser at runtime.
The Browser class can be instantiated with the following parameters:
- base_url: URL of the website to browse (required).
- get_browsable: function which returns the URL of the next page to browse (optional, default returns all links).
- stop_browse: function to determine if we should stop browsing (optional, default is None).
- get_harvestable: function which returns the URL of the next page to harvest (optional, default returns all links).
- get_page_id: function which converts the URL into a unique ID, used when saving harvested pages (optional, default returns the MD5 hash).
- browse_queue: object to store a queue of web pages to browse (required).
- harvest_queue: object to store a queue of web pages to parse (required).
- download_manager: object to manage downloading web pages (required).
- html_parser: parser to use with BeautifulSoup, e.g. "html.parser", "lxml", etc. (optional, default is "html.parser").
- soup_parser: function to use to parse the HTML tag soup into a dictionary (required).
- harvest_store: object to manage storage of downloaded web pages (required).
- extract_store: object to manage storage of data extracted from parsed web pages (required).
- logger: configured Logger object from the Python standard library logging module (optional, default logs with the level INFO to the file browser.log in the current working directory).
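To give a feel for how these pieces fit together, here is a hedged sketch of assembling a Browser from built-in components. The import paths are assumptions based on the module names mentioned in this documentation (browser.py, download_managers.py, queues.py), and the two store classes are toy in-memory stand-ins written only for illustration (file-based sketches appear in the harvest_store and extract_store sections below).

from milliped.browser import Browser                          # assumed import path
from milliped.download_managers import SimpleDownloadManager  # assumed import path
from milliped.queues import LocalQueue                        # assumed import path

class InMemoryHarvestStore:
    """Toy harvest store keeping (page_id, content) pairs in a list."""
    def __init__(self):
        self.pages = []
    def put(self, page_id, content):
        self.pages.append((page_id, content))
    def get(self):
        return self.pages.pop(0)
    def __len__(self):
        return len(self.pages)

class PrintExtractStore:
    """Toy extract store that just prints extracted records."""
    def write(self, data):
        print(data)

def soup_parser(soup):
    # Return a dictionary of fields extracted from the BeautifulSoup object.
    title = soup.find("h1")
    return {"title": title.text if title else None}

browser = Browser(
    base_url="https://books.toscrape.com",
    browse_queue=LocalQueue(name="browse"),
    harvest_queue=LocalQueue(name="harvest"),
    download_manager=SimpleDownloadManager(base_url="https://books.toscrape.com"),
    soup_parser=soup_parser,
    harvest_store=InMemoryHarvestStore(),
    extract_store=PrintExtractStore(),
)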
We will discuss these parameters in more detail in the following sections. A
few parameters are used across all Browser activities and will be detailed
here.
This is the URL of the website. If no initial URL is given as an argument
to the Browser.browse() method, browsing will start at base_url. Any
relative URL encountered by Browser will be prepended with base_url.
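For instance (a conceptual illustration, not Milliped's exact URL-joining code):

base_url = "https://books.toscrape.com"
relative_link = "/catalogue/page-2.html"
full_url = base_url + relative_link
# full_url == "https://books.toscrape.com/catalogue/page-2.html"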
Built-in download manager objects are defined in the module
download_managers.py.
A download_manager object controls everything related to downloading web
pages:
- Which pages are downloaded:
  - Does the file robots.txt allow it?
  - Does the page URL start with base_url?
- How pages are downloaded:
  - Retry strategy
  - Headers
  - Proxies
A download_manager object must have the following methods:
- download(): takes a URL, downloads the web page, then returns a tuple (status code, response content).
- sleep(): pauses for the amount of time defined by request_delay.
This is a Logger object from the Python standard library logging module.
This is configured by default with a file handler that appends messages with
level INFO to a file named browser.log in the current working directory.
The Milliped utils.py module provides the function get_logger, which makes it
easy to configure a logger and its handlers using a dictionary, as follows:
import logging

from milliped.utils import get_logger

log_config = {
    "handlers": [
        {
            "handler": "file",
            "handler_kwargs": {
                "filename": "browser.log",
                "mode": "a",
            },
            "format": "%(asctime)s %(levelname)s %(message)s",
            "level": logging.INFO,
        },
    ]
}

logger = get_logger(__name__, **log_config)
Browsing is carried out by the Browser.browse() method. It starts from a URL
that is defined using the initial parameter, then traverses the website from
link to link using breadth-first search. If no initial URL is given, browsing
starts from the URL passed to the base_url parameter of the Browser class.
Child URLs are collected using the function passed to the Browser
get_browsable parameter. Child URLs are added to a queue object passed to
the Browser parameter browse_queue. This queue also keeps track of which
URLs it visited, to avoid visiting a page twice and falling into an
infinite browsing loop. Another way to control when browsing stops is the
Browser parameter stop_browse. This parameter expects a function that
inspects web page contents and decides if browsing should end or not.
In addition to the browse_queue, the browse() method builds a queue of URLs
for harvesting. This queue object is passed to the Browser parameter
harvest_queue. Like browse_queue, harvest_queue keeps track of which URLs
it visited to avoid harvesting a page twice.
This function takes a BeautifulSoup object as an argument and returns an
iterable sequence of URLs.
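For example, here is a hedged sketch of a get_browsable function that keeps only links pointing into a site's catalogue section (the "catalogue" filter is purely illustrative):

def get_browsable(soup):
    # Collect the href attribute of every link that targets the catalogue.
    urls = []
    for tag in soup.find_all("a", href=True):
        if "catalogue" in tag["href"]:
            urls.append(tag["href"])
    return urls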
This function must take a BeautifulSoup object as an argument and return a
boolean: if True, the browse() method returns; if False, the browse() method
continues.
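For example, here is a hedged sketch of a stop_browse function that ends browsing once a page with a particular heading is reached (the condition is purely illustrative):

def stop_browse(soup):
    # Stop when a page whose first h1 reads "Last page" is encountered.
    heading = soup.find("h1")
    return heading is not None and heading.text.strip() == "Last page"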
This object stores items in a first-in, first-out fashion. The queue must keep track of which items it has already seen and avoid storing an item again, even if that item has since been removed from the queue.
The browse_queue object must provide the following methods:
- enqueue(): add an object to the queue if it has never been added to the queue before.
- dequeue(): remove an object from the queue and return it.
- re_enqueue(): add an object to the queue even if it was previously added. This is useful to process again an object that failed to be processed.
browse_queue must provide the following attributes:
- is_empty: returns True if the queue has no elements, or False if the queue has one or more elements.
Harvesting is performed by the Browser.harvest() method. Harvesting
sequentially goes through the URLs stored in harvest_queue, downloads web page
contents, and stores them using the object passed to the Browser parameter
harvest_store. When harvested pages are stored, they are labeled with a
string given by the function get_page_id so they can be uniquely identified.
Harvested web pages will later be processed during the extract phase.
This function takes a BeautifulSoup object as an argument and returns an
iterable sequence of URLs.
This object stores items in a first-in, first-out fashion. The queue must keep track of which items it has already seen and avoid storing an item again, even if that item has since been removed from the queue.
harvest_queue must provide the following methods:
- enqueue(): add an object to the queue if it has never been added to the queue before.
- dequeue(): remove an object from the queue and return it.
- re_enqueue(): add an object to the queue even if it was previously added. This is useful to process again an object that failed to be processed.
The harvest_queue object must provide the following attributes:
- is_empty: returns True if the queue has no elements, or False if the queue has one or more elements.
This is a function which takes the URL of a page and returns a unique identifier that can be used as a file name, a primary key value, etc.
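For example, a sketch mirroring the documented default behavior of hashing the URL with MD5:

import hashlib

def get_page_id(url):
    # Hash the URL so the identifier is safe to use as a file name or key.
    return hashlib.md5(url.encode("utf-8")).hexdigest()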
This object manages storage of harvested web page code. This object must have the following methods:
- put(): takes the unique identifier generated by get_page_id and the contents of a web page, then stores them for later extraction.
- get(): takes no arguments and returns a tuple made of a web page identifier and contents.
- __len__(): the harvest_store object must be callable by the Python built-in function len().
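As an illustration of this interface, here is a hedged sketch of a harvest store that keeps one file per page in a local directory. It is not the library's implementation (the built-in store described in the quickstart uses compressed bz2 archives), and the exact semantics of get(), such as whether a page is removed once returned, may differ in the real object.

import os

class DirectoryHarvestStore:
    """Toy harvest store: one file per harvested page in a local directory."""

    def __init__(self, path="harvest"):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def put(self, page_id, content):
        # content is the page HTML as bytes
        with open(os.path.join(self.path, page_id), "wb") as f:
            f.write(content)

    def get(self):
        # Return a (page_id, content) tuple for one stored page.
        page_id = sorted(os.listdir(self.path))[0]
        with open(os.path.join(self.path, page_id), "rb") as f:
            return page_id, f.read()

    def __len__(self):
        return len(os.listdir(self.path))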
Extraction is carried out by the Browser.extract() method. This method
retrieves web page contents from the harvest_store object, generates a
BeautifulSoup object from web page contents, extracts information from
the BeautifulSoup object into a dictionary, then stores extracted information
using the Browser.extract_store object.
This is a string that is passed as the features parameter of the
BeautifulSoup constructor.
By default, html_parser is set to "html.parser".
This function is responsible for extracting information from the
BeautifulSoup object made from web page contents. It takes a BeautifulSoup
object as input and returns a dictionary.
This object manages storage of extracted data. It must have the following methods:
- write(): takes a dictionary or a sequence (list or tuple) of dictionaries containing extracted data, then durably stores the data.
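As an illustration of this interface, here is a hedged sketch of an extract store that appends records to a JSON Lines file, similar in spirit to the extract.jsonl output described in the quickstart but not the library's own code:

import json

class JsonLinesExtractStore:
    """Toy extract store: append extracted records to a JSON Lines file."""

    def __init__(self, path="extract.jsonl"):
        self.path = path

    def write(self, data):
        # Accept either a single dictionary or a sequence of dictionaries.
        records = data if isinstance(data, (list, tuple)) else [data]
        with open(self.path, "a") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")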
Built-in download manager objects are defined in the module
download_managers.py.
A download manager controls everything related to downloading web pages:
- Which pages are downloaded:
  - Does the file robots.txt allow it?
  - Does the page URL start with base_url?
- How pages are downloaded:
  - Retry strategy
  - Headers
  - Proxies
A download manager object must have the following methods:
- download(): takes a URL string, downloads the web page, then returns a tuple (status code, response content). The status code must be an integer and the response content must be bytes.
- sleep(): pauses for the amount of time defined by request_delay.
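As an illustration of this interface, here is a hedged sketch of a bare-bones download manager built on the Python standard library. Unlike Milliped's built-in managers, it performs a single plain GET with no retries, robots.txt checks, headers, or proxies.

import time
import urllib.error
import urllib.request

class MinimalDownloadManager:
    """Toy download manager implementing download() and sleep()."""

    def __init__(self, base_url, request_delay=1):
        self.base_url = base_url
        self.request_delay = request_delay

    def download(self, url):
        # Return (status code, response content) as (int, bytes).
        try:
            with urllib.request.urlopen(url, timeout=3) as response:
                return response.status, response.read()
        except urllib.error.HTTPError as error:
            return error.code, b""

    def sleep(self):
        # Pause between requests for request_delay seconds.
        time.sleep(self.request_delay)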
The class SimpleDownloadManager is built on top of the Python library
requests and provides basic
functionality to download web pages. Its constructor has the following
parameters:
- base_url: URL of the website to browse (required).
- max_retries: maximum number of times a request for a web page is made before failing (optional, default 10).
- backoff_factor: exponential backoff factor (optional, default 0.3).
- retry_on: HTTP status codes of responses that lead to a request retry (optional, default (500, 502, 503, 504)).
- headers: request headers (optional, default None).
- proxies: list of proxies (optional, default None).
- timeout: timeout for requests in seconds (optional, default 3).
- request_delay: time in seconds to wait between requests. This value will be searched in robots.txt but will default to the user-defined value (optional, default 1).
- logger: logger object from the Python standard library logging module (optional, default logs messages to the file browser.log in the current working directory).
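For example (the import path is an assumption based on the module name download_managers.py):

from milliped.download_managers import SimpleDownloadManager

download_manager = SimpleDownloadManager(
    base_url="https://books.toscrape.com",
    max_retries=5,
    headers={"User-Agent": "milliped-example"},
    request_delay=2,
)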
The class TorDownloadManager is built on top of
tor and the Python library
stem and provides basic
functionality to download web pages. Refer to the documentation to install and
configure these libraries for your operating system.
TorDownloadManager's constructor has the following parameters:
- base_url: URL of the website to browse (required).
- max_retries: maximum number of times a request for a web page is made before failing (optional, default 10).
- backoff_factor: exponential backoff factor (optional, default 0.3).
- retry_on: HTTP status codes of responses that lead to a request retry (optional, default (500, 502, 503, 504)).
- headers: request headers (optional, default None).
- proxies: list of proxies (optional, default None).
- timeout: timeout for requests in seconds (optional, default 3).
- request_delay: time in seconds to wait between requests. This value will be searched in robots.txt but will default to the user-defined value (optional, default 1).
- max_requests: number of requests made before the Tor relay is rotated (optional, default 50).
- tor_password: password to authenticate with the Tor client (required).
- logger: logger object from the Python standard library logging module (optional, default logs messages to the file browser.log in the current working directory).
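For example (the import path is an assumption based on the module name download_managers.py, and the password is a placeholder):

from milliped.download_managers import TorDownloadManager

download_manager = TorDownloadManager(
    base_url="https://books.toscrape.com",
    max_requests=25,           # rotate the Tor relay every 25 requests
    tor_password="change-me",  # placeholder: use your own Tor control password
)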
The class FirefoxDownloadManager is built on top of the
Selenium client for Python and
Geckodriver, the WebDriver for
Firefox. It allows extraction of data from page contents generated by
Javascript.
Refer to the documentation to install and configure Selenium and Geckodriver.
FirefoxDownloadManager's constructor has the following parameters:
- base_url: URL of the website to browse (required).
- max_retries: maximum number of times a request for a web page is made before failing (optional, default 10).
- wait_page_load: number of seconds to wait for page contents to load; some pages rich in Javascript contents take a few seconds to completely load (optional, default 20).
- headers: request headers (optional, default None).
- proxies: list of proxies (optional, default None).
- driver_path: path to the Geckodriver executable.
- options: list of Geckodriver options, such as "headless", "window-size=1420,1080", etc.
- webdriver_log_path: path to the web driver log file.
- request_delay: time in seconds to wait between requests. This value will be searched in robots.txt but will default to the user-defined value (optional, default 1).
- logger: logger object from the Python standard library logging module (optional, default logs messages to the file browser.log in the current working directory).
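For example (the import path is an assumption based on the module name download_managers.py, and the driver path is a placeholder to adjust for your system):

from milliped.download_managers import FirefoxDownloadManager

download_manager = FirefoxDownloadManager(
    base_url="https://books.toscrape.com",
    driver_path="/usr/local/bin/geckodriver",
    options=["headless", "window-size=1420,1080"],
    wait_page_load=10,
)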
The Browser class uses two queues to keep track of web pages during browsing
and harvesting. These queues serve as the basis of the breadth-first search
traversal of the website.
Built-in queues are defined in the module queues.py.
A queue object must provide the following methods:
- enqueue(): takes an item and stores it in the queue if it has not been previously enqueued. The item must be of a hashable type.
- re_enqueue(): takes an item and stores it in the queue even if it has been previously enqueued.
- dequeue(): removes an item from the queue and returns it.
- __len__(): returns the number of items remaining in the queue.
- __contains__(): takes an item and returns True if the item is currently present in the queue, otherwise False.
A queue object must provide the following attributes:
- is_empty: returns True if the queue has no elements, otherwise False.
To ensure that queued items are added only once, built-in queues store items in a set as well as in a queue; this is why items need to be hashable.
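As an illustration, here is a hedged sketch of a queue satisfying this interface, in the spirit of the built-in LocalQueue described below but not its actual code:

from collections import deque

class SimpleLocalQueue:
    """Toy FIFO queue that remembers every item ever enqueued."""

    def __init__(self):
        self.items = deque()
        self.seen = set()

    def enqueue(self, item):
        # Add the item only if it has never been enqueued before.
        if item not in self.seen:
            self.seen.add(item)
            self.items.append(item)

    def re_enqueue(self, item):
        # Add the item even if it was enqueued before (e.g. after a failure).
        self.seen.add(item)
        self.items.append(item)

    def dequeue(self):
        return self.items.popleft()

    def __len__(self):
        return len(self.items)

    def __contains__(self, item):
        return item in self.items

    @property
    def is_empty(self):
        return len(self.items) == 0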
The class LocalQueue uses collections.deque to store items, and the
built-in set class to help store unique items.
LocalQueue has the following parameters:
- name: name of the queue, displayed in log messages.
- logger: logger object from the Python standard library logging module (optional, default logs messages to the file browser.log in the current working directory).
The class SQSQueue uses an AWS standard SQS queue to store items, and a
DynamoDB table to help store unique items. By using an SQS queue, it is
possible to run browsing and harvesting in parallel.
The following AWS resources must be configured before instantiating this class:
- an AWS standard SQS queue
- a DynamoDB table with a partition key of type string and name "id"
- an IAM role to send and receive SQS messages, and get queue attributes, as well as put an item into the DynamoDB table
- the environment variables AWS_PROFILE and AWS_REGION configured to use the IAM role described above
SQS queues have massive throughput, so no specific configuration is necessary for browsing to scale. DynamoDB throughput, on the other hand, needs careful configuration. There are two read/write capacity modes: on-demand and provisioned. You may use on-demand mode when setting up a new workload to take advantage of a simpler configuration, but watch costs carefully. Provisioned mode is cheaper, but requests will be throttled if you go beyond the provisioned throughput. Read the AWS documentation for more details.