This is V0.1 and it doesn’t quite work on the example.
FormScraper is a specialized web-scraping application for the particular case where you want to fill out and submit a form and then collect the results into a table.
If you are looking for something more general, then you should probably
read the BeautifulSoup docs. Also, there might be a specific Python
package for your intended application. (Try `pip3 search scrape`.)
It is built using Selenium, BeautifulSoup4, SQLAlchemy, and Pandas.
The Form Scraper contains the following components:
- A query generator that produces the set of all queries, and possibly allows you to break it down (e.g., year 2019, 2018, etc.)
- A Selenium task that runs a collection of queries and processes the result to return a record:
  - Pull up the query page using Selenium (headless or in debug mode)
  - Fill in the details and push go: a dictionary maps fields to selections, handling text, radio buttons, and select menus, and then the submit button is pushed
  - Parse the response into a record: a table-based parser maps fields to record entries (see the sketch after this list)
- A database that stores the query and the response
- A manager that keeps track of open, closed, and dead queries, launches tasks, and stores results in a DB. You can run multiple simultaneous jobs.
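As a rough illustration of the parsing step, here is a minimal sketch of how response tables might be mapped to records with Pandas. The function name and the use of `read_html` are illustrative assumptions, not FormScraper’s actual internals.

```python
# Hypothetical sketch of the table-based parsing step, not the actual
# FormScraper internals: read every HTML table on a response page and
# flatten the rows into dictionaries (one record per row).
from io import StringIO

import pandas as pd


def parse_response(html: str) -> list[dict]:
    """Map each row of each table in the page to a record."""
    tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
    records = []
    for table in tables:
        records.extend(table.to_dict(orient="records"))
    return records
```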
The code outputs its progress to the screen while updating a database. It has only been tested with SQLite databases.
When scraping, three kinds of tables are used. The `inputs` table
manages the inputs required to do the scraping, with one record per
input, and assigns each a unique integer id. The `results` table
tracks the status of inputs as ‘not started’, ‘started’, ‘done’, and
‘error’.
For each table type of interest produced by the query, a table is
created in the database. For example, suppose a submitted form
returns a page with a table showing some author details and then a
table of reviews of the paper. If you want both, you can specify
that in the config and name your tables for archiving the results.
You would then end up with an authors table and a reviews table in
your database that compile results across all queries.
It is assumed that the table headers are the same in each response.
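As a sketch of what this bookkeeping might look like with SQLAlchemy (the column names here are assumptions, not necessarily what the code creates):

```python
# Sketch of the bookkeeping tables, assuming SQLAlchemy Core and SQLite;
# column names are illustrative, not necessarily what FormScraper uses.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("sqlite:///scrape.db")
metadata = MetaData()

# One record per input combination, each with a unique integer id.
inputs = Table(
    "inputs", metadata,
    Column("id", Integer, primary_key=True),
    Column("query", String),  # e.g., a serialized input combination
)

# Tracks each input's status: 'not started', 'started', 'done', or 'error'.
results = Table(
    "results", metadata,
    Column("input_id", Integer),
    Column("status", String),
)

metadata.create_all(engine)  # response tables are created later, per table type
```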
The code uses Firefox, so make sure that is installed. Also install geckodriver.
```
brew install geckodriver
pipenv --three   # if needed
pipenv install   # installs dependencies from the Pipfile
```

Then scan the page that contains the form:

```
pipenv run python formscraper.py scan --url http://www.car-part.com/ --form-tag body
```
By default it will look for `form` tags. Often, pages don’t bother to
use those tags, so you can pick a different one. Here we’ve chosen
`body`.
If you want to see the page pop up in a browser, which is useful when
things go wrong, add the `--debug` option.
The output prints to the screen. You can indicate a file to write to with
the `--output` option.
```
pipenv run python formscraper.py scan --url http://www.car-part.com/ \
    --form-tag body --output example_form.yaml
```
Now the hard part: build the config file. Here is a summary of the fields it uses.
The most important and complicated part is probably the `form_inputs`
field, which contains a list of HTML ids and what values to put into
them. Not all form entries need to be specified if the site’s default
value is acceptable. To specify the values to enter, keep in mind
that we are specifying a range of values. At this time, the code only
supports a full Cartesian product of possibilities. That is, each
value that varies will vary over all of its values, regardless of what the
other values are. You can’t, for example, have it only choose years
up to 1986 if the make is Datsun, but up to the present for Ford.
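In effect, the expansion behaves like `itertools.product` over the per-field value lists. Here is a minimal sketch of that behavior, using field names from the example below:

```python
# Sketch of the Cartesian-product expansion: every field that varies
# takes each of its values, regardless of what the other fields are.
from itertools import product

form_inputs = {
    "year": ["2020", "2019", "2018"],
    "model": ["Chevy Bolt", "Tesla S"],
}

fields = list(form_inputs)
for combo in product(*(form_inputs[f] for f in fields)):
    print(dict(zip(fields, combo)))
    # {'year': '2020', 'model': 'Chevy Bolt'}, {'year': '2020', 'model': 'Tesla S'}, ...
```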
To put in the same value every time, `type` is set to ‘const’
and the string to enter is placed in `value`. Note that numbers must
be entered as strings (in single quotes). If there is more than one
option, then you can set `type` to ‘list’ and assign `value`
a list of values. (In YAML, this can be done with square
brackets, e.g., `['dog', 'cat']`, or with multiple lines, each starting
with a dash and properly indented.) Two more options for `type` are
‘all’ and ‘all-but’. With ‘all’, the full list of options shown in
the forms data will be used. With ‘all-but’, that full list will be used,
except that the entries listed under `value` will be excluded.
It is unlikely that using the full list will work, since the default value is usually also shown as an option in the form. Modifying the forms YAML file is another option, though doing so will trip a change-detection test (making actual form changes undetectable) and print a warning with each input processed.
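For reference, the four `type` options might be written like this in a config; the field ids and values here are made up for illustration:

```yaml
# Illustrative only; the field ids and values are made up.
form_inputs:
  zip:
    type: const            # always enter the same value
    value: '20901'
  year:
    type: list             # vary over an explicit list of values
    value: ['2020', '2019', '2018']
  model:
    type: all              # use every option shown in the form YAML
  make:
    type: all-but          # every option except those listed under value
    value:
      - Select a Make      # e.g., exclude the placeholder default
```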
The other fields are:

- `url` is the URL where the form is found
- `form_yaml` is the filename for the output from the scan command
- `input_form_id` is the key for the form to use in the form file
- `submit_with` is the submit button id (check the form YAML’s `buttons` key)
- `output_db` is the SQLAlchemy connection string for the DB
- `output_table` determines which response tables to track:
  - `select` is ‘by position’ or ‘by positions’
  - `which` is either an index (1-up) or a list of such indices
  - `table_name(s)` is the DB table name(s) to write to, respectively
- `form_wait` and `table_wait` each have four possible sub-fields:
  - `by` is either ‘class’ or ‘id’ and is what you are waiting for
  - `value` is the class name or id value to wait for
  - `delay` is how long to wait before giving up
  - `throttle` (optional) is an extra after-load wait time
All of these fields are required, except for the `...wait` fields.
If those are specified, the `throttle` sub-fields are optional.
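For example, a wait specification might look like the following, where the class name, id, and timings are made-up values:

```yaml
# Illustrative wait fields; the class name, id, and timings are made up.
form_wait:
  by: class            # wait for an element by class name (or 'id')
  value: search-form   # the class name (or id value) to wait for
  delay: 10            # seconds to wait before giving up
  throttle: 2          # optional extra wait after the page loads
table_wait:
  by: id
  value: results-table
  delay: 30
```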
In the car-part.com example, we can construct the following
`example.yaml` file, where the form data has been saved to
`example_form.yaml`:
```yaml
url: http://www.car-part.com
form_yaml: example_form.yaml
input_form_id: 1
form_inputs:
  3:
    type: const
    value: '20901'
  year:
    type: list
    value:
      - '2020'
      - '2019'
      - '2018'
  model:
    type: list
    value:
      - Chevy Bolt
      - Chevy Volt
      - Dodge Colt Vista
      - Tesla S
  4:
    type: const
    value: Radio/CD (see also A/C Control or TV Screen)
  Loc:
    type: const
    value: Mid Atlantic
  5:
    type: const
    value: Price
```
Of course, this doesn’t quite work:

```
pipenv run python formscraper.py scrape example.yaml
```
If the result is an empty table, there’s no problem, but if there is no table at all, the code can throw an error. There is an optional string to check for whose presence indicates that there are no tables; this avoids the error.
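The idea behind that check is a simple substring test on the page source; here is a sketch, with a made-up marker string:

```python
# Sketch of the no-results check, with a made-up marker string: when the
# marker is present, treat the response as empty rather than letting the
# table parser raise because no <table> exists.
from io import StringIO

import pandas as pd

NO_RESULTS_MARKER = "No parts found"  # hypothetical; configure per site


def extract_records(html: str) -> list[dict]:
    """Return row records, or an empty list when the page says no results."""
    if NO_RESULTS_MARKER in html:
        return []  # genuinely no table; not an error
    return pd.read_html(StringIO(html))[0].to_dict(orient="records")
```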
Such errors are hard to predict, but if they happen where expected, a snapshot is taken and opened (on macOS) and the status is set to ‘error’ so that the query will be skipped in the future.
If you lose internet access, then something may fail to load or send, and the code will crash gracelessly.
Other errors are likely due to assumptions about the web page that are wrong for a particular application. Generalization of the code would be difficult.