Starter Kit: Web Scraping for Enterprise Characteristics

What is this Starter Kit intended for?

This Starter Kit is a deliverable of WPC Enterprise characteristics, which is part of ESSnet Big Data II.

This Starter Kit is intended as an introduction to web scraping for enterprise characteristics. We hope that it will support producers of official statistics in implementing their own web scraping routines to derive enterprise characteristics. However, the methods and functions in this Starter Kit will most likely need to be adapted to the individual needs and particularities of the respective countries.

Contents of the Starter Kit

The Starter Kit has (so far) three parts:

URLsFinder: A library to identify enterprise URLs with the search engine DuckDuckGo.
URLScraper: Functions to scrape a list of URLs and save the HTML code in a NoSQL database for later analysis.
SocialMediaProfiles: A library to identify social media links on enterprise websites.
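
To give a rough idea of what identifying social media links on a website involves, here is a minimal sketch. It is not the code of the SocialMediaProfiles library: the function name find_social_media_links and the platform list SOCIAL_MEDIA_DOMAINS are assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed, non-exhaustive platform list (illustration only).
SOCIAL_MEDIA_DOMAINS = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "linkedin.com": "LinkedIn",
    "instagram.com": "Instagram",
    "youtube.com": "YouTube",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_social_media_links(html):
    """Return a dict mapping platform name to the links found in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    found = {}
    for href in collector.hrefs:
        # Relative links have no hostname and are skipped by the match below.
        host = (urlparse(href).hostname or "").lower()
        for domain, platform in SOCIAL_MEDIA_DOMAINS.items():
            # Match the domain itself and any subdomain (www., de., ...).
            if host == domain or host.endswith("." + domain):
                found.setdefault(platform, []).append(href)
    return found
```

The function takes the HTML of a page as a string, so it could be applied to pages that were scraped and stored beforehand.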

Each part of the Starter Kit has a Jupyter Notebook (file with .ipynb extension) that serves as a manual on how to use the functions and methods. These manuals are intended for statisticians with little to no programming background and can be viewed on this website. The source code can be consulted by users with a background in programming.

Setting up the environment

The Starter Kit is written for Python 3 (note: the applications will not work on Python 2). We recommend installing Python with the Anaconda distribution. Anaconda comes with several pre-installed libraries that are used in the Starter Kit. Occasionally, you will need to install additional libraries; these are mentioned in the respective part of the Starter Kit. The Anaconda distribution also comes with the Jupyter Notebook software pre-installed. For instructions on how to install Anaconda on your system, consult the Anaconda installation tutorials, which are available for many operating systems.

You can install libraries with the following commands in Anaconda Prompt (or the command line tool of your choice):

conda install <library name>
OR
pip install <library name>

Replace <library name> with the name of the library. For example, to install the library bs4, run "pip install bs4".
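
If you are unsure whether your environment is ready, a short check along the following lines may help. This is only a sketch: the helper missing_libraries is a hypothetical name, and the list of libraries contains just the bs4 example from above, so extend it with the libraries named in the respective part of the Starter Kit.

```python
import sys

# The Starter Kit requires Python 3, so stop early on older interpreters.
if sys.version_info[0] < 3:
    raise SystemExit("Python 3 is required.")

# Imported after the version check because find_spec only exists in Python 3.
import importlib.util


def missing_libraries(names):
    """Return the top-level import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "bs4" is the example used in the text; add further libraries as needed.
missing = missing_libraries(["bs4"])
if missing:
    print("Please install first:", ", ".join(missing))
else:
    print("All listed libraries are available.")
```

Note that the name used to import a library is not always the name used to install it, so check the documentation of the library if the installation command fails.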

Directory structure

The current folder has the following directories:

src - source code of the modules
- obec.py - Initialization code for the URLs Finder Starter Kit classes to be used with Jupyter Notebook.
- URLsFinderWS.py - defines methods for scraping information about the enterprises' URLs from the internet with the help of the search engine DuckDuckGo.
- URLsFinderMLLR.py - defines methods for determining the enterprises' URLs from the scraped information by using the logistic regression machine learning technique.
- StarterKitLogging.py (optional to use) - defines methods for storing log records of the work of the other modules.
- SocialMediaPresenceCollector.py - Source code for finding social media links
- DomainScraper.py - Source code for the URLScraper
URLsFinder
- OBEC_Starter_Kit_URLs_Finder.ipynb - Manual on how to use the library for URL finding
- scrape_data - Destination folder for scraped data
- sbr_data - Source folder for statistical business register data used for scraping
- logs - Location of saved log files
- black_list_urls - Location for the blacklist of URLs that should be ignored by the URLs finder
- machine_learning - Results from machine learning predictions for URLs of enterprises
SocialMediaProfiles
- Starter-Kit_Social_Media_Profiles.ipynb - Manual on how to use the library for finding social media links
- url.txt - example data
URLScraper
- URLScraperApplication.ipynb - Manual on how to scrape URLs
- URLScraperApplication.py - Standalone application for scraping URLs to be run in the command line (does the same as the Jupyter Notebook)
- URLScraperLibrary.py - URL Scraper as library that can be imported
- url.txt - example data
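
As an illustration of how a list of URLs and a blacklist might be combined before scraping, consider the following sketch. It rests on assumptions that are not taken from the Starter Kit: both files are assumed to be plain text with one entry per line, blacklist entries are compared by domain, and the helper names read_lines, domain_of and drop_blacklisted are hypothetical. The actual file formats are described in the respective manuals.

```python
from urllib.parse import urlparse


def read_lines(path):
    """Read a text file and return its non-empty lines without whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def domain_of(url):
    """Return the lower-case host of `url` without a leading 'www.'."""
    # urlparse only recognises the host if a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def drop_blacklisted(urls, blacklist):
    """Keep only the URLs whose domain does not appear in the blacklist."""
    blocked = {domain_of(entry) for entry in blacklist}
    return [url for url in urls if domain_of(url) not in blocked]
```

With two such files, a call like drop_blacklisted(read_lines("url.txt"), read_lines("blacklist.txt")) would return the URLs that remain to be scraped; the file name blacklist.txt is again only an assumption.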

Additional Resources

Here are some alternative software tools that have been used within the ESS for several years and that you may want to try:

urlfinding: Generic software by Statistics Netherlands for finding websites of enterprises using the Google search engine and machine learning. Remark: only 100 Google search queries per day are free.
SummaIstat: Software tools by Istat (the Italian National Institute of Statistics) for web scraping for enterprise characteristics, built with Java and Solr. It uses the Bing search engine for finding websites of enterprises.