
An overarching pipeline for scraping and cleaning web data.


The scrapeAll project is a modular pipeline with two main parts: collecting raw data from websites and cleaning the collected data.

1. Collect raw data from a website or from multiple URLs:

2. Clean raw data:

Collect Raw Data

  1. Urls

    Required Packages:

    Example of how to use: This is a website that contains multiple urls

    cd Desktop\scrapeAll\
    python data/**电煤价格指数/raw/2016/
    Enter the url or the path to the json file that contains multiple urls:
    Enter the class of the div that contains the urls:
    Enter the search keyword (press enter to accept default value: None):
    Enter the XPATH of the next-page-button (press enter if not applicable):
    Enter the number of page to scrape (press enter to accept default value: 1):
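
The prompts above ask for a div class and an optional search keyword. As a rough sketch of what that selection step could look like (not the project's actual code), the snippet below collects the links found inside a div of a given class and keeps only those whose link text contains the keyword. The names `DivLinkCollector` and `collect_links` are hypothetical, and the standard-library parser is an assumption, since the required packages are not listed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DivLinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags nested inside <div class="...">."""

    def __init__(self, div_class, base_url=""):
        super().__init__()
        self.div_class = div_class
        self.base_url = base_url
        self.depth = 0            # div nesting depth inside the target div (0 = outside)
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the target div, or go one level deeper if already inside it.
            if self.depth or self.div_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.current_href = urljoin(self.base_url, attrs["href"])
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((text, self.current_href))
            self.current_href = None


def collect_links(html, div_class, keyword=None, base_url=""):
    """Return hrefs inside the div; keyword=None keeps every link."""
    parser = DivLinkCollector(div_class, base_url)
    parser.feed(html)
    return [href for text, href in parser.links
            if keyword is None or keyword in text]
```

Passing `keyword=None` mirrors the default in the prompt: every link inside the div is kept.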

  2. HTML Table

    Required Packages:

    How to use:

    1. open command prompt and cd to the folder that contains
    2. python {the folder that you want to store the table}
    3. paste or type the url
    4. type table

    Example: This is a website that contains a table

    cd Desktop\scrapeAll\
    python data/raw/2017年5月份**电煤价格指数/
    Enter the url or the path to the json file that contains multiple urls:
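
To illustrate what extracting a table from a page involves, here is a minimal sketch that reads the rows of an HTML table and serializes them as CSV. It is not the project's implementation: the names `TableRowParser` and `table_to_csv` are hypothetical, CSV as the output format is an assumption, and the sample values in the usage note are made up.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None   # cells of the row being read, None outside <tr>
        self.cell = None  # text fragments of the cell being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.cell = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

For example, `table_to_csv("<table><tr><th>Region</th><th>Price</th></tr></table>")` yields a single header line, `Region,Price`.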

  3. Image

    Required Packages:

    How to use:

    1. open command prompt and cd to the folder that contains
    2. python {the folder that you want to store the image}
    3. paste or type the url
    4. type image
    5. type the minimum size, in bytes, of the images to download (e.g. type 25000 to download only images larger than 25 KB)

    Example: This is a website that contains images

    cd Desktop\scrapeAll\
    python data/raw/关于2020年2月广东电力市场结算情况的通告/
    Enter the url or the path to the json file that contains multiple urls:
    Enter the minimum image size (in bytes) (press enter to accept default value: 15000 bytes):
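
The size threshold can be pictured as follows. This is a sketch under assumptions, not the script's code: `parse_min_size` and `filter_by_size` are hypothetical names, and the images are represented as an in-memory mapping of file name to bytes rather than real downloads. Only the 15000-byte default comes from the prompt above.

```python
DEFAULT_MIN_BYTES = 15000  # default shown in the prompt above


def parse_min_size(answer, default=DEFAULT_MIN_BYTES):
    """Turn the typed answer into a byte count; an empty answer means the default."""
    answer = answer.strip()
    return int(answer) if answer else default


def filter_by_size(images, min_bytes=DEFAULT_MIN_BYTES):
    """Return the names of images whose content is larger than min_bytes."""
    return [name for name, data in images.items() if len(data) > min_bytes]
```

A threshold like this is a simple way to skip icons and decorative images, which tend to be much smaller than scanned tables or charts.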

  4. PMOS PDF

    Required Packages:

    How to use:

    1. open command prompt and cd to the folder that contains
    2. python {the folder that you want to store the pdf}
    3. paste or type the PMOS url
    4. type pmos
    5. type the html class of the pdf element (press enter to accept the default value el-table__row)
    6. type the number of pages you want to scrape (press enter to accept the default value 1)
    7. type the search keyword (press enter to accept the default value None). If you give a search keyword, only pdfs that contain that keyword are downloaded.

    Example: This is a PMOS website

    cd Desktop\scrapeAll\
    python data/raw/Shandong_PMOS/
    Enter the url or the path to the json file that contains multiple urls:
    Enter the element class (press enter to accept default value: el-table__row):
    Enter the number of page to scrape (press enter to accept default value: 1):
    Enter the search keyword (press enter to accept default value: None):

    Note: This script opens a browser window and scrapes each pdf from the PMOS website. It also creates a json file that stores the url of each pdf. On later runs, the script uses this json file to recognize the pdfs it has already collected, so it does not scrape them again.
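
The bookkeeping described in the note can be sketched like this. It is an illustration only: the function names and the record layout (a plain JSON list of urls) are assumptions, and the real json file may store more information per pdf.

```python
import json
from pathlib import Path


def load_seen(record_path):
    """Return the set of urls already recorded, or an empty set on the first run."""
    path = Path(record_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new(urls, record_path):
    """Keep only the urls that are not in the record yet, preserving order."""
    seen = load_seen(record_path)
    return [url for url in urls if url not in seen]


def mark_done(urls, record_path):
    """Add urls to the record so that later runs skip them."""
    seen = load_seen(record_path) | set(urls)
    Path(record_path).write_text(
        json.dumps(sorted(seen), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

Calling `mark_done` only after a pdf has been saved successfully means an interrupted run can be restarted without losing or repeating work.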

Clean Raw Data

Required Packages:

For text extraction: (attributed to Steven Zheng)

  • tesseract-ocr
  • Go to src/getClean/ and change the variable at line 14 to your tesseract path

For image extraction:

  • pdf2image
  • From the above website, download the poppler for your laptop.
  • Go to src/getClean/ and change the variable at line 10 to your poppler bin path
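
Before editing those two variables, it can help to confirm that the paths actually exist on your machine. The check below is a small sketch with a hypothetical function name and hypothetical arguments; it does not read or modify anything in src/getClean/.

```python
from pathlib import Path


def check_tool_paths(tesseract_cmd, poppler_bin):
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    # The tesseract setting is expected to point at the executable itself.
    if not Path(tesseract_cmd).is_file():
        problems.append(f"tesseract executable not found: {tesseract_cmd}")
    # The poppler setting is expected to point at the bin folder, not a file.
    if not Path(poppler_bin).is_dir():
        problems.append(f"poppler bin folder not found: {poppler_bin}")
    return problems
```

An empty result means both values can be pasted into the configuration as they are.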

  1. Extract Tables from PDF

    How to use:

    1. open command prompt and cd to the folder that contains
    2. python {the path to the pdf} {the path to the folder that you want to store the table}

       Note: If {the path to the folder that you want to store the table} is omitted, the path to the pdf is used with 'raw' replaced with 'clean' (e.g. if the path to the pdf is data/raw/xxxx.pdf, the cleaned table is stored in data/clean/xxxx/).

       Note: Images of the pdf are created and stored in a temporary folder in the same folder as the pdf.
    3. type pdf


    cd Desktop\scrapeAll\
    python data/raw/山东工作日报.pdf data/clean/山东工作日报/
    What type of data? image/pdf/table

    Note: To convert a folder of pdfs, pass the folder path as the input path (e.g. python data/folder_with_multiple_pdfs/ data/folder_to_store_clean_table).
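
The default output location described in step 2 (replace 'raw' with 'clean' when no output folder is given) can be sketched as below. The function name is hypothetical, and replacing a whole path component rather than any occurrence of the substring 'raw' is an assumption about how the script behaves.

```python
from pathlib import PurePosixPath


def default_clean_path(raw_path):
    """Derive the output folder for a raw pdf or a raw folder."""
    path = PurePosixPath(raw_path)
    # A pdf named xxxx.pdf gets an output folder named xxxx.
    if path.suffix.lower() == ".pdf":
        path = path.with_suffix("")
    # Swap the 'raw' path component for 'clean' and leave the rest untouched.
    parts = ["clean" if part == "raw" else part for part in path.parts]
    return str(PurePosixPath(*parts)) + "/"
```

With this rule, `default_clean_path("data/raw/xxxx.pdf")` gives `data/clean/xxxx/`, matching the example in the note, and a folder input such as `data/raw/xxxx/` maps to the same location.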

  2. Extract Tables from Images

    How to use:

    1. open command prompt and cd to the folder that contains
    2. python {the path to the folder that contains images} {the path to the folder that you want to store the table}

       Note: If {the path to the folder that you want to store the table} is omitted, the path to the folder that contains images is used with 'raw' replaced with 'clean' (e.g. if the path to the folder that contains images is data/raw/xxxx/, the cleaned table is stored in data/clean/xxxx/).
    3. type image


    cd Desktop\scrapeAll\
    python data/raw/website_image data/clean/website_table/
    What type of data? image/pdf/table