An overarching pipeline for scraping and cleaning web data.
The scrapeAll project provides a convenient, modular solution for collecting raw data from websites and cleaning the collected data. It consists of two main parts: collecting raw data and cleaning raw data.
1. Collect raw data from a website or from multiple urls (getRaw.py)
2. Clean the raw data (getClean.py)
- Urls
Required Packages:
Example of how to use: This is a website that contains multiple urls
```
cd Desktop\scrapeAll\
python getRaw.py data/**电煤价格指数/raw/2016/
Enter the url or the path to the json file that contains multiple urls: https://www.cctd.com.cn/list-46-1.html
urls/table/image/pmos_pdf? urls
Enter the class of the div that contains the urls: new_list
Enter the search keyword (press enter to accept default value: None): 2016
Enter the XPATH of the next-page-button (press enter if not applicable): //*[@id="pages"]/a[5]
Enter the number of page to scrape (press enter to accept default value: 1): 3
```
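The `urls` mode boils down to collecting every link inside a div of a given class (`new_list` in the example above). The sketch below illustrates that idea with the standard library only; it is not getRaw.py's actual implementation, and the HTML snippet and urls are made up (a real page would first be fetched over HTTP):

```python
from html.parser import HTMLParser

class DivLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside a div of a given class.

    Simplification: the class attribute is matched exactly, not as a
    space-separated class list."""
    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.depth = 0          # nesting depth inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1           # nested div inside the target
            elif attrs.get("class") == self.div_class:
                self.depth = 1            # entered the target div
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

# Made-up listing page in the shape getRaw expects
page = '''
<div class="new_list">
  <a href="https://www.cctd.com.cn/show-46-1.html">2016 report</a>
  <a href="https://www.cctd.com.cn/show-46-2.html">2017 report</a>
</div>
<div class="other"><a href="https://example.com/skip.html">skip</a></div>
'''
parser = DivLinkCollector("new_list")
parser.feed(page)
print(parser.links)
```

Links found outside the target div (the `other` div above) are ignored; the collected urls are what a later pass would scrape one by one.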
- Table
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the table}`
- paste or type the url
- type `table`
Example: This is a website that contains a table
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/2017年5月份**电煤价格指数/
Enter the url or the path to the json file that contains multiple urls: https://www.cctd.com.cn/show-46-167312-1.html
table/image/pdf/pmos? table
```
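The `table` mode amounts to flattening an HTML `<table>` into rows of cell text. A minimal stdlib sketch of that idea (not getRaw.py's actual code; the sample table and its values are made up):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Flatten <tr>/<td>/<th> markup into a list of text rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:               # ignore whitespace between tags
            self.row.append(data.strip())

# Made-up table in the shape of a price-index page
page = """
<table>
  <tr><th>Month</th><th>Index</th></tr>
  <tr><td>2017-05</td><td>512.0</td></tr>
</table>
"""
ex = TableExtractor()
ex.feed(page)
print(ex.rows)
```

The resulting row lists can then be written out as csv, which is roughly what "storing the table" means here.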
- Image
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the images}`
- paste or type the url
- type `image`
- type the minimum size (in bytes) of the images to download (eg. to download only images larger than 25kb, type 25000)
Example: This is a website that contains images
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/关于2020年2月广东电力市场结算情况的通告/
Enter the url or the path to the json file that contains multiple urls: https://zhuanlan.zhihu.com/p/124225606
table/image/pdf/pmos? image
Enter the minimum image size (in bytes) (press enter to accept default value: 15000 bytes): 25000
```
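The size threshold is a simple filter over candidate images: anything below the minimum (small icons, logos, spacers) is skipped. A sketch of just the filtering step, with hypothetical urls and sizes; in practice the sizes would come from the downloaded payloads or from Content-Length headers:

```python
# Default mentioned by the prompt above; getRaw.py's real default may differ
DEFAULT_MIN_BYTES = 15000

def filter_by_size(candidates, min_bytes=DEFAULT_MIN_BYTES):
    """candidates: iterable of (url, size_in_bytes) pairs.
    Keep only images at or above the threshold."""
    return [url for url, size in candidates if size >= min_bytes]

# Hypothetical candidates scraped from a page
candidates = [
    ("https://example.com/chart.png", 48210),
    ("https://example.com/icon.png", 1203),    # too small, dropped
    ("https://example.com/table.jpg", 26500),
]
print(filter_by_size(candidates, min_bytes=25000))
```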
- Pmos
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the pdfs}`
- paste or type the pmos url
- type `pmos`
- type the html class of the pdf element (press enter to accept the default value `el-table__row`)
- type the number of pages you want to scrape (press enter to accept the default value 1)
- type the search keyword (press enter to accept the default value None); if you give a search keyword, only pdfs that contain that keyword are downloaded.
Example: This is a PMOS website
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/Shandong_PMOS/
Enter the url or the path to the json file that contains multiple urls: https://pmos.sd.sgcc.com.cn/pxf-settlement-outnetpub/#/pxf-settlement-outnetpub/columnHomeLeftMenuNew
table/image/pdf/pmos? pmos
Enter the element class (press enter to accept default value: el-table__row):
Enter the number of page to scrape (press enter to accept default value: 1): 2
Enter the search keyword (press enter to accept default value: None) 工作日报
```
Note: This script will open a browser window that scrapes each pdf from the PMOS website. It also creates a json file that stores the url of each pdf; this file lets the script recognize pdfs it has already collected, so that it avoids re-scraping them.
Required Packages:

For text extraction (credit to Steven Zheng):
- tesseract-ocr
- go to src/getClean/image_to_text.py and change the variable at line 14 to your tesseract path

For image extraction:
- pdf2image
- from the above website, download the poppler build for your machine
- go to src/getClean/pdf_to_image.py and change the variable at line 10 to your poppler bin path
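For reference, the two path variables described above typically look like this in pytesseract / pdf2image code. The variable names, line positions, and Windows paths below are illustrative; the actual names in image_to_text.py and pdf_to_image.py may differ:

```python
import pytesseract
from pdf2image import convert_from_path

# src/getClean/image_to_text.py, line 14: point pytesseract at your
# tesseract-ocr install (path is an example, adjust to your machine)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# src/getClean/pdf_to_image.py, line 10: poppler's bin directory
# (example path; use wherever you unpacked poppler)
POPPLER_PATH = r"C:\poppler\Library\bin"
pages = convert_from_path("data/raw/example.pdf", poppler_path=POPPLER_PATH)
```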
- Pdf
How to use:
- open command prompt and cd to the folder that contains getClean.py.
- run `python getClean.py {the path to the pdf} {the path to the folder where you want to store the table}`
- type `pdf`

> Note: if {the path to the folder where you want to store the table} is omitted, the path to the pdf is reused with 'raw' replaced by 'clean'. (eg. if the path to the pdf is data/raw/xxxx.pdf, the cleaned table is stored in data/clean/xxxx/)
> Note: images of the pdf are created and stored in a temporary folder in the same folder as the pdf.
Example:
```
cd Desktop\scrapeAll\
python getClean.py data/raw/山东工作日报.pdf data/clean/山东工作日报/
What type of data? image/pdf/table pdf
```
Note: If you want to convert a folder of pdfs, simply put the folder path as the input path. (eg. python getClean.py data/folder_with_multiple_pdfs/ data/folder_to_store_clean_table)
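The default-output rule from the note above (replace 'raw' with 'clean' and use the pdf's name as the output folder) can be sketched as follows; this mirrors the documented behaviour, not getClean.py's actual code:

```python
from pathlib import PurePosixPath

def default_clean_dir(pdf_path):
    """Derive the fallback output folder: swap the 'raw' path segment
    for 'clean' and drop the .pdf suffix, as the note describes."""
    p = PurePosixPath(pdf_path)
    parts = ["clean" if part == "raw" else part for part in p.parts]
    return str(PurePosixPath(*parts).with_suffix("")) + "/"

print(default_clean_dir("data/raw/xxxx.pdf"))  # data/clean/xxxx/
```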
- Image
How to use:
- open command prompt and cd to the folder that contains getClean.py.
- run `python getClean.py {the path to the folder that contains images} {the path to the folder where you want to store the table}`
- type `image`

> Note: if {the path to the folder where you want to store the table} is omitted, the input path is reused with 'raw' replaced by 'clean'. (eg. if the folder that contains images is data/raw/xxxx/, the cleaned table is stored in data/clean/xxxx/)
Example:
```
cd Desktop\scrapeAll\
python getClean.py data/raw/website_image data/clean/website_table/
What type of data? image/pdf/table image
```