An overarching pipeline for scraping and cleaning web data.
The scrapeAll project provides a convenient, modular solution for collecting raw data from websites and cleaning the collected data. It consists of two main parts: collecting raw data and cleaning raw data.
1. Collect raw data from a website or from multiple urls (getRaw.py)
2. Clean the raw data (getClean.py)
- Urls
Required Packages:
Example of how to use: This is a website that contains multiple urls
```
cd Desktop\scrapeAll\
python getRaw.py data/**电煤价格指数/raw/2016/
Enter the url or the path to the json file that contains multiple urls: https://www.cctd.com.cn/list-46-1.html
urls/table/image/pmos_pdf? urls
Enter the class of the div that contains the urls: new_list
Enter the search keyword (press enter to accept default value: None): 2016
Enter the XPATH of the next-page-button (press enter if not applicable): //*[@id="pages"]/a[5]
Enter the number of page to scrape (press enter to accept default value: 1): 3
```
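The `urls` mode boils down to collecting every link inside a div of a given class (`new_list` in the example above). The sketch below illustrates that idea with the standard library only; it is not getRaw.py's actual implementation, and the HTML snippet and urls are made up (a real page would first be fetched over HTTP):

```python
from html.parser import HTMLParser

class DivLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside a div of a given class.

    Simplification: the class attribute is matched exactly, not as a
    space-separated class list."""
    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.depth = 0          # nesting depth inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1           # nested div inside the target
            elif attrs.get("class") == self.div_class:
                self.depth = 1            # entered the target div
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

# Made-up listing page in the shape getRaw expects
page = '''
<div class="new_list">
  <a href="https://www.cctd.com.cn/show-46-1.html">2016 report</a>
  <a href="https://www.cctd.com.cn/show-46-2.html">2017 report</a>
</div>
<div class="other"><a href="https://example.com/skip.html">skip</a></div>
'''
parser = DivLinkCollector("new_list")
parser.feed(page)
print(parser.links)
```

Links found outside the target div (the `other` div above) are ignored; the collected urls are what a later pass would scrape one by one.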
- Table
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the table}`
- paste or type the url
- type `table`
Example: This is a website that contains a table
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/2017年5月份**电煤价格指数/
Enter the url or the path to the json file that contains multiple urls: https://www.cctd.com.cn/show-46-167312-1.html
table/image/pdf/pmos? table
```
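The `table` mode amounts to flattening an HTML `<table>` into rows of cell text. A minimal stdlib sketch of that idea (not getRaw.py's actual code; the sample table and its values are made up):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Flatten <tr>/<td>/<th> markup into a list of text rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:               # ignore whitespace between tags
            self.row.append(data.strip())

# Made-up table in the shape of a price-index page
page = """
<table>
  <tr><th>Month</th><th>Index</th></tr>
  <tr><td>2017-05</td><td>512.0</td></tr>
</table>
"""
ex = TableExtractor()
ex.feed(page)
print(ex.rows)
```

The resulting row lists can then be written out as csv, which is roughly what "storing the table" means here.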
- Image
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the images}`
- paste or type the url
- type `image`
- type the minimum size (in bytes) of the images to download (eg. to download only images larger than 25kb, type 25000)
Example: This is a website that contains images
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/关于2020年2月广东电力市场结算情况的通告/
Enter the url or the path to the json file that contains multiple urls: https://zhuanlan.zhihu.com/p/124225606
table/image/pdf/pmos? image
Enter the minimum image size (in bytes) (press enter to accept default value: 15000 bytes): 25000
```
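The size threshold is a simple filter over candidate images: anything below the minimum (small icons, logos, spacers) is skipped. A sketch of just the filtering step, with hypothetical urls and sizes; in practice the sizes would come from the downloaded payloads or from Content-Length headers:

```python
# Default mentioned by the prompt above; getRaw.py's real default may differ
DEFAULT_MIN_BYTES = 15000

def filter_by_size(candidates, min_bytes=DEFAULT_MIN_BYTES):
    """candidates: iterable of (url, size_in_bytes) pairs.
    Keep only images at or above the threshold."""
    return [url for url, size in candidates if size >= min_bytes]

# Hypothetical candidates scraped from a page
candidates = [
    ("https://example.com/chart.png", 48210),
    ("https://example.com/icon.png", 1203),    # too small, dropped
    ("https://example.com/table.jpg", 26500),
]
print(filter_by_size(candidates, min_bytes=25000))
```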
- Pmos
Required Packages:
How to use:
- open command prompt and cd to the folder that contains getRaw.py.
- run `python getRaw.py {the folder where you want to store the pdfs}`
- paste or type the pmos url
- type `pmos`
- type the html class of the pdf element (press enter to accept the default value `el-table__row`)
- type the number of pages you want to scrape (press enter to accept the default value 1)
- type the search keyword (press enter to accept the default value None); if you give a search keyword, only pdfs that contain that keyword are downloaded.
Example: This is a PMOS website
```
cd Desktop\scrapeAll\
python getRaw.py data/raw/Shandong_PMOS/
Enter the url or the path to the json file that contains multiple urls: https://pmos.sd.sgcc.com.cn/pxf-settlement-outnetpub/#/pxf-settlement-outnetpub/columnHomeLeftMenuNew
table/image/pdf/pmos? pmos
Enter the element class (press enter to accept default value: el-table__row):
Enter the number of page to scrape (press enter to accept default value: 1): 2
Enter the search keyword (press enter to accept default value: None) 工作日报
```
Note: This script will open a browser window that scrapes each pdf from the PMOS website. It also creates a json file that stores the url of each pdf; this file lets the script recognize pdfs it has already collected, so that it avoids re-scraping them.
Required Packages:

For text extraction (credit to Steven Zheng):
- tesseract-ocr
- go to src/getClean/image_to_text.py and change the variable at line 14 to your tesseract path

For image extraction:
- pdf2image
- from the above website, download the poppler build for your machine
- go to src/getClean/pdf_to_image.py and change the variable at line 10 to your poppler bin path
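For reference, the two path variables described above typically look like this in pytesseract / pdf2image code. The variable names, line positions, and Windows paths below are illustrative; the actual names in image_to_text.py and pdf_to_image.py may differ:

```python
import pytesseract
from pdf2image import convert_from_path

# src/getClean/image_to_text.py, line 14: point pytesseract at your
# tesseract-ocr install (path is an example, adjust to your machine)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# src/getClean/pdf_to_image.py, line 10: poppler's bin directory
# (example path; use wherever you unpacked poppler)
POPPLER_PATH = r"C:\poppler\Library\bin"
pages = convert_from_path("data/raw/example.pdf", poppler_path=POPPLER_PATH)
```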
- Pdf
How to use:
- open command prompt and cd to the folder that contains getClean.py.
- run `python getClean.py {the path to the pdf} {the path to the folder where you want to store the table}`
- type `pdf`

> Note: if {the path to the folder where you want to store the table} is omitted, the path to the pdf is reused with 'raw' replaced by 'clean'. (eg. if the path to the pdf is data/raw/xxxx.pdf, the cleaned table is stored in data/clean/xxxx/)
> Note: images of the pdf are created and stored in a temporary folder in the same folder as the pdf.
Example:
```
cd Desktop\scrapeAll\
python getClean.py data/raw/山东工作日报.pdf data/clean/山东工作日报/
What type of data? image/pdf/table pdf
```
Note: If you want to convert a folder of pdfs, simply put the folder path as the input path. (eg. python getClean.py data/folder_with_multiple_pdfs/ data/folder_to_store_clean_table)
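The default-output rule from the note above (replace 'raw' with 'clean' and use the pdf's name as the output folder) can be sketched as follows; this mirrors the documented behaviour, not getClean.py's actual code:

```python
from pathlib import PurePosixPath

def default_clean_dir(pdf_path):
    """Derive the fallback output folder: swap the 'raw' path segment
    for 'clean' and drop the .pdf suffix, as the note describes."""
    p = PurePosixPath(pdf_path)
    parts = ["clean" if part == "raw" else part for part in p.parts]
    return str(PurePosixPath(*parts).with_suffix("")) + "/"

print(default_clean_dir("data/raw/xxxx.pdf"))  # data/clean/xxxx/
```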
- Image
How to use:
- open command prompt and cd to the folder that contains getClean.py.
- run `python getClean.py {the path to the folder that contains images} {the path to the folder where you want to store the table}`
- type `image`

> Note: if {the path to the folder where you want to store the table} is omitted, the input path is reused with 'raw' replaced by 'clean'. (eg. if the folder that contains images is data/raw/xxxx/, the cleaned table is stored in data/clean/xxxx/)
Example:
```
cd Desktop\scrapeAll\
python getClean.py data/raw/website_image data/clean/website_table/
What type of data? image/pdf/table image
```