- Parsing of sites with table-like or tile-like structures
- Initial data preprocessing
Pulse-Selenium requires Python 3.7+ and the Selenium ChromeDriver to be installed.
Create a virtual environment and install the dependencies listed in requirements.txt, then run the scraper in local mode:
./scripts/quotes.sh ./outputs/quotes local
or
./scripts/quotes.sh ./outputs/quotes remote
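The same invocation can be scripted from Python. The sketch below assumes the script takes exactly the two positional arguments shown above (an output directory, then `local` or `remote`). The helper names `build_command` and `run_scraper` are hypothetical and not part of Pulse-Selenium.

```python
import subprocess

SCRIPT = "./scripts/quotes.sh"   # script path taken from the commands above
MODES = ("local", "remote")      # the two modes shown above


def build_command(output_dir, mode, script=SCRIPT):
    """Return the argument list for one scraper run (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return [script, output_dir, mode]


def run_scraper(output_dir, mode):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(output_dir, mode), check=True)
```

For example, `run_scraper("./outputs/quotes", "local")` would correspond to the first command above.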
It can also be run through Docker:
export APP=./scripts/quotes.sh
docker-compose up
Open http://localhost:8888/lab?token=webtric in a browser to access the internal filesystem and read the scraped files.
The following example lists the scraped files and loads the last one into a pandas DataFrame:
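import pandas as pd
from os import listdir
from os.path import isfile, join

# Directory that holds the scraped files
VOLUME = "/home/webtric"

# Keep regular files only; sort by name because listdir() returns entries in arbitrary order
files = sorted(f for f in listdir(VOLUME) if isfile(join(VOLUME, f)))
print('List of all parsed files')
print('\n'.join(files))

# Load the last file in the sorted list and preview its first rows
df = pd.read_csv(join(VOLUME, files[-1]))
df.head()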
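If the goal is to load the most recent scrape, note that a file's position in a directory listing does not necessarily reflect when it was written. A minimal sketch that selects by modification time instead, assuming the scraped files are CSVs (the helper name `newest_file` is hypothetical):

```python
from pathlib import Path


def newest_file(directory, suffix=".csv"):
    """Return the most recently modified file with the given suffix, or None."""
    candidates = [p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix == suffix]
    # max() keyed on mtime picks the latest write; default=None covers an empty directory
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path can be passed directly to `pd.read_csv`, for example `pd.read_csv(newest_file("/home/webtric"))`, after checking that it is not `None`.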