examscrape

A simple Node.js / Python project for scraping examination material data from the SEC Exam Material Archive.

To clone the project, run the following in your console:

git clone https://github.com/sammce/examscrape.git
cd examscrape

Install dependencies using:

pnpm install
# or
yarn install
# or
npm install

Note: You must have Selenium's ChromeDriver installed and available on your system's PATH.
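
One way to confirm the driver is discoverable before running the scraper is to look it up on PATH. The sketch below is a minimal check, assuming the executable is named chromedriver (chromedriver.exe on Windows). The helper name find_on_path is hypothetical and not part of this project.

```python
import shutil


def find_on_path(executable: str = "chromedriver"):
    """Return the full path of `executable` if it is on PATH, else None."""
    # shutil.which searches the directories listed in PATH and honours
    # PATHEXT on Windows, so "chromedriver" also matches chromedriver.exe.
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path()
    if location is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {location}")
```

If the check reports that the driver is missing, add the directory containing it to PATH and open a new console before retrying.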

To run the scraper, use:

pnpm main
# or 
yarn main
# or
npm run main

This will compile the TypeScript file to JavaScript and run it. The scraper may take a while to finish, because it waits for an artificial delay between dropdown selections. Without that delay it would issue too many requests in quick succession and receive an HTTP 429 (Too Many Requests) error.
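
The idea of spacing out dropdown selections can be sketched as follows. This is an illustration in Python, not the scraper's actual TypeScript code. The function name select_with_delay, the select callback, and the one-second default are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def select_with_delay(
    options: Iterable[T],
    select: Callable[[T], R],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call `select` on each option, pausing between consecutive calls."""
    results: List[R] = []
    for index, option in enumerate(options):
        if index > 0:
            # Pause before every selection except the first, so requests
            # are spaced out instead of being sent back to back.
            sleep(delay_seconds)
        results.append(select(option))
    return results
```

The sleep parameter is injectable only so the pacing can be exercised without actually waiting. In normal use it defaults to time.sleep.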

The resulting data (located in db.json) is in a format unsuitable for the site, so a Python script is used to rearrange it.

To set up the Python environment, preferably in a venv, use:

pip install -r requirements.txt

Then run:

python rearrange.py
# or on Unix-like systems
python3 rearrange.py

This will only work if db.json is present in the current working directory.
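
Because the file is looked up relative to the current working directory, a pre-flight check can give a clearer message than a bare traceback. The following is a minimal sketch that assumes nothing about the internals of rearrange.py. The helper name load_db is hypothetical.

```python
import json
from pathlib import Path


def load_db(filename: str = "db.json"):
    """Load `filename` from the current working directory."""
    path = Path.cwd() / filename
    if not path.is_file():
        # Fail early with a hint instead of a less specific error later.
        raise FileNotFoundError(
            f"{filename} not found in {Path.cwd()}; run the scraper first "
            "or change into the directory that contains it."
        )
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```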

The resulting CSV file (ideal_db.csv) is what will be used in the finished site.
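
To illustrate what "rearranging" nested JSON into CSV rows can look like, here is a small sketch. The actual layout of db.json and the columns of ideal_db.csv are not documented here, so the nested subject -> year -> list-of-URLs shape, the column names, and the function names flatten and to_csv are purely hypothetical and do not describe what rearrange.py really does.

```python
import csv
import io


def flatten(db: dict) -> list:
    """Turn {subject: {year: [url, ...]}} (assumed shape) into flat row dicts."""
    rows = []
    for subject, years in db.items():
        for year, urls in years.items():
            for url in urls:
                rows.append({"subject": subject, "year": year, "url": url})
    return rows


def to_csv(rows: list) -> str:
    """Serialise row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["subject", "year", "url"],  # hypothetical columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each nested level becomes a column, so one deeply nested entry in the JSON maps to one flat row in the CSV.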