This repository contains the Python code and results of the IRAP web scraping exercise. File descriptions are as follows:

- `1.irap_scrape` extracts a collection of organization websites using a variety of Wikipedia "source" pages. The results are saved in `output/candidate_organizations.csv`.
- `2.extract_webinfo.py` visits the websites gathered in the first step and extracts information from the front page. Results are saved in `output/candidate_organizations_tosearch.csv`.
- `3.filter.py` filters the list of scraped candidates based on the presence of keywords on the front page. Results are saved in `output/preliminary_scraped_candidates.csv`.
- `4.manual_process.py` makes some manual adjustments to the dataset using information from `manual_additions.csv`. The final results are saved as `output/scraped_candidates.csv`.
- `5a.word_doc_extract.py` extracts all hyperlinks from the `Potential Certifiers - Advanced Manufacturing.docx` document, visits those links, extracts text from the front page, and saves them as `output/worddocorgs_tosearch.csv`.
- `5b.filter.py` searches for the list of keywords and saves the results as `output/worddoc_candidates.csv`.