Repo containing scripts and Jupyter notebooks to reproduce the experiments in my PhD thesis, organised by empirical chapter.
Contents:
- Chapter 3: Dataset creation from Common Crawl
  - The code is made available within the crawls_nest repository. The repo contains a `requirements.txt` file and 4 scripts:
    - `columnar_explorer.py` collects all columnar files from a monthly Common Crawl data dump (see the first sketch below)
    - `process_warc_files.py` extracts and collects UTF-8 text and/or hrefs from each website within a dump (see the second sketch below)
    - `utils.py` and `utils_html.py` contain utility functions indispensable for the scripts above
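
  The collection step can be illustrated with a minimal, hypothetical sketch: it lists the Parquet files of Common Crawl's public columnar index for one monthly dump, assuming the bucket layout `s3://commoncrawl/cc-index/table/cc-main/warc/crawl=<ID>/subset=warc/` and a made-up crawl ID; it is not the thesis script itself.

  ```python
  # Hypothetical sketch: list the columnar (Parquet) index files
  # of one monthly Common Crawl dump.
  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  CRAWL_ID = "CC-MAIN-2020-05"  # made-up monthly dump identifier

  # Anonymous client: the commoncrawl bucket is publicly readable.
  s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

  prefix = f"cc-index/table/cc-main/warc/crawl={CRAWL_ID}/subset=warc/"
  paginator = s3.get_paginator("list_objects_v2")

  parquet_keys = []
  for page in paginator.paginate(Bucket="commoncrawl", Prefix=prefix):
      for obj in page.get("Contents", []):
          if obj["Key"].endswith(".parquet"):
              parquet_keys.append(obj["Key"])

  print(f"{len(parquet_keys)} columnar files found for {CRAWL_ID}")
  ```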
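
  For the extraction step, a minimal sketch under similar caveats: it walks the response records of a WARC file with `warcio` and pulls the UTF-8 text and hrefs out of each HTML page with BeautifulSoup. The function name and filtering choices are illustrative, not the thesis implementation.

  ```python
  # Hypothetical sketch: yield (url, text, hrefs) for each HTML
  # response record in a WARC file.
  from bs4 import BeautifulSoup
  from warcio.archiveiterator import ArchiveIterator

  def extract_text_and_hrefs(warc_path):
      with open(warc_path, "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type != "response":
                  continue
              ctype = record.http_headers.get_header("Content-Type") or ""
              if "html" not in ctype:
                  continue  # keep only HTML responses
              url = record.rec_headers.get_header("WARC-Target-URI")
              html = record.content_stream().read().decode("utf-8", errors="ignore")
              soup = BeautifulSoup(html, "html.parser")
              text = soup.get_text(separator=" ", strip=True)
              hrefs = [a["href"] for a in soup.find_all("a", href=True)]
              yield url, text, hrefs
  ```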
- Chapter 5: Website scraping and Tomotopy modeling
- Chapter 6: Website Weakly Supervised Classification
  - This directory contains the `seedword.json` file, which is a seed words list for each classification label (a sketch of the assumed layout follows below)
  - To implement contextualised weak supervision, we used the code in ConWea. A guide for implementation can be found in that repo.
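
  The exact schema of `seedword.json` is not shown in this README; a minimal sketch, assuming a ConWea-style mapping from each label to its seed words, with made-up labels:

  ```python
  # Minimal sketch, assuming seedword.json maps each classification
  # label to a list of seed words (ConWea-style), e.g.
  # {"manufacturing": ["factory", "assembly"], "software": ["cloud"]}.
  # The labels are made up for illustration.
  import json

  with open("seedword.json") as f:
      seedwords = json.load(f)

  for label, words in seedwords.items():
      print(f"{label}: {', '.join(words)}")
  ```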
- Chapter 7: Inhomogeneous Ripley's K-function
  - The directory contains 4 files:
    - A `requirements.txt` file
    - `simulate_controls.py` runs a thinning algorithm to simulate control companies with the same spatial distribution as the tangible and intangible ones (see the thinning sketch below)
    - `kinhom_estimation_tutorial.ipynb` provides a step-by-step tutorial of our Python implementation of Ripley's inhomogeneous K-function based on pointpats (a minimal estimator sketch also follows below)
    - `kinhom_calcs.py` calculates the max and min of the Kinhom function on our simulated companies data
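
  The thinning step can be illustrated with a Lewis-Shedler-style sketch: simulate a homogeneous Poisson process at a dominating rate, then keep each point with probability proportional to a target intensity surface. The `intensity` function and all parameters below are assumptions for illustration, not the thesis code.

  ```python
  # Hypothetical thinning sketch: simulate an inhomogeneous Poisson
  # process on a rectangle by thinning a homogeneous one.
  import numpy as np

  rng = np.random.default_rng(42)

  def simulate_thinned(intensity, lam_max, xmin, xmax, ymin, ymax):
      area = (xmax - xmin) * (ymax - ymin)
      # Step 1: homogeneous Poisson process at the dominating rate lam_max.
      n = rng.poisson(lam_max * area)
      xs = rng.uniform(xmin, xmax, n)
      ys = rng.uniform(ymin, ymax, n)
      # Step 2: retain each point with probability intensity(x, y) / lam_max.
      keep = rng.uniform(0.0, 1.0, n) < intensity(xs, ys) / lam_max
      return xs[keep], ys[keep]

  # Example with a made-up intensity surface peaking at the origin.
  xs, ys = simulate_thinned(lambda x, y: 50 * np.exp(-(x**2 + y**2)),
                            50.0, -1.0, 1.0, -1.0, 1.0)
  ```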
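
  For the estimator itself, a minimal sketch of the standard inhomogeneous K-function without edge correction: it sums 1 / (lambda(x_i) * lambda(x_j)) over all point pairs within distance r and divides by the window area. The tutorial's pointpats-based implementation may differ in details such as edge correction.

  ```python
  # Hypothetical minimal estimator of Ripley's inhomogeneous K-function
  # (no edge correction).
  import numpy as np
  from scipy.spatial.distance import pdist, squareform

  def k_inhom(points, intensities, radii, area):
      """points: (n, 2) coordinates; intensities: lambda(x_i) at each
      point; radii: 1-d array of distances r; area: |A| of the window."""
      d = squareform(pdist(points))                 # pairwise distances
      np.fill_diagonal(d, np.inf)                   # drop i == j pairs
      w = 1.0 / np.outer(intensities, intensities)  # 1 / (lam_i * lam_j)
      return np.array([w[d <= r].sum() for r in radii]) / area
  ```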