
thesis_repo

Repo containing scripts and Jupyter notebooks to reproduce the experiments in my PhD thesis, organised by empirical chapter.

Contents:

  • Chapter 3: Dataset creation from Common Crawl
    • The code is made available in the crawls_nest repository, which contains the following files:
      • A requirements.txt file
      • columnar_explorer.py collects all columnar files from a monthly Common Crawl data dump
      • process_warc_files.py extracts and collects UTF-8 text and/or hrefs from each website within a dump (see the extraction sketch after this list)
      • utils.py and utils_html.py contain utility functions that the scripts above depend on
  • Chapter 5: Website scraping and Tomotopy topic modeling (a minimal Tomotopy example appears after this list)
  • Chapter 6: Website Weakly Supervised Classification
    • This directory contains the seedword.json file, which lists the seed words for each classification label (an illustration of its shape appears after this list)
    • To implement contextualised weak supervision, we used the code in ConWea; an implementation guide can be found in that repo.
  • Chapter 7: Inhomogeneous Ripley's K-function
    • The directory contains 4 files:
      • A requirements.txt file
      • simulate_controls.py runs a thinning algorithm to simulate control companies with the same spatial distribution as the tangible and intangible ones (a generic thinning sketch appears below)
      • kinhom_estimation_tutorial.ipynb provides a step-by-step tutorial of our Python implementation of Ripley's inhomogeneous K-function, based on pointpats (a minimal estimator sketch appears below)
      • kinhom_calcs.py calculates the maximum and minimum of the K_inhom function on our simulated companies data
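
For Chapter 3, here is a minimal sketch of the kind of extraction process_warc_files.py performs. The use of warcio and BeautifulSoup is an assumption for illustration; the actual script may rely on different libraries and logic.

```python
# Sketch of WARC text/href extraction in the spirit of
# process_warc_files.py (libraries and structure are assumptions).
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text_and_hrefs(warc_path):
    """Yield (url, text, hrefs) for each HTML response in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            text = soup.get_text(separator=" ", strip=True)
            hrefs = [a["href"] for a in soup.find_all("a", href=True)]
            yield url, text, hrefs
```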
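
For Chapter 5, a minimal illustrative Tomotopy LDA run is shown below; the topic count, training length, and toy corpus are placeholders, not the settings used in the thesis.

```python
# Illustrative Tomotopy LDA run; k, iterations, and the toy corpus
# are placeholders, not the thesis settings.
import tomotopy as tp

docs = [
    ["website", "scraping", "common", "crawl"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
]  # in practice: tokenised text scraped from websites

mdl = tp.LDAModel(k=2, seed=42)  # k topics (placeholder value)
for tokens in docs:
    mdl.add_doc(tokens)
mdl.train(iter=500)              # Gibbs sampling iterations

for topic_id in range(mdl.k):
    print(topic_id, mdl.get_topic_words(topic_id, top_n=5))
```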
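
For Chapter 6, seedword.json maps each classification label to its seed words; a sketch of loading and inspecting it follows (the example labels and words in the comment are invented, not the actual contents).

```python
# Loading seedword.json; the shape in the comment is inferred from the
# description above, and the example entries are invented.
import json

with open("seedword.json") as f:
    seed_words = json.load(f)  # dict mapping label -> list of seed words

# e.g. (hypothetical): {"manufacturing": ["factory", "assembly"],
#                       "services": ["consulting", "advisory"]}
for label, words in seed_words.items():
    print(label, words[:5])
```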
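
For Chapter 7, here is a generic sketch of independent thinning: each candidate point is kept with probability proportional to a target intensity at its location. The intensity function and the scaling used here are assumptions, not necessarily the procedure in simulate_controls.py.

```python
# Generic independent-thinning sketch; the target intensity and the
# scaling into probabilities are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def thin_points(points, target_intensity):
    """points: (n, 2) coordinates; target_intensity: callable (n, 2) -> (n,)."""
    rates = target_intensity(points)
    keep_prob = rates / rates.max()             # scale rates into [0, 1]
    keep = rng.random(len(points)) < keep_prob  # Bernoulli retention
    return points[keep]
```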
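
Finally, a minimal, edge-correction-free sketch of the inhomogeneous K-function estimator: each pair of points within distance r contributes the reciprocal product of their estimated intensities, and the sum is normalised by the window area. The tutorial notebook's pointpats-based implementation covers details (edge correction, intensity estimation) omitted here.

```python
# Minimal inhomogeneous K estimator without edge correction; the
# notebook's implementation is more complete.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_inhom(points, intensities, radii, area):
    """points: (n, 2); intensities: (n,) values of lambda at each point;
    radii: (m,) distances r; area: |A| of the observation window."""
    d = squareform(pdist(points))      # pairwise distance matrix
    w = 1.0 / np.outer(intensities, intensities)
    np.fill_diagonal(w, 0.0)           # exclude i == j pairs
    return np.array([(w * (d <= r)).sum() for r in radii]) / area
```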