
Repo containing the scripts and Jupyter notebooks needed to reproduce the experiments in my PhD thesis, organized by empirical chapter.



  • Chapter 3: Dataset creation from Common Crawl
    • The code is available in the crawls_nest repository. The repo contains the following files:
      • A requirements.txt file
      • columnar_explorer.py collects all columnar files from a monthly Common Crawl data dump
      • process_warc_files.py extracts and collects UTF-8 text and/or hrefs from each website within a dump
      • utils.py and utils_html.py contain utility functions required by the scripts above
  • Chapter 5: Website scraping and Tomotopy modeling
  • Chapter 6: Website Weakly Supervised Classification
    • This directory contains the seedword.json file, which lists the seed words for each classification label
    • To implement contextualised weak supervision, we used the ConWea code. An implementation guide can be found in the repo.
  • Chapter 7: Inhomogeneous Ripley's K-function
    • The directory contains 4 files:
      • A requirements.txt file
      • simulate_controls.py runs a thinning algorithm to simulate companies with the same distribution as tangible and intangible ones
      • kinhom_estimation_tutorial.ipynb provides a step-by-step tutorial of our Python implementation of Ripley's inhomogeneous K-function, based on pointpats
      • kinhom_calcs.py calculates the maximum and minimum of the Kinhom function over our simulated company data
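
To give a rough idea of what the Chapter 7 files do, here is a minimal sketch of thinning followed by a naive inhomogeneous K estimate. It is not the code in simulate_controls.py or kinhom_calcs.py: the function names, the intensity surface, the unit-square study area and the absence of edge correction are all assumptions made for illustration, and it uses only the Python standard library rather than pointpats.

```python
import math
import random


def intensity(x, y):
    """Hypothetical intensity surface: denser near the centre of the unit square."""
    return 200.0 * math.exp(-8.0 * ((x - 0.5) ** 2 + (y - 0.5) ** 2))


def simulate_by_thinning(intensity_fn, lam_max, rng):
    """Simulate an inhomogeneous Poisson pattern on the unit square by thinning.

    lam_max must be an upper bound on intensity_fn over the study area.
    """
    # Draw a Poisson(lam_max) candidate count by counting unit-rate
    # exponential arrivals that fall before lam_max (the area is 1).
    n_candidates = 0
    t = rng.expovariate(1.0)
    while t < lam_max:
        n_candidates += 1
        t += rng.expovariate(1.0)

    kept = []
    for _ in range(n_candidates):
        x, y = rng.random(), rng.random()
        # Keep each candidate with probability intensity / lam_max.
        if rng.random() < intensity_fn(x, y) / lam_max:
            kept.append((x, y))
    return kept


def kinhom_naive(points, intensity_fn, r, area=1.0):
    """Naive inhomogeneous K estimate at distance r, without edge correction.

    Each ordered pair of distinct points closer than r contributes
    1 / (intensity at first point * intensity at second point).
    """
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                total += 1.0 / (intensity_fn(xi, yi) * intensity_fn(xj, yj))
    return total / area


rng = random.Random(42)  # fixed seed so the sketch is reproducible
points = simulate_by_thinning(intensity, lam_max=200.0, rng=rng)
k_small = kinhom_naive(points, intensity, r=0.05)
k_large = kinhom_naive(points, intensity, r=0.10)
```

Repeating the simulation many times and recording the smallest and largest K value at each distance would give a simulation envelope against which an observed pattern can be compared, which is the kind of max/min summary described above.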