
thesis_repo

Repo containing scripts and Jupyter notebooks to reproduce the experiments in my PhD thesis, organised by empirical chapter.

Contents:

  • Chapter 3: Dataset creation from Common Crawl
    • The code is made available in the crawls_nest repository, which contains the following files:
      • A requirements.txt file
      • columnar_explorer.py collects all columnar files from a monthly Common Crawl data dump
      • process_warc_files.py extracts and collects UTF-8 text and/or hrefs from each website within a dump (see the extraction sketch after this list)
      • utils.py and utils_html.py contain utility functions that the scripts above depend on
  • Chapter 5: Website scraping and Tomotopy topic modeling (a minimal Tomotopy example appears after this list)
  • Chapter 6: Website Weakly Supervised Classification
    • This directory contains the seedword.json file, which lists the seed words for each classification label (an illustration of its shape appears after this list)
    • To implement contextualised weak supervision, we used the code in ConWea; an implementation guide can be found in that repo.
  • Chapter 7: Inhomogeneous Ripley's K-function
    • The directory contains 4 files:
      • A requirements.txt file
      • simulate_controls.py runs a thinning algorithm to simulate control companies with the same spatial distribution as the tangible and intangible ones (a generic thinning sketch appears below)
      • kinhom_estimation_tutorial.ipynb provides a step-by-step tutorial of our Python implementation of Ripley's inhomogeneous K-function, based on pointpats (a minimal estimator sketch appears below)
      • kinhom_calcs.py calculates the maximum and minimum of the K_inhom function on our simulated companies data
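
For Chapter 3, here is a minimal sketch of the kind of extraction process_warc_files.py performs. The use of warcio and BeautifulSoup is an assumption for illustration; the actual script may rely on different libraries and logic.

```python
# Sketch of WARC text/href extraction in the spirit of
# process_warc_files.py (libraries and structure are assumptions).
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_text_and_hrefs(warc_path):
    """Yield (url, text, hrefs) for each HTML response in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            text = soup.get_text(separator=" ", strip=True)
            hrefs = [a["href"] for a in soup.find_all("a", href=True)]
            yield url, text, hrefs
```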
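
For Chapter 5, a minimal illustrative Tomotopy LDA run is shown below; the topic count, training length, and toy corpus are placeholders, not the settings used in the thesis.

```python
# Illustrative Tomotopy LDA run; k, iterations, and the toy corpus
# are placeholders, not the thesis settings.
import tomotopy as tp

docs = [
    ["website", "scraping", "common", "crawl"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
]  # in practice: tokenised text scraped from websites

mdl = tp.LDAModel(k=2, seed=42)  # k topics (placeholder value)
for tokens in docs:
    mdl.add_doc(tokens)
mdl.train(iter=500)              # Gibbs sampling iterations

for topic_id in range(mdl.k):
    print(topic_id, mdl.get_topic_words(topic_id, top_n=5))
```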
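
For Chapter 6, seedword.json maps each classification label to its seed words; a sketch of loading and inspecting it follows (the example labels and words in the comment are invented, not the actual contents).

```python
# Loading seedword.json; the shape in the comment is inferred from the
# description above, and the example entries are invented.
import json

with open("seedword.json") as f:
    seed_words = json.load(f)  # dict mapping label -> list of seed words

# e.g. (hypothetical): {"manufacturing": ["factory", "assembly"],
#                       "services": ["consulting", "advisory"]}
for label, words in seed_words.items():
    print(label, words[:5])
```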
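
For Chapter 7, here is a generic sketch of independent thinning: each candidate point is kept with probability proportional to a target intensity at its location. The intensity function and the scaling used here are assumptions, not necessarily the procedure in simulate_controls.py.

```python
# Generic independent-thinning sketch; the target intensity and the
# scaling into probabilities are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def thin_points(points, target_intensity):
    """points: (n, 2) coordinates; target_intensity: callable (n, 2) -> (n,)."""
    rates = target_intensity(points)
    keep_prob = rates / rates.max()             # scale rates into [0, 1]
    keep = rng.random(len(points)) < keep_prob  # Bernoulli retention
    return points[keep]
```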
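
Finally, a minimal, edge-correction-free sketch of the inhomogeneous K-function estimator: each pair of points within distance r contributes the reciprocal product of their estimated intensities, and the sum is normalised by the window area. The tutorial notebook's pointpats-based implementation covers details (edge correction, intensity estimation) omitted here.

```python
# Minimal inhomogeneous K estimator without edge correction; the
# notebook's implementation is more complete.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_inhom(points, intensities, radii, area):
    """points: (n, 2); intensities: (n,) values of lambda at each point;
    radii: (m,) distances r; area: |A| of the observation window."""
    d = squareform(pdist(points))      # pairwise distance matrix
    w = 1.0 / np.outer(intensities, intensities)
    np.fill_diagonal(w, 0.0)           # exclude i == j pairs
    return np.array([(w * (d <= r)).sum() for r in radii]) / area
```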