This repository contains code implementing the textual-factor framework developed in Cong, Liang and Zhang (2019).
The repository is organized as follows:
- `data` folder contains the corpus of textual data that users wish to analyze.
- `output` folder contains the tokenized textual data that the code will generate.
- `src` folder contains the project's source code for tokenization, clustering, and topic modeling.
This project uses BeautifulSoup4, FALCONN, and gensim. This version of the code also uses Google's pre-trained word2vec embeddings of words and phrases to generate clusters, so users should download them here and place the file in the project's root directory.
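As a rough illustration (not taken from this repository's source), the pre-trained embeddings can be loaded with gensim's `KeyedVectors`. The filename below is the conventional name of Google's downloadable archive and is an assumption here; adjust it to wherever you placed the file.

```python
# Minimal sketch, assuming the standard GoogleNews archive name; not this
# repository's actual loading code.
from gensim.models import KeyedVectors

# binary=True because Google's archive ships in word2vec's binary format
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(vectors["stock"].shape)             # (300,) -- 300-dimensional vectors
print(vectors.most_similar("stock")[:3])  # nearest neighbors in embedding space
```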
To run the textual-factor analysis:

```
$ python3 text_analysis.py
```
Relevant output includes:

- `google_cluster_50.txt` contains word clusters generated by Algorithm 2 in Cong, Liang and Zhang (2019), i.e. hierarchical clustering with the cluster size parameter equal to 50.
- `google_cluster50_processed.txt` contains the final clusters the code uses, obtained by intersecting the generated clusters with the corpus vocabulary and removing repeated words and stop words.
- `50document_loadings_google.csv` contains the factors' beta loadings on the documents derived by Algorithm 4 in Cong, Liang and Zhang (2019), i.e. SVD (see the sketch after this list).
- `top500_important_clusters_google50_avg.csv` contains the 500 most important factors given by Algorithm 4 in Cong, Liang and Zhang (2019).
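At its core, the SVD step behind these two CSV outputs is a low-rank decomposition of a documents × clusters matrix. The following is a minimal sketch of that idea with numpy, assuming a matrix `X` of cluster counts per document; the variable names, normalization, and factor count are illustrative, not the repository's actual code.

```python
# Illustrative sketch of the SVD idea in Algorithm 4; names and
# preprocessing here are assumptions, not this repo's implementation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 500))        # stand-in for a documents x clusters matrix

# Economy-size SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 10                            # number of textual factors to keep
loadings = U[:, :k] * s[:k]       # documents' beta loadings on the top-k factors

# Rank clusters by their weight on the leading factors, analogous to
# selecting the most important clusters/factors.
importance = np.abs(Vt[:k, :]).sum(axis=0)
top_clusters = np.argsort(importance)[::-1][:5]
print(loadings.shape, top_clusters)
```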
- This code is an abstracted version of the old code.
- The tokenization and clustering are multithreaded (see the sketch below).
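For readers unfamiliar with the pattern, here is a generic sketch of multithreaded tokenization over the `data` folder. This is not the repository's implementation; the toy tokenizer and worker-pool size are assumptions for illustration.

```python
# Generic multithreaded-tokenization pattern (illustrative sketch; the
# repository's own worker functions and file layout are not shown here).
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def tokenize(path: Path) -> list[str]:
    """Toy tokenizer: lowercase the file's text and split on whitespace."""
    return path.read_text(encoding="utf-8").lower().split()

docs = sorted(Path("data").glob("*.txt"))   # corpus files in the data folder

# Threads suit this step because the work is dominated by file I/O.
with ThreadPoolExecutor(max_workers=8) as pool:
    tokenized = list(pool.map(tokenize, docs))

print(f"tokenized {len(tokenized)} documents")
```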