
Textual Factor Analysis

This repository contains code implementing the textual-factor framework developed in Cong, Liang and Zhang (2019).

The repository is organized as follows:

  1. The data folder contains the corpus of textual data to be analyzed.

  2. The output folder contains the tokenized textual data generated by the code.

  3. The src folder contains the project's source code for tokenization, clustering, and topic modeling.

Install

This project uses BeautifulSoup4, FALCONN, and gensim. This version of the code also uses Google's pre-trained word2vec embeddings of words and phrases to generate clusters, so users should download them here and place the file in the project's root directory.
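
As a quick sanity check that the dependencies are installed and that the embeddings load correctly, a snippet along the following lines can be run from the project directory. The filename GoogleNews-vectors-negative300.bin is the conventional name of Google's pre-trained archive and is assumed here rather than taken from this repository:

    # Minimal sketch: confirm gensim can read the pre-trained embeddings.
    from gensim.models import KeyedVectors

    EMBEDDING_PATH = "GoogleNews-vectors-negative300.bin"  # assumed filename

    # binary=True because Google's archive is distributed in binary format
    wv = KeyedVectors.load_word2vec_format(EMBEDDING_PATH, binary=True)
    print(wv.vector_size)            # 300 for the Google News vectors
    print(wv.most_similar("stock"))  # nearest neighbors in embedding space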

Usage and Relevant Output

To run textual-factor analysis:

$ python3 text_analysis.py

Relevant output includes:

  • google_cluster_50.txt contains the word clusters generated by Algorithm 2 in Cong, Liang and Zhang (2019), i.e., hierarchical clustering with the cluster-size parameter set to 50.

  • google_cluster50_processed.txt contains the final clusters the code uses, obtained by intersecting the generated clusters with the corpus vocabulary and removing duplicates and stop words.

  • 50document_loadings_google.csv contains the factors' beta loadings on the documents, derived by Algorithm 4 in Cong, Liang and Zhang (2019), i.e., SVD.

  • top500_important_clusters_google50_avg.csv contains the 500 most important factors identified by Algorithm 4 in Cong, Liang and Zhang (2019); see the sketch after this list for one way to inspect the CSV outputs.
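
The CSV outputs can be inspected with standard tools once the pipeline has finished. The sketch below uses pandas, which is not listed as a project dependency, and assumes the files are plain comma-separated tables; the exact column layout depends on the code version:

    import pandas as pd

    # Beta loadings of the size-50 clusters (factors) on each document (Algorithm 4 / SVD)
    loadings = pd.read_csv("50document_loadings_google.csv")
    print(loadings.shape)
    print(loadings.head())

    # The 500 most important factors ranked by the same procedure
    top_clusters = pd.read_csv("top500_important_clusters_google50_avg.csv")
    print(top_clusters.head())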

Update from previous version

  • This code is an abstracted version of the previous code.

  • Tokenization and clustering are now multithreaded; a sketch of the general pattern follows this list.
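
The sketch below illustrates the general thread-pool pattern for tokenization only; it is not the repository's actual implementation, and tokenize_document as well as the simple whitespace tokenizer are hypothetical placeholders:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def tokenize_document(path):
        # Hypothetical tokenizer: lowercase whitespace split.
        with open(path, encoding="utf-8", errors="ignore") as f:
            return path, f.read().lower().split()

    # Tokenize every file in data/ in parallel and write the tokens to output/.
    doc_paths = [os.path.join("data", name) for name in os.listdir("data")]

    with ThreadPoolExecutor(max_workers=8) as pool:
        for path, tokens in pool.map(tokenize_document, doc_paths):
            out_path = os.path.join("output", os.path.basename(path) + ".tok")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(" ".join(tokens))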