Code Repository for the Paper: Zero-Shot Taxonomy Induction using Representation Learning: An Empirical Study
Retrofitting is implemented based on the paper: Retrofitting Word Vectors to Semantic Lexicons (Faruqui et al., 2015)
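For reference, the retrofitting update has a compact iterative form. The sketch below is an illustration of that update with uniform weights (alpha_i = 1, beta_ij = 1/degree(i), as in the paper), not the repository's implementation; all names are hypothetical.

```python
def retrofit(embeddings, lexicon, iterations=10):
    """Pull each word vector towards the mean of its lexicon neighbours
    while staying anchored to its original (pre-trained) vector."""
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue
            for d in range(len(new[word])):
                nbr_mean = sum(new[n][d] for n in nbrs) / len(nbrs)
                # alpha = 1 (anchor term), betas sum to 1 (neighbour term)
                new[word][d] = (nbr_mean + embeddings[word][d]) / 2.0
    return new

# toy example: "a" is pulled halfway towards its lexicon neighbour "b"
vectors = retrofit({"a": [1.0], "b": [0.0]}, {"a": ["b"]})
```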
To create the training data from the taxonomy, use the following command,
python3 createTrainingData.py --src_taxonomy=<input_taxonomy> --out_file=<training_data>.pkl
where,
<input_taxonomy> = Input taxonomy file path. Format of taxonomy file should be similar to the Google Product Taxonomy as found here: https://www.google.com/basepages/producttype/taxonomy.en-US.txt
<training_data> = Generated training data which is used for training the super-type classifier model
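The exact training-data construction lives in createTrainingData.py; as a rough illustration (function name hypothetical), (sub-type, super-type) pairs can be read off the `>`-separated paths of the Google Product Taxonomy format like this:

```python
def extract_pairs(lines):
    """Collect (child, parent) pairs from taxonomy paths such as
    'Animals & Pet Supplies > Pet Supplies > Bird Supplies'."""
    pairs = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the version-header comment
        path = [part.strip() for part in line.split(">")]
        # each node's parent is the element immediately before it in the path
        pairs.update(zip(path[1:], path[:-1]))
    return sorted(pairs)

taxonomy = [
    "# Google_Product_Taxonomy_Version: 2021-09-21",
    "Animals & Pet Supplies",
    "Animals & Pet Supplies > Pet Supplies",
    "Animals & Pet Supplies > Pet Supplies > Bird Supplies",
]
# yields ('Bird Supplies', 'Pet Supplies') and ('Pet Supplies', 'Animals & Pet Supplies')
```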
To train the super-type classifier model, use the following command,
python3 trainClassifier.py --input_data=<training_data>.pkl --model_type=<model_type>
--embeddings_path=<embeddings_path> --embeddings_dims=<embeddings_dims> --vector_features=<feature_type>
--out_classifier_model=<output_model_path>.pkl
where,
<training_data> = Training data created from taxonomy
<model_type> = Classifier model to use (Valid options are SVM, Logistic, LogisticCV, RandomForest). RandomForest works best
<embeddings_path>, <embeddings_dims> = Embeddings to use (Should be in Word2Vec format) and its dimensions
<feature_type> = Feature type to use (Valid options are vector_concat, vector_diff).
vector_concat concatenates the word embeddings of the node pairs.
vector_diff computes the difference between the word embeddings of the node pairs.
<output_model_path> = Output model path to store the trained model
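The two --vector_features options can be sketched as follows. This is a simplified illustration, not the code in trainClassifier.py, and treating vector_diff as an element-wise child-minus-parent difference is an assumption.

```python
def pair_features(child_vec, parent_vec, feature_type):
    """Build the classifier feature vector for one (child, parent) node pair."""
    if feature_type == "vector_concat":
        # concatenation: 2 * embeddings_dims features
        return list(child_vec) + list(parent_vec)
    if feature_type == "vector_diff":
        # element-wise difference: embeddings_dims features
        return [c - p for c, p in zip(child_vec, parent_vec)]
    raise ValueError("unknown feature type: " + feature_type)
```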
To generate the ranks file for the methods Pre-trained Embeddings (method 1), Domain-specific Corpus-trained Embeddings (method 2), Super-type Classification using Pre-trained Embeddings (method 3), and Super-type Classification using Retrofitted Embeddings (method 4), use the following command,
python3 generateRanks.py --lexicon_to_evaluate=<ground_truth_lexicon> --embeddings_path=<embeddings_path>
--embeddings_dims=<embeddings_dims> --classifier_model_path=<super_type_classifier_model_path>.pkl
--out_ranks_file=<output_ranks_file_path>.jl
To generate the ranks file for the best-performing method, Retrofitted Embedding Transfer with Super-Type Classification (method 5), use the following command,
python3 generateRanksBestMethod.py --lexicon_to_evaluate=<ground_truth_lexicon> --embeddings_path=<embeddings_path>
--embeddings_dims=<embeddings_dims> --classifier_model_path=<super_type_classifier_model_path>.pkl
--out_ranks_file=<output_ranks_file_path>.jl
where,
<ground_truth_lexicon> = Ground truth lexicon (generated by generateGroundTruth.py)
<embeddings_path>, <embeddings_dims> = Embeddings to use (Should be in Word2Vec format) and its dimensions
<super_type_classifier_model_path> = Trained super type classifier model (optional argument needed only for methods 3, 4 and 5)
<output_ranks_file_path> = Output path to store the generated ranks file (needed for computing the MAP and nDCG metrics in the Rank metrics module)
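For the pure embedding methods (1 and 2), ranking neighbours amounts to sorting candidate super-types by similarity to the query node. Below is a minimal cosine-similarity sketch with toy vectors and hypothetical names; the scripts' actual scoring may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_neighbors(query, candidates, emb):
    """Return candidates sorted by decreasing cosine similarity to the query."""
    return sorted(candidates, key=lambda c: cosine(emb[query], emb[c]), reverse=True)

emb = {"dog": [1.0, 0.0], "animal": [0.9, 0.1], "car": [0.0, 1.0]}
# "animal" ranks above "car" as a super-type candidate for "dog"
```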
To compute the MAP and nDCG metrics, use the following command,
python3 computeRankMetrics.py --ground_truth=<ground_truth_lexicon> --baseline_ranks_file=<baseline_ranks>.jl
--comparison_ranks_file=<comparison_ranks>.jl
where,
<ground_truth_lexicon> = Ground Truth lexicon
<baseline_ranks> = Baseline Ranks file
<comparison_ranks> = Comparison Ranks file
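For reference, MAP and nDCG (with binary relevance) can be computed as below. This is a generic sketch of the two metrics, not the code in computeRankMetrics.py.

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / max(len(relevant), 1)

def mean_average_precision(runs):
    # runs: list of (ranked_list, relevant_set) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def ndcg(ranked, relevant):
    """Normalised discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(k + 1)
              for k, item in enumerate(ranked, start=1) if item in relevant)
    idcg = sum(1.0 / math.log2(k + 1) for k in range(1, len(relevant) + 1))
    return dcg / idcg if idcg else 0.0
```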
To generate the MST Tree from the ranks file, use the following command,
python3 computeMST.py --ranks_file=<input_ranks_file> --max_num_edges=<max_num_edges> --weighted=<weight_condition>
--out_mst_file=<output_mst_file>
where,
<input_ranks_file> = Input ranks file (generated in the "Ranking the neighbors" module)
<max_num_edges> = Maximum number of edges considered for creating the lexicon graph (default: 50)
<weight_condition> = Whether to use edge weights in the graph when computing the Maximum Spanning Tree (MST) from the graph (True or False)
<output_mst_file> = Generated MST file (stored as weighted edge list representation as in NetworkX package)
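computeMST.py relies on NetworkX; the underlying idea, Kruskal's algorithm run over edges in descending weight order, can be sketched in a dependency-free form (hypothetical names, and treating max_num_edges as a cap on the heaviest edges kept is an assumption):

```python
def maximum_spanning_tree(edges, max_num_edges=50):
    """edges: (weight, u, v) tuples. Greedily keep the heaviest edges
    that do not close a cycle (Kruskal's algorithm, maximised)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges, reverse=True)[:max_num_edges]:
        ru, rv = find(u), find(v)
        if ru != rv:  # joins two components, so no cycle is formed
            parent[ru] = rv
            mst.append((w, u, v))
    return mst
```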
To compute the approximate tree similarity metrics, use the following command,
python3 computeApproxTreeSimilarity.py --baseline_mst_file=<baseline_mst_tree> --comparison_mst_file=<comparison_mst_tree>
--ground_truth=<ground_truth_lexicon> --gt_sample_nodes_set=<sample_nodes_set>
where,
<baseline_mst_tree> = Baseline MST tree (Should be in weighted edge list representation as in NetworkX package)
<comparison_mst_tree> = Comparison MST tree (Should be in weighted edge list representation as in NetworkX package)
<ground_truth_lexicon> = Ground Truth lexicon
<sample_nodes_set> = Sample set of nodes from the Ground Truth tree, used as source nodes when computing the Jaccard similarity metric over shortest paths
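The Jaccard-on-shortest-paths idea: for each sampled source node, compare the set of nodes on its shortest path to a target in the two trees. A minimal BFS-based sketch on edge lists follows (hypothetical names; the script's exact pairing of sources and targets may differ):

```python
from collections import deque

def shortest_path(edges, src, dst):
    """BFS shortest path between src and dst in an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for w in adj.get(u, []):
            if w not in prev:
                prev[w] = u
                queue.append(w)
    path, node = [], dst
    while node is not None:  # walk predecessors back to the source
        path.append(node)
        node = prev[node]
    return path[::-1]

def jaccard(a, b):
    """Jaccard similarity between two node collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```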
To compute the Tree Edit distance metric, use the following command,
python3 computeTED.py --save_plots=<save_condition> --plot_dir=<plot_directory> --mst_file=<mst_tree>
--ground_truth=<ground_truth_lexicon>
where,
<save_condition> = Whether to save the visualization of the Ground truth tree and Maximum Spanning Tree (MST) as image files. (True or False)
<plot_directory> = Directory to save the plots
<mst_tree> = MST tree (Should be in weighted edge list representation as in NetworkX package)
<ground_truth_lexicon> = Ground Truth lexicon
Note that it is not feasible to compute the Tree Edit Distance metric for a large tree (with number of nodes >=1000) because of its time complexity.
To compute the various statistics about the taxonomy, use the following command,
python3 computeTaxonomyStats.py --input_taxonomy=<input_taxonomy>
To generate the lexicon from the taxonomy tree, use the following command,
python3 generateGroundTruth.py --src_taxonomy=<input_taxonomy> --out_lexicon_file=<output_lexicon>
where,
<input_taxonomy> = Input taxonomy file path. Format of taxonomy file should be similar to the Google Product Taxonomy as found here: https://www.google.com/basepages/producttype/taxonomy.en-US.txt
<output_lexicon> = Output lexicon generated based on the input taxonomy tree
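A ground-truth lexicon can be derived from the same `>`-separated taxonomy paths. One plausible form (an assumption; generateGroundTruth.py's actual output format may differ) maps every node to the set of its ancestors:

```python
def build_lexicon(lines):
    """Map each taxonomy node to the set of all of its ancestors."""
    lexicon = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the version-header comment
        path = [part.strip() for part in line.split(">")]
        # every element before the last one on the path is an ancestor
        lexicon.setdefault(path[-1], set()).update(path[:-1])
    return lexicon

taxonomy = [
    "Animals & Pet Supplies",
    "Animals & Pet Supplies > Pet Supplies",
    "Animals & Pet Supplies > Pet Supplies > Bird Supplies",
]
lexicon = build_lexicon(taxonomy)
```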