Code for reproducing "Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP".
This code is adapted from the official implementation released by the paper's authors.
Download the data folder from the Text-GCN repository and place it in the main directory.
conda create --name text_clf python=3
conda activate text_clf
pip install -r requirements.txt
Download glove.42B.300d.zip, extract it, and place the resulting glove.42B.300d.txt inside a folder called glove (the commands below expect glove/glove.42B.300d.txt).
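For example, a minimal sketch of fetching and extracting the embeddings (the URL is the standard Stanford NLP download mirror, an assumption not taken from this repository):

# Sketch: download and extract the GloVe vectors (assumes the Stanford NLP mirror is available)
mkdir -p glove
wget https://nlp.stanford.edu/data/glove.42B.300d.zip
unzip glove.42B.300d.zip -d glove   # yields glove/glove.42B.300d.txt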
Run the experiments for each dataset by setting DATASET below to one of [20ng, R8, R52, ohsumed, mr], or loop over all of them as sketched after the commands.
DATASET="20ng"
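# TF-IDF bag-of-words fed into a one-hidden-layer (wide) MLP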
python run_text_classification.py --model_type mlp --tokenizer_name "bert-base-uncased" --bow_aggregation "tfidf" --mlp_num_layers 1 --batch_size 16 --learning_rate "0.001" --epochs 100 --num_workers 4 --results_file "results/results_tfidf_mlp.csv" "$DATASET"
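# MLPs over pretrained GloVe embeddings (2-layer and 3-layer variants)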
python run_text_classification.py --model_type mlp --model_name_or_path "glove/glove.42B.300d.txt" --batch_size 16 --learning_rate "0.001" --mlp_num_layers 2 --mlp_embedding_dropout "0.0" --epochs 100 --num_workers 4 --results_file "results/results_glove42b_mlp.csv" "$DATASET"
python run_text_classification.py --model_type mlp --model_name_or_path "glove/glove.42B.300d.txt" --batch_size 16 --learning_rate "0.001" --mlp_num_layers 3 --mlp_embedding_dropout "0.0" --epochs 100 --num_workers 4 --results_file "results/results_glove42b_mlp_2.csv" "$DATASET"
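# Bag-of-words MLPs with the BERT tokenizer and no pretrained embeddings (default depth and 2 layers)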
python run_text_classification.py --model_type mlp --tokenizer_name "bert-base-uncased" --batch_size 16 --learning_rate "0.001" --epochs 100 --num_workers 4 --results_file "results/results_mlp.csv" "$DATASET"
python run_text_classification.py --model_type mlp --tokenizer_name "bert-base-uncased" --batch_size 16 --learning_rate "0.001" --epochs 100 --mlp_num_layers 2 --num_workers 4 --results_file "results/results_mlp_2.csv" "$DATASET"
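# DistilBERT baseline, fine-tuned for 10 epochs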
python run_text_classification.py --model_type distilbert --model_name_or_path "distilbert-base-uncased" --batch_size 8 --learning_rate "0.00005" --gradient_accumulation_steps 4 --epochs 10 --num_workers 4 --results_file "results/results_distilbert_10epochs.csv" "$DATASET"
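# BERT baselines: standard fine-tuning, fine-tuning with position ids ignored, and fine-tuning with 20% shuffle augmentation.
# Note: the last two runs write to the same results file (results/results_bert_10epochs.csv).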
python run_text_classification.py --model_type bert --model_name_or_path "bert-base-uncased" --batch_size 8 --learning_rate "0.00005" --gradient_accumulation_steps 4 --epochs 10 --num_workers 4 --results_file "results/results_bert_10epochs_1.csv" "$DATASET"
python run_text_classification.py --model_type bert --model_name_or_path "bert-base-uncased" --batch_size 8 --learning_rate "0.00005" --gradient_accumulation_steps 4 --epochs 10 --ignore_position_ids --num_workers 4 --results_file "results/results_bert_10epochs.csv" "$DATASET"
python run_text_classification.py --model_type bert --model_name_or_path "bert-base-uncased" --batch_size 8 --learning_rate "0.00005" --gradient_accumulation_steps 4 --epochs 10 --shuffle_augment 0.2 --num_workers 4 --results_file "results/results_bert_10epochs.csv" "$DATASET"
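To run a configuration over all five datasets in one pass, the commands above can be wrapped in a shell loop. A minimal sketch using the TF-IDF MLP command (any of the other commands can be substituted; if run_text_classification.py does not create the results/ directory itself, create it first):

# Sketch: loop one configuration over all five datasets
mkdir -p results
for DATASET in 20ng R8 R52 ohsumed mr; do
    python run_text_classification.py --model_type mlp --tokenizer_name "bert-base-uncased" --bow_aggregation "tfidf" --mlp_num_layers 1 --batch_size 16 --learning_rate "0.001" --epochs 100 --num_workers 4 --results_file "results/results_tfidf_mlp.csv" "$DATASET"
done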