Time-boxed project on research article classification
Report under report/main.pdf
All code under src
(extreme) multi-label / hierarchical / multi-level text classification
Arxiv dataset: [HF_datasets](https://huggingface.co/datasets/arxiv_dataset/blob/main/arxiv_dataset.py), hyperannotated with commentary
Baseline: BERT
- Better encoder: DeBERTa
- Better pre-training: SciBERT
- Better loss: Two-way Multi-Label Loss
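
For readers unfamiliar with the Two-way Multi-Label Loss, here is a minimal pure-Python sketch of the idea as commonly described: a softplus over temperature-scaled log-sum-exp terms for positive and negative logits, computed once per sample (rows) and once per class (columns), then summed. It is an illustration under assumptions, not the implementation in `src`. The function names are hypothetical, and reading `Tp` and `Tn` as the positive and negative temperatures is an assumption based on the flag names.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def one_way(logits, labels, Tp, Tn):
    """Loss for a single row or column; returns None if it has no positives."""
    pos = [-x / Tp for x, y in zip(logits, labels) if y == 1]
    neg = [x / Tn for x, y in zip(logits, labels) if y == 0]
    if not pos:
        return None  # skipped: nothing to rank against (assumption)
    if not neg:
        return 0.0   # no negatives to push below the positives
    return softplus(Tn * logsumexp(neg) + Tp * logsumexp(pos))

def mean_defined(values):
    """Mean over the entries that are not None (0.0 if there are none)."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else 0.0

def two_way_loss(logits, labels, Tp=4.0, Tn=1.0):
    """Sum of the sample-wise (rows) and class-wise (columns) terms."""
    rows = [one_way(x, y, Tp, Tn) for x, y in zip(logits, labels)]
    cols = [one_way(x, y, Tp, Tn)
            for x, y in zip(zip(*logits), zip(*labels))]
    return mean_defined(rows) + mean_defined(cols)

# Toy batch: one sample, two classes, class 0 is the only positive.
loss_good = two_way_loss([[10.0, -10.0]], [[1, 0]])  # positive ranked on top
loss_bad = two_way_loss([[-10.0, 10.0]], [[1, 0]])   # negative ranked on top
```

The loss is small when every positive logit sits above every negative one, and grows roughly linearly with the margin by which the ranking is violated.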
Alternate approaches:
- Few-shot classification (Setfit)
- Generative models for instruction tuning (Llama2)
- accuracy
- precision
- recall
- F1
- Hamming loss
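
As a quick reference for how the metrics listed above behave in the multi-label setting, the sketch below computes them on binary indicator matrices in plain Python. It assumes micro-averaging for precision, recall and F1, and subset (exact-match) accuracy. The report may use different averaging, and `multilabel_metrics` is a hypothetical helper, not a function from `src`.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute multi-label metrics from lists of 0/1 label rows."""
    tp = fp = fn = mismatches = exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)  # all labels of the sample correct
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
            mismatches += int(t != p)
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": exact / n_samples,              # subset accuracy
        "precision": precision,                     # micro-averaged
        "recall": recall,                           # micro-averaged
        "f1": f1,                                   # micro-averaged
        "hamming_loss": mismatches / (n_samples * n_labels),
    }

# Two samples, three labels: one missed label and one spurious label.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
metrics = multilabel_metrics(y_true, y_pred)
```

Subset accuracy is strict (a single wrong label fails the whole sample), while Hamming loss counts the fraction of individual label decisions that are wrong, so the two can diverge sharply when there are many labels.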
See wandb report @ https://wandb.ai/jordy-vlan/scientific-text-classification
To reproduce the results in the report, run the command recorded for the corresponding model in wandb. For example, to reproduce the results of the [(current) best reported model](https://wandb.ai/jordy-vlan/scientific-text-classification/runs/zk81z3qc/overview?workspace=user-jordy-vlan), run the following command:
python src/baseline_multilabel.py --experiment_name SciBERT_twowayloss_25K_bs64 --model_name_or_path allenai/scibert_scivocab_uncased --output_dir ../results --seed 42 --evaluation_strategy steps --per_device_train_batch_size 64 --gradient_accumulation_steps 1 --learning_rate 2e-5 --num_train_epochs 1 --max_steps 25000 --logging_strategy steps --logging_steps 0.05 --save_steps 0.2 --eval_steps 0.2 --criterion TwoWayLoss --Tp 4.0 --Tn 1.0
The scripts require Python >= 3.8 and a conda environment, which can be created from environment.yml and activated as follows:
conda env create -f environment.yml # creates the environment
conda activate aapd # activates the environment
