SeqAL

SeqAL is a sequence labeling active learning framework based on Flair.

Installation

SeqAL is available on PyPI:

pip install seqal

SeqAL officially supports Python 3.8+.

Usage

To understand what SeqAL can do, we first introduce the pool-based active learning cycle.

Step 0: Prepare seed data (a small number of labeled data used for training)
Step 1: Train the model with seed data
- Step 2: Predict unlabeled data with the trained model
- Step 3: Query informative samples based on predictions
- Step 4: Annotator (Oracle) annotate the selected samples
- Step 5: Input the new labeled samples to labeled dataset
- Step 6: Retrain model
Repeat step2~step6 until the f1 score of the model beyond the threshold or annotation budget is no left

SeqAL can cover all steps except step 0 and step 4. Because there is no 3rd part annotation tool, we can run below script to simulate the active learning cycle.

python examples/run_al_cycle.py \
  --text_column 0 \
  --tag_column 1 \
  --data_folder ./data/sample_bio \
  --train_file train_seed.txt \
  --dev_file dev.txt \
  --test_file test.txt \
  --pool_file labeled_data_pool.txt \
  --tag_type ner \
  --hidden_size 256 \
  --embeddings glove \
  --use_rnn False \
  --max_epochs 1 \
  --mini_batch_size 32 \
  --learning_rate 0.1 \
  --sampler MaxNormLogProbSampler \
  --query_number 2 \
  --token_based False \
  --iterations 5 \
  --research_mode True

Parameters:

Dataset setup: -text_column: which column is text in CoNLL format. -tag_column: which column is tag in CoNLL format. -data_folder: the folder path of data. -train_file: file name of training (seed) data. -dev_file: file name of validation data. -test_file: file name of test data. -pool_file: file name of unlabeled data pool data.
Tagger (model) setup: -tag_type: tag type of sequence, "ner", "pos" etc. -hidden_size: hidden size of model -embeddings: embedding type -use_rnn: if true, use Bi-LSTM CRF model, else CRF model.
Training setup -max_epochs: number of epochs in each round for active learning. -mini_batch_size: batch size. -learning_rate: learning rate.
Active learning setup -sampler: sampling method, "LeastConfidenceSampler", "MaxNormLogProbSampler", etc. -query_number: number of data to query in each round. -token_based: if true, count data number as token based, else count data number on sentence based. -iterations: number of active learning round. -research_mode: if true, simulate the active learning cycle with real labels, else with predicted labels.

More explanations about the parameters in the tutorials.

You can find the script in examples/run_al_cycle.py or examples/active_learning_cycle_research_mode.py. If you want to connect SeqAL with an annotation tool, you can see the script in examples/active_learning_cycle_annotation_mode.py.

Tutorials

We provide a set of quick tutorials to get you started with the library.

Performance

Active learning algorithms achieve 97% performance of the best deep model trained on full data using only 30% of the training data on the CoNLL 2003 English dataset. The CPU model can decrease the time cost greatly only sacrificing a little performance.

See performance for more detail about performance and time cost.

Contributing

If you have suggestions for how SeqAL could be improved, or want to report a bug, open an issue! We'd love all and any contributions.

For more, check out the Contributing Guide.