Big Data and ML - Spring 2021

Brad Windsor (bw1879), Kevin Choi (kc2296)

Thistle

  1. Final project proposal

  2. Model setup. Download the pretrained BERT model (bert-base-nli-stsb-mean-tokens) from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/ and convert it into a Rust-compatible format. Run the following commands:

mkdir -p models/bert-base-nli-stsb-mean-tokens

wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/bert-base-nli-stsb-mean-tokens.zip

unzip models/bert-base-nli-stsb-mean-tokens.zip -d models/bert-base-nli-stsb-mean-tokens

python3 -m venv thistle-env

source thistle-env/bin/activate

pip install torch

export PWD=`pwd`

python3 utils/convert_model.py $PWD/models/bert-base-nli-stsb-mean-tokens/0_BERT/pytorch_model.bin
  3. Rust toolchain. This project uses some Rust features that are not yet available on the stable channel. To install the nightly toolchain and make it the default, run:
rustup toolchain install nightly

rustup default nightly
  4. Integration testing
cargo test

Running on the MS MARCO dataset

Data preparation

mkdir data
cd data
wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz
tar -xf triples.train.small.tar.gz
export SIZE=10000 # or any other size
head -n $SIZE triples.train.small.tsv > data.tsv
# delete every byte outside the 7-bit ASCII range (octal 0-177)
LC_ALL=C tr -dc '\0-\177' <data.tsv >data_cleaned.tsv
cd ..
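
The `head` and `tr` steps above can also be sketched in Rust, which may help if the cleaning is ever moved into the project itself. This is a minimal illustration, not code from the repository: the function names `head_lines`, `strip_non_ascii`, and `clean_file` are hypothetical, and the sketch assumes the input file fits in memory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Return the first `n` newline-terminated lines of `input`, like `head -n`.
/// If `input` has fewer than `n` lines, it is returned unchanged.
/// (Hypothetical helper, not part of the repository.)
fn head_lines(input: &[u8], n: usize) -> &[u8] {
    if n == 0 {
        return &input[..0];
    }
    let mut seen = 0;
    for (i, &b) in input.iter().enumerate() {
        if b == b'\n' {
            seen += 1;
            if seen == n {
                // Include the newline that ends the n-th line.
                return &input[..=i];
            }
        }
    }
    input
}

/// Drop every byte above 0x7F, like `LC_ALL=C tr -dc '\0-\177'`.
/// Multi-byte UTF-8 characters are removed entirely, because all of
/// their bytes have the high bit set.
fn strip_non_ascii(input: &[u8]) -> Vec<u8> {
    input.iter().copied().filter(|b| b.is_ascii()).collect()
}

/// Read `src`, keep the first `size` lines, strip non-ASCII bytes,
/// and write the result to `dst`. Paths are supplied by the caller.
fn clean_file(src: &Path, dst: &Path, size: usize) -> io::Result<()> {
    let raw = fs::read(src)?;
    let cleaned = strip_non_ascii(head_lines(&raw, size));
    fs::write(dst, cleaned)
}
```

Under these assumptions, calling `clean_file(Path::new("data/triples.train.small.tsv"), Path::new("data/data_cleaned.tsv"), 10000)` from a `main` function would give the same result as the shell pipeline, without the intermediate `data.tsv`.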

To run:

# see run_eval.rs
cargo run > output100.txt

References