JurisBERT - Brazilian Legal Text Dataset

Brazilian legal text dataset for training transformer-based models.

Requirements

Before running, install the Firefox WebDriver for Selenium (geckodriver). Download the latest release from https://github.com/mozilla/geckodriver/releases and put the executable in a directory that is on your PATH.
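A quick way to confirm that the driver is reachable before starting the pipeline is to look it up on PATH. The snippet below is a minimal sketch and not part of this repository; the helper name find_geckodriver is hypothetical, and it assumes the executable keeps its default name, geckodriver.

```python
import shutil


def find_geckodriver(name="geckodriver"):
    """Return the full path of the driver if it is on PATH, otherwise None.

    The default name "geckodriver" is an assumption; pass another name
    if the executable was renamed.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_geckodriver()
    if path is None:
        print("geckodriver was not found on PATH; Selenium will likely fail to start Firefox.")
    else:
        print(f"geckodriver found at {path}")
```

If the check reports that the driver is missing, add the directory that contains the executable to PATH and run it again.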

Run the command below to install all required dependencies.

pip install -r requirements.txt

To generate a dataset for MLM (masked language modeling) pre-training, run the command below. It executes the whole pipeline and generates 2 files in output/mlm/.

python mlm.py all

To run an individual task, pass its name as a parameter:

python mlm.py scrap
python run.py parse
python run.py export

To generate a dataset for STS (semantic textual similarity) fine-tuning, run the command below. It executes the whole pipeline and generates files in output/sts/{sts_type}/.

python sts.py all --sts_type "binary | scale | triplet | benchmark"
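To produce every STS variant in one go, the command can be repeated once per type. The sketch below only builds and prints those commands. It assumes that the quoted string above lists four alternative values (binary, scale, triplet, benchmark) and that --sts_type takes exactly one of them per run; the helper name build_sts_commands is hypothetical and not part of this repository.

```python
import shlex
import sys

# Assumption: the four values shown in the command above are alternatives,
# and --sts_type accepts one of them per invocation.
STS_TYPES = ["binary", "scale", "triplet", "benchmark"]


def build_sts_commands(python=sys.executable):
    """Return one argument list per STS type, ready to pass to subprocess.run."""
    return [[python, "sts.py", "all", "--sts_type", sts_type] for sts_type in STS_TYPES]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for cmd in build_sts_commands():
        print(shlex.join(cmd))
```

To actually run the pipeline for each type, pass every list to subprocess.run(cmd, check=True) from the repository root.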

If you only want to download the pre-generated datasets, use the links below: