Brazilian Legal Text Dataset for training transformer-based models.
Before running, install a Firefox WebDriver (geckodriver) for Selenium. Download the latest release from https://github.com/mozilla/geckodriver/releases and put the executable in a directory on your PATH.
Run the command below to install all required dependencies.
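As a quick sanity check, the sketch below looks up the driver on PATH using Python's standard library. The helper name `find_geckodriver` is hypothetical and not part of this repository; it only assumes the executable is named `geckodriver`, as in the Mozilla releases.

```python
import shutil


def find_geckodriver(name: str = "geckodriver"):
    """Return the full path of the driver if it is on PATH, else None.

    `find_geckodriver` is an illustrative helper, not part of this project.
    """
    return shutil.which(name)


driver_path = find_geckodriver()
if driver_path is None:
    print("geckodriver was not found on PATH; Selenium will not be able to start Firefox.")
else:
    print(f"geckodriver found at: {driver_path}")
```

If the driver is reported as missing, recheck that the directory containing the executable is listed in PATH for the shell you run the pipeline from.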
pip install -r requirements.txt
To generate a dataset for MLM pre-training.
Run the command below to execute the whole pipeline, which generates 2 files in output/mlm/.
python mlm.py all
To run individual tasks, pass the task name as a parameter:
python mlm.py scrap
python run.py parse
python run.py export
To generate a dataset for STS fine-tuning.
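Run the command below to execute the whole pipeline, which generates files in output/sts/{sts_type}/, where {sts_type} is the value passed to --sts_type (one of binary, scale, triplet, or benchmark).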
python sts.py all --sts_type "binary | scale | triplet | benchmark"
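To generate every STS variant in one go, a small driver script can call the command once per type. This is a minimal sketch, assuming `--sts_type` accepts a single value per invocation (inferred from the list of alternatives above, not confirmed by the project) and that it is run from the repository root. The names `build_sts_commands` and `run_all` are hypothetical helpers.

```python
import subprocess
import sys

# The four values listed for --sts_type in the command above.
STS_TYPES = ["binary", "scale", "triplet", "benchmark"]


def build_sts_commands(sts_types=STS_TYPES):
    """Build one `sts.py all` command per STS type.

    Assumption: --sts_type takes exactly one value per run.
    """
    return [
        [sys.executable, "sts.py", "all", "--sts_type", sts_type]
        for sts_type in sts_types
    ]


def run_all(sts_types=STS_TYPES):
    """Run the pipeline once per type, stopping at the first failure."""
    for cmd in build_sts_commands(sts_types):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if sts.py exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `run_all()` from the repository root would then write each variant to its own output/sts/{sts_type}/ directory, provided the assumption about `--sts_type` holds.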
To download only the pre-generated datasets, use the links below: