Accordingly, Being-in-the-world is a Being-ahead-of-itself-already-in-the-world as Being-alongside beings encountered within-the-world.
Martin Heidegger
Language, as a fundamental characteristic of humans and society, is at the center of NLP. It holds the potential for great enlightenment as well as great concealment. Language and thinking must be brought into harmony.
Simplifying language leads to the democratization of knowledge: it can provide access to knowledge that might otherwise remain hidden. No more complex language!
Deep Martin aims to contribute to this.
The project explores different models for making complicated and complex content accessible to everyone.
It follows the approach of Simple Wikipedia.
Two different approaches are available. One uses the super nice Hugging Face library to build various state-of-the-art sequence-to-sequence models. The other is a self-made transformer, which is mainly about trying out different approaches.
To use the Hugging Face implementation, you need to provide a dataset. It needs one column with the normal version (`Normal`) and one with the simplified version (`Simple`). The `HuggingFaceDataset` class can help you with this.
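As a rough illustration, a dataset in this shape could be assembled with pandas and the `datasets` library. The column names `Normal` and `Simple` come from the project; everything else in the snippet (file names, example sentences) is a hypothetical sketch, not the `HuggingFaceDataset` API itself.

```python
import pandas as pd
from datasets import Dataset

# Illustrative sentence pairs; a real dataset would come from a corpus
# such as ASSET or the Wikipedia-based datasets mentioned below.
df = pd.DataFrame(
    {
        "Normal": [
            "The committee postponed the ratification of the agreement indefinitely.",
            "Photosynthesis converts light energy into chemical energy in plants.",
        ],
        "Simple": [
            "The committee put off approving the agreement.",
            "Plants use light to make their own food.",
        ],
    }
)

# Save as CSV so it can be handed to the training script,
# or wrap it directly as a Hugging Face dataset.
df.to_csv("simplification_dataset.csv", index=False)
dataset = Dataset.from_pandas(df)
print(dataset)
```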
To train a model, you then simply run something like:
```bash
# --eval_steps and --warmup_steps should be chosen based on the size of the dataset.
# --ds_path:               path to your dataset.
# --save_model_path:       where the trained model should be stored.
# --training_output_path:  where the checkpoints and the training data should be stored.
# --tokenizer_id:          path or identifier of a Hugging Face tokenizer.
python /your/path/to/deep-martin/src/hf_transformer_trainer.py \
    --eval_steps 5000 \
    --warmup_steps 800 \
    --ds_path /path/ \
    --save_model_path /path/ \
    --training_output_path /path/ \
    --tokenizer_id bert-base-cased
```
There are a lot more parameters. Check out `hf_transformer_trainer.py` to get an overview.
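Once training has finished, the saved model can be loaded back for inference. The snippet below is only a sketch: it assumes the checkpoint in `--save_model_path` is a standard Hugging Face sequence-to-sequence (encoder-decoder) model and that `bert-base-cased` was used as the tokenizer; adjust it to whatever you actually trained.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "/path/to/saved/model"  # the directory passed as --save_model_path
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "The committee postponed the ratification of the agreement indefinitely."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Beam search usually yields more fluent simplifications than greedy decoding.
output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=64,
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```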
The self-made transformer is more for experimenting. Have a look at the code to get an overview of what is going on.
To train the self-made transformer, a train and a test dataset in CSV format are needed. They will be transformed into a suitable dataset at the beginning of the training. As with the transformer above, the dataset needs one column with the normal version (`Normal`) and one with the simplified version (`Simple`).
To start the training you can run:
```bash
# --ds_path:          path of the folder which contains `train_file.csv` and `test_file.csv`.
# --save_model_path:  where the trained model should be stored.
python /your/path/to/deep-martin/src/custom_transformer_trainer.py \
    --ds_path /path \
    --train_file train_file.csv \
    --test_file test_file.csv \
    --epochs 3 \
    --save_model_path /path/
```
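If you only have a single CSV, a train/test split along these lines will produce the two expected files. The file names match the command above; the split ratio and paths are just placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Full dataset with the required `Normal` and `Simple` columns.
df = pd.read_csv("simplification_dataset.csv")

# Hold out a small test set; the 90/10 ratio is only an example.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

# The custom trainer expects both files in the folder passed via --ds_path.
train_df.to_csv("/path/train_file.csv", index=False)
test_df.to_csv("/path/test_file.csv", index=False)
```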
Let's talk about the problems in this project.
As so often, one problem lies in obtaining high-quality data.
Multiple datasets were used for this project. You can find them here.
While the ASSET dataset offers very good quality, thanks to the multiple simplifications of each record, it is simply too small for training a transformer.
This problem is also true for other datasets.
The two Wikipedia-based datasets unfortunately suffer from a lack of quality: either a record is not a simplification at all but simply the same article, or the simplification is of poor quality. In both cases, using such records led to worse results.
To increase the overall quality, the records were compared using Doc2Vec and cosine distance, and unsuitable pairs were filtered out.
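The snippet below is a minimal sketch of this kind of filtering with gensim's Doc2Vec. The thresholds, model parameters, and example pairs are made up for illustration and are not the values used in the project.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

pairs = [
    ("The committee postponed the ratification of the agreement indefinitely.",
     "The committee put off approving the agreement."),
    ("Photosynthesis converts light energy into chemical energy in plants.",
     "Photosynthesis converts light energy into chemical energy in plants."),  # near-duplicate
]

# Train a small Doc2Vec model on all normal and simple sentences together.
docs = [
    TaggedDocument(simple_preprocess(text), [i])
    for i, text in enumerate(t for pair in pairs for t in pair)
]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Keep only pairs that are related but not (near-)identical.
LOWER, UPPER = 0.4, 0.95  # illustrative thresholds
filtered = []
for normal, simple in pairs:
    similarity = cosine_similarity(
        model.infer_vector(simple_preprocess(normal)),
        model.infer_vector(simple_preprocess(simple)),
    )
    if LOWER <= similarity <= UPPER:
        filtered.append((normal, simple))

print(f"Kept {len(filtered)} of {len(pairs)} pairs")
```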
Transformers are huge; they need a lot of data and a lot of time to train. Google Colab can help, but it is not the most convenient way. With the help of AWS EC2, things can be sped up a lot, and training larger models also becomes possible.
Since the self-made transformer is a work-in-progress project, it is never finished. It is made for learning and trying things out. One interesting idea is to use the transformer as the generator in a GAN to improve the overall output.