This repository contains the code for the hands-on session of the paper talk on building foundation models using BERT.
- Fork and clone the repository
- Create a virtual environment using conda (recommended) or venv and activate it. Then, install the dependencies using the following commands:

  ```bash
  conda create -n bert python=3.10
  conda activate bert
  pip install -r requirements.txt
  ```
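  If you prefer venv over conda, an equivalent setup might look like this (assuming Python 3.10 is available on your PATH as `python3.10`):

  ```bash
  python3.10 -m venv .venv         # create the environment in .venv
  source .venv/bin/activate        # on Windows: .venv\Scripts\activate
  pip install -r requirements.txt
  ```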
- Set up HuggingFace code repository
  - Visit HuggingFace's website and create an account. Remember your username!
  - Navigate to Access Tokens and generate a token with write access.
  - Copy your username and token to `src/train_tokenizer.py` (L13, L15) and `src/train_model.py` (L10, L62).
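As a rough illustration of what those edits enable, here is how a script typically authenticates with the Hub via `huggingface_hub`. The variable names below are assumptions for illustration, not the identifiers actually used at the referenced lines:

```python
# Illustrative sketch only -- the exact names at the referenced lines of
# src/train_tokenizer.py and src/train_model.py may differ.
from huggingface_hub import login

HF_USERNAME = "your_username"  # the username you registered with
HF_TOKEN = "hf_..."            # the write-access token you generated

# login() authenticates this process with the HuggingFace Hub so that
# later push-to-hub calls can create and update your repositories.
login(token=HF_TOKEN)
```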
- Training on your data (optional) - exported chat history from the `PESU26`, `PESU27` and `Tech Talks` groups is provided in the `data` directory. If you wish to train on your own data, follow the steps below:
  - Export your chat history from multiple WhatsApp groups. You can do this by opening the chat, clicking on the three dots in the top right corner and selecting `More > Export chat`. Select `Without Media` and export the chat history. Store all the `.txt` files in a directory.
  - You now need to convert the `.txt` files into `.json` files that the trainer can use. To do this, run the following command:

    ```bash
    python src/whatsapp.py \
        --input <path to directory containing .txt files> \
        --output <path to output directory>
    ```

    Depending on the datetime format used by your phone, you might need to modify the regex stored in the `datetime_format` variable in the `src/whatsapp.py` file (a sketch of a common pattern follows after this list). If you are unsure about the regex, you can use regex101 to test it out. If you are unaware of how to write regex, feel free to reach out to me.
  - You can now train the tokenizer on your data. Modify L7 in `src/train_tokenizer.py` to add the path to the newly generated `messages.json` file and then run the script:

    ```bash
    python src/train_tokenizer.py
    ```

    After running it, you should verify two things (a quick programmatic check is sketched after this list):
    - A new directory called `tokenizer` has been created in the current working directory.
    - Visit `https://huggingface.co/<your_username>/tokenizer`. You should be able to see the repository that holds your new tokenizer.
  - Make sure to also push the changes to your forked repository.
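As referenced in the conversion step above, here is a minimal sketch of what a `datetime_format` pattern might look like and how to sanity-check it against one line of your export. Both the pattern and the sample line are assumptions for a common Android export format, not the value actually shipped in `src/whatsapp.py`:

```python
import re

# One common Android export layout: "12/03/23, 9:41 pm - Alice: message".
# Your phone may use a different field order or a 24-hour clock; adjust
# the pattern (and test it on regex101) until it matches your export.
datetime_format = r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}\s?(?:am|pm|AM|PM)? - "

sample = "12/03/23, 9:41 pm - Alice: see you at the talk!"  # made-up line
print(bool(re.match(datetime_format, sample)))  # True if the pattern fits
```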
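Similarly, a quick way to carry out the verification programmatically is to load the freshly pushed tokenizer straight from the Hub (replace the placeholder with your username; the repository name follows from the URL above):

```python
from transformers import AutoTokenizer

# Pulling the tokenizer you just pushed; a failure here usually means the
# push did not complete or the username/repository name is wrong.
tokenizer = AutoTokenizer.from_pretrained("<your_username>/tokenizer")
print(tokenizer.tokenize("hello from the hands-on session"))
```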
- Set up ngrok
- Training the model - this will be covered in the session. If you wish to train the model beforehand, follow the steps below:
- Push any changes to your forked repository
- Visit this Colab notebook and carry out the steps mentioned in the notebook. Make sure to choose a GPU runtime, run it cell by cell and replace the values of the tokens where mentioned.
- The notebook takes ~6 hours to run per epoch. It will automatically save the progress after every epoch and upload the model to your HuggingFace repository; an illustrative configuration is sketched below.
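The save-and-upload behaviour described above corresponds to a `transformers` configuration along these lines. This is an illustration of the mechanism, not the notebook's actual code; the output directory, epoch count, and repository id are placeholders:

```python
from transformers import TrainingArguments

# Sketch of arguments that save a checkpoint after every epoch and push
# each one to the Hub, mirroring what the notebook is described as doing.
training_args = TrainingArguments(
    output_dir="bert",                     # local checkpoint directory (assumed)
    num_train_epochs=3,                    # assumed epoch count
    save_strategy="epoch",                 # save progress after every epoch
    push_to_hub=True,                      # upload checkpoints to the Hub
    hub_model_id="<your_username>/bert",   # placeholder repository id
)
```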
- Running inference - this will be covered in the session. If you wish to run inference beforehand, follow the steps below:
- Push any changes to your forked repository
- Visit this Colab notebook and carry out the steps mentioned in the notebook. Make sure to choose a GPU runtime, run it cell by cell and replace the values of the tokens where mentioned.
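If you want to poke at the trained model outside Colab, a minimal local inference sketch looks like this. The repository id `<your_username>/bert` is a placeholder; substitute whatever the training notebook pushed to, and note this assumes a BERT-style masked-LM checkpoint with a `[MASK]` token:

```python
from transformers import pipeline

# fill-mask suits a BERT-style model: it predicts the token behind [MASK].
fill_mask = pipeline(
    "fill-mask",
    model="<your_username>/bert",          # placeholder model repository
    tokenizer="<your_username>/tokenizer", # the tokenizer trained earlier
)

for prediction in fill_mask("the talk starts at [MASK] pm"):
    print(prediction["token_str"], prediction["score"])
```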
If you did not try training your own model, it is highly recommended you do so! You can also try out some other fine-tuning tasks like:
- Given a message, predict the group, sender and recipient
- Perform clustering on the messages and find similar ones
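For the clustering idea, a rough starting point is to mean-pool the model's hidden states into one embedding per message and cluster those; both repository ids below are placeholders:

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<your_username>/tokenizer")
model = AutoModel.from_pretrained("<your_username>/bert")
model.eval()

messages = [
    "anyone up for the tech talk tomorrow?",
    "is the talk still on for tomorrow",
    "submissions close tonight",
    "last day to submit is today",
]
batch = tokenizer(messages, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq_len, dim)

# Mean-pool over real (non-padding) tokens to get one vector per message.
mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

labels = KMeans(n_clusters=2).fit_predict(embeddings.numpy())
print(labels)  # messages sharing a label landed in the same cluster
```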