- Writing sample tokenizer code and testing it, using the sample code provided on GitHub and Google Colab.
Steps:
  - Installing the `arabert` and `transformers` modules.
  - Using `from transformers import AutoTokenizer, AutoModel` to import the tokenizer and the model builder.
  - Using `from arabert.preprocess import ArabertPreprocessor` to import the text preprocessing tool.
  - Setting the model name with `model_name = "aubmindlab/bert-base-arabertv2"`.
  - Testing the tokenizer and the preprocessor:
    - Tested with different forms of Arabic text:
      - العربية الفصحى (Standard Arabic, without diacritics)
      - الْعَرَبِيَّةِ الْفُصْحَى (the same phrase, diacritized using Shakkala)
      - Egyptian Arabic text.
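
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the helper names `load_tools` and `tokenize_samples`, the `SAMPLES` dictionary, and its labels are assumptions introduced here, and only the two sample strings quoted in the list are included. The v2 preprocessor may also need the `farasapy` package for segmentation, which is likewise an assumption about the environment.

```python
# Assumed install step (run once in a shell or a Colab cell):
#   pip install arabert transformers

MODEL_NAME = "aubmindlab/bert-base-arabertv2"

# The two sample strings quoted in the list above (labels are hypothetical).
SAMPLES = {
    "msa": "العربية الفصحى",
    "msa_diacritized": "الْعَرَبِيَّةِ الْفُصْحَى",
}


def load_tools(model_name=MODEL_NAME):
    """Build the preprocessor and tokenizer (downloads files on first use)."""
    # Imported lazily so tokenize_samples can be used without these packages.
    from transformers import AutoTokenizer
    from arabert.preprocess import ArabertPreprocessor

    preprocessor = ArabertPreprocessor(model_name=model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return preprocessor, tokenizer


def tokenize_samples(samples, preprocessor, tokenizer):
    """Preprocess each text, then split it into tokens."""
    results = {}
    for label, text in samples.items():
        cleaned = preprocessor.preprocess(text)
        results[label] = tokenizer.tokenize(cleaned)
    return results
```

To try it, call `preprocessor, tokenizer = load_tools()` and then print the result of `tokenize_samples(SAMPLES, preprocessor, tokenizer)`. Keeping the loading step separate from the tokenizing step makes it easy to add further samples, such as an Egyptian Arabic sentence, without reloading the model files.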
-