ArabBert_Tokenizer

Uses the “aubmindlab/bert-base-arabertv2” model from AUB MIND Lab's AraBERT to build a simple Arabic text tokenizer.

Primary language: Jupyter Notebook

  • ArabBERT_Tokenizer: Open In Colab

Goal:-

  • Write and test a sample tokenizer, using the sample code provided on GitHub and Google Colab.

Steps:-

  1. Install the arabert and transformers packages.
  2. Import the tokenizer and the model loader with from transformers import AutoTokenizer, AutoModel.
  3. Import the text preprocessing tool with from arabert.preprocess import ArabertPreprocessor.
  4. Select the model with model_name = "aubmindlab/bert-base-arabertv2".
  5. Test the tokenizer and the preprocessor:
  • Tested with different forms of Arabic text:
    • العربية الفصحى (Standard Arabic, without diacritics)
    • الْعَرَبِيَّةِ الْفُصْحَى (Standard Arabic with diacritics)
      Diacritized using Shakkala.
    • Egyptian Arabic text.
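
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names MODEL_NAME, SAMPLES, count_diacritics and tokenize_samples are hypothetical, and it assumes that arabert and transformers are already installed (for example with pip install arabert transformers in a Colab cell) and that the model can be downloaded. Depending on the arabert version, preprocessing for the v2 model may also need the Farasa segmenter to be available.

```python
import unicodedata

# Model identifier from step 4.
MODEL_NAME = "aubmindlab/bert-base-arabertv2"

# The two Standard Arabic samples listed above. Egyptian Arabic text can be
# added under another key; none is given here, so none is invented.
SAMPLES = {
    "plain": "العربية الفصحى",
    "diacritized": "الْعَرَبِيَّةِ الْفُصْحَى",
}


def count_diacritics(text: str) -> int:
    """Count nonspacing combining marks (Unicode category Mn).

    Arabic diacritics such as fatha, kasra, shadda and sukun fall in this
    category, so this gives a quick check of which samples are diacritized.
    """
    return sum(1 for ch in text if unicodedata.category(ch) == "Mn")


def tokenize_samples(samples: dict) -> dict:
    """Preprocess each sample with ArabertPreprocessor, then tokenize it.

    The heavy imports are done lazily so that the helper above can be used
    without arabert or transformers installed.
    """
    from transformers import AutoTokenizer
    from arabert.preprocess import ArabertPreprocessor

    preprocessor = ArabertPreprocessor(model_name=MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    results = {}
    for label, text in samples.items():
        cleaned = preprocessor.preprocess(text)  # text as the model expects it
        results[label] = tokenizer.tokenize(cleaned)  # list of subword tokens
    return results
```

Calling tokenize_samples(SAMPLES) and printing each entry makes it easy to compare the tokens produced for the plain and diacritized inputs side by side, while count_diacritics confirms which inputs actually carry diacritics before they reach the preprocessor.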