Data-Preprocessing-on-Egyptian-dialect-text


This repository goes over the steps of data preprocessing for Egyptian dialect text:


  1. Data Cleaning: The cleaning function removes noise from the text and delivers clean Arabic text without changing its meaning or content. The noise it removes includes:
    • Extra characters
    • Emojis
    • Non-Arabic characters
    • URLs
    • Punctuation
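
The cleaning step above can be sketched with the standard re module. This is a minimal sketch, not the repository's actual function: the name clean_text and the exact character ranges are assumptions. It keeps Arabic letters (U+0621 to U+063A and U+0641 to U+064A), Tashkeel marks (U+064B to U+0652) and whitespace, and drops everything else, which also removes digits and the tatweel character.

```python
import re

# Assumed patterns for illustration; the original cleaning function may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Anything that is not an Arabic letter, a Tashkeel mark, or whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u063A\u0641-\u0652\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs, emojis, punctuation and non-Arabic characters."""
    # URLs go first, while their Latin characters are still intact.
    text = URL_RE.sub(" ", text)
    # Replace every remaining non-Arabic character with a space.
    text = NON_ARABIC_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("مرحبا 😀 hello http://example.com يا جماعة!!!"))
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together when a symbol sits between them.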

  2. Data Normalization: In addition to cleaning, we normalize the Egyptian text by:
    • Removing elongation, that is, letters repeated for emphasis.
    • Correcting the spelling of the Arabic sentences.
    • Removing Tashkeel (diacritics) from the characters.
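
A possible shape for the normalization step, again using only the re module. The function name normalize_text, the threshold of three repeated letters, and the Tashkeel range U+064B to U+0652 are assumptions made for this sketch. Spelling correction is left as an optional callable (the hypothetical correct parameter) so that a corrector such as ar-corrector can be plugged in without this example depending on its API.

```python
import re

# Assumed ranges: the common Tashkeel marks and the tatweel (kashida) character.
TASHKEEL_RE = re.compile(r"[\u064B-\u0652]")
TATWEEL_RE = re.compile(r"\u0640")
# A character repeated three or more times in a row is treated as elongation.
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def normalize_text(text, correct=None):
    """Strip Tashkeel, remove elongation, then optionally spell-correct."""
    # Diacritics are removed first so they cannot interrupt a run of letters.
    text = TASHKEEL_RE.sub("", text)
    text = TATWEEL_RE.sub("", text)
    # Collapse each elongated run to a single letter. Legitimate double
    # letters (two in a row) are left untouched by this rule.
    text = ELONGATION_RE.sub(r"\1", text)
    # `correct` is a hypothetical hook for a spell checker.
    if correct is not None:
        text = correct(text)
    return text


if __name__ == "__main__":
    print(normalize_text("جميييييل"))
```

Collapsing only runs of three or more letters is a trade-off: it preserves genuine double letters, but an elongated double letter is reduced to one and would need the spell checker to restore it.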

  3. Data Visualization: Visualize the text data using an Arabic word cloud.
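
One way to build such a word cloud is sketched below. The helper names word_frequencies and build_word_cloud are hypothetical, and the sketch assumes the wordcloud, arabic_reshaper and python-bidi packages are installed, following the approach of the word_cloud Arabic example linked in the packages list. The font_path argument is also an assumption: it must point to a font file that contains Arabic glyphs.

```python
from collections import Counter


def word_frequencies(texts):
    """Count how often each whitespace-separated word occurs across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def build_word_cloud(frequencies, font_path, output_path="wordcloud.png"):
    """Render a word cloud image from a word -> count mapping."""
    # Third-party imports are local so the counting helper works without them.
    import arabic_reshaper
    from bidi.algorithm import get_display
    from wordcloud import WordCloud

    # Join Arabic letters into their contextual forms and reorder them for
    # right-to-left display before handing the words to the renderer.
    shaped = {
        get_display(arabic_reshaper.reshape(word)): count
        for word, count in frequencies.items()
    }
    cloud = WordCloud(font_path=font_path, background_color="white")
    cloud.generate_from_frequencies(shaped)
    cloud.to_file(output_path)
    return cloud


if __name__ == "__main__":
    freqs = word_frequencies(["يا جماعة", "يا مصر"])
    print(freqs.most_common(1))
```

Counting words separately from rendering makes it easy to inspect the most frequent terms before producing the image, and to filter stop words if needed.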



Packages and libraries used:

  1. Regular expressions (Python re module) https://docs.python.org/3/library/re.html
  2. Python ar-Corrector https://pypi.org/project/ar-corrector/
  3. Arabic word cloud https://amueller.github.io/word_cloud/auto_examples/arabic.html