/UNSUPERVISED-LEARNING-PROJECT

The fourth project in the SDAIA Academy T5 Data Science Bootcamp.

Primary language: Jupyter Notebook

Arabic dialects

Introduction

This project serves the Arabic language by classifying Arabic dialects on social media using NLP and unsupervised machine learning algorithms. Our model analyzes social media texts and then categorizes them into the major Arabic dialects (Nilotic, Gulf, Levantine, or Moroccan). The model reached an accuracy of 0.75.

Design And Data Description

  • We worked on a public dataset from Kaggle:
  • https://www.kaggle.com/ahmedessam21/arabic-dialect-identificationfreelancing/version/2?select=train.tsv
  • It contains 62,000 Arabic tweets, but it was not ready to use. During preprocessing we cleaned the data by removing “tashkeel” (diacritics), removing repeated letters, correcting spelling, simplifying some writing variants, removing stop words, and so on.
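The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `clean_tweet`, the tiny stop-word list, and the choice to treat "simplifying writing" as alef normalization are all assumptions made here. Spelling correction is not shown.

```python
import re

# Arabic diacritics (tashkeel, U+064B to U+0652) plus tatweel (U+0640).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0640]")
# Any character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")
# Alef variants (madda, hamza above, hamza below); assumed normalization.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Tiny illustrative stop-word list (hypothetical, not the project's list).
STOP_WORDS = {"في", "من", "على"}


def clean_tweet(text):
    """Return a cleaned copy of one tweet (hypothetical helper)."""
    text = TASHKEEL.sub("", text)            # drop diacritics
    text = REPEATS.sub(r"\1", text)          # collapse long letter runs to one
    text = ALEF_VARIANTS.sub("\u0627", text)  # map alef variants to bare alef
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

Runs of exactly two identical letters are left alone here, since doubled letters can be legitimate spelling; the threshold of three is a judgment call.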
  • Methodology:

  • Collecting data
  • Preprocessing
  • Vectorization
  • Topic modeling
  • Tweet labeling
  • Exploratory Data Analysis
  • Prepare data for modeling
  • Classification
  • Test tweet
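To make the vectorization step concrete, here is a small bag-of-words sketch in plain Python. The README does not say which vectorizer the project uses, so this is only one possible approach; the helper names `build_vocabulary` and `vectorize` are hypothetical. The resulting count vectors are the kind of input that the topic modeling and classification steps would consume.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted token order."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(doc, vocab):
    """Return a term-count vector for one document.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec
```

In practice the vocabulary would be built from the cleaned training tweets only, and the same vocabulary reused when vectorizing a new tweet at test time.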

    Run app.py

    Use the command below to start the Flask API:

    python app.py

    By default, Flask runs on port 5000.

    You should be able to view the homepage.

    Enter the URL and hit Predict. If everything goes well, you should see the predicted Verification on the HTML page. Check the output here: http://127.0.0.1:5000/predict
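The endpoint could also be exercised from a script instead of the browser. The sketch below only builds the request. It assumes that `/predict` accepts a form-encoded POST and that the form field is called `text`; neither is confirmed by this README, so check `app.py` and the HTML form for the real method and field name. `build_predict_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_predict_request(text, host="127.0.0.1", port=5000, field="text"):
    """Build (but do not send) a form-encoded POST to the /predict route.

    The field name and the POST method are assumptions, not taken from app.py.
    """
    url = f"http://{host}:{port}/predict"
    body = urlencode({field: text}).encode("utf-8")
    return Request(url, data=body, method="POST")
```

With the Flask server running, passing the returned object to `urllib.request.urlopen` would send it and return the rendered HTML response.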