Masader

The first online catalogue for Arabic NLP datasets. This catalogue contains more than 600 datasets, each with more than 25 metadata annotations, added by more than 40 contributors. You can view the list of all datasets on the website: https://arbml.github.io/masader/

Title Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
> https://arxiv.org/abs/2110.06744

Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Not to mention, the absence of a public catalogue that indexes all the publicly available datasets related to specific regions or languages. When we consider low-resource dialectical languages, for example, this issue becomes more prominent. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

Metadata

No. dataset number
Name name of the dataset
Subsets subsets of the datasets
Link direct link to the dataset or instructions on how to download it
License license of the dataset
Year year of publishing the dataset/paper
Language ar or multilingual
Dialect region such as Levant, country such as ar-EGY: (Arabic (Egypt)), or type such as Modern Standard Arabic
Domain social media, news articles, reviews, commentary, books, transcribed audio or other
Form text, audio or sign language
Collection style crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description short statement describing the dataset
Volume the size of the dataset in numbers
Unit unit of the volume, could be tokens, sentences, documents, MB, GB, TB, hours or other
Provider company or university providing the dataset
Related Datasets any datasets that are related in terms of content to the dataset
Paper Title title of the paper
Paper Link direct link to the paper pdf
Script writing system either Arab, Latn, Arab-Latn or other
Tokenized whether the dataset is segmented using morphology: Yes or No
Host the host website for the data, e.g. GitHub
Access the data is either free, upon-request or with-fee
Cost cost of the data if it is with-fee
Test split does the data contain test split: Yes or No
Tasks the tasks included in the dataset, separated by commas
Evaluation Set the data included in the evaluation suite by BigScience
Venue Title the venue title, e.g. ACL
Citations the number of citations
Venue Type conference, workshop, journal or preprint
Venue Name full name of the venue, e.g. Association for Computational Linguistics
authors list of the paper authors separated by comma
affiliations list of the paper authors' affiliations separated by comma
abstract abstract of the paper
Added by name of the person who added the entry
Notes any extra notes on the dataset
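
Several of the attributes above take values from a fixed list, so a new entry can be checked before it is submitted. The following is a minimal sketch of such a check, not part of Masader itself: the `ALLOWED` table and the `find_invalid_fields` helper are hypothetical names, the allowed values are copied from the attribute list above, and the key spellings (for example `Test Split`) are assumed to follow the sample record shown in this README. Comparison is case-insensitive, which is also an assumption.

```python
# Allowed values copied from the attribute list above. The table and the
# helper below are hypothetical illustrations, not part of Masader.
ALLOWED = {
    "Form": {"text", "audio", "sign language"},
    "Access": {"free", "upon-request", "with-fee"},
    "Venue Type": {"conference", "workshop", "journal", "preprint"},
    "Tokenized": {"yes", "no"},
    "Test Split": {"yes", "no"},
}


def find_invalid_fields(entry):
    """Return the controlled fields whose value is not in the allowed list."""
    invalid = {}
    for field, allowed in ALLOWED.items():
        value = str(entry.get(field, "")).strip().lower()
        # Empty values are treated as "not annotated" rather than invalid.
        if value and value not in allowed:
            invalid[field] = entry[field]
    return invalid


sample = {
    "Form": "text",
    "Access": "Free",
    "Venue Type": "conference",
    "Tokenized": "No",
    "Test Split": "maybe",  # deliberately wrong, to show a failed check
}
print(find_invalid_fields(sample))
```

Only the fields with a closed list of values are checked here; free-text attributes such as Description or Notes are left alone.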

Access Data

You can access the annotated dataset using the Hugging Face datasets library:

from datasets import load_dataset
masader = load_dataset('arbml/masader')
masader['train'][0]

which gives the following output

{'Abstract': 'Modern Standard Arabic (MSA) is the official language used in education and media across the Arab world both in writing and formal speech. However, in daily communication several dialects depending on the country, region as well as other social factors, are used. With the emergence of social media, the dialectal amount of data on the Internet have increased and the NLP tools that support MSA are not well-suited to process this data due to the difference between the dialects and MSA. In this paper, we construct the Shami corpus, the first Levantine Dialect Corpus (SDC) covering data from the four dialects spoken in Palestine, Jordan, Lebanon and Syria. We also describe rules for pre-processing without affecting the meaning so that it is processable by NLP tools. We choose Dialect Identification as the task to evaluate SDC and compare it with two other corpora. In this respect, experiments are conducted using different parameters based on n-gram models and Naive Bayes classifiers. SDC is larger than the existing corpora in terms of size, words and vocabularies. In addition, we use the performance on the Language Identification task to exemplify the similarities and differences in the individual dialects.',
 'Access': 'Free',
 'Added By': '',
 'Affiliations': ',The Islamic University of Gaza,,',
 'Authors': 'Chatrine Qwaider,Motaz Saad,S. Chatzikyriakidis,Simon Dobnik',
 'Citations': '25.0',
 'Collection Style': 'crawling,annotation',
 'Cost': '',
 'Derived From': '',
 'Description': 'the first Levantine Dialect Corpus (SDC) covering data from the four dialects spoken in Palestine, Jordan, Lebanon and Syria.',
 'Dialect': 'Levant',
 'Domain': 'social media',
 'Ethical Risks': 'Medium',
 'Form': 'text',
 'Host': 'GitHub',
 'Language': 'ar',
 'License': 'Apache-2.0',
 'Link': 'https://github.com/GU-CLASP/shami-corpus',
 'Name': 'Shami',
 'Paper Link': 'https://aclanthology.org/L18-1576.pdf',
 'Paper Title': 'Shami: A Corpus of Levantine Arabic Dialects',
 'Provider': 'Multiple institutions ',
 'Script': 'Arab',
 'Subsets': [{'Dialect': 'Jordan',
   'Name': 'Jordanian',
   'Unit': 'sentences',
   'Volume': '32,078'},
  {'Dialect': 'ar-PS: (Arabic (Palestinian Territories))',
   'Name': 'Palestanian',
   'Unit': 'sentences',
   'Volume': '21,264'},
  {'Dialect': 'Syria',
   'Name': 'Syrian',
   'Unit': 'sentences',
   'Volume': '48,159'},
  {'Dialect': 'Lebanon',
   'Name': 'Lebanese',
   'Unit': 'sentences',
   'Volume': '16,304'}],
 'Tasks': 'dialect identification',
 'Test Split': 'No',
 'Tokenized': 'No',
 'Unit': 'sentences',
 'Venue Name': 'International Conference on Language Resources and Evaluation',
 'Venue Title': 'LREC',
 'Venue Type': 'conference',
 'Volume': '117,805',
 'Year': 2018}
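
Note that Volume is stored as a string with thousands separators, so it has to be converted before doing arithmetic. The sketch below works on an excerpt of the record above, copied by hand so that it runs without downloading anything; `parse_volume` is a hypothetical helper, and the assumption that every volume is a plain number with optional commas may not hold for all entries in the catalogue.

```python
def parse_volume(text):
    """Convert a volume string such as '32,078' to a number.

    Assumes a plain number with optional thousands separators; float is
    used because other numeric fields in the record (e.g. Citations)
    are stored as decimals.
    """
    return float(text.replace(",", ""))


# Excerpt of the record printed above, reduced to the fields used here.
record = {
    "Name": "Shami",
    "Volume": "117,805",
    "Unit": "sentences",
    "Subsets": [
        {"Name": "Jordanian", "Volume": "32,078", "Unit": "sentences"},
        {"Name": "Palestanian", "Volume": "21,264", "Unit": "sentences"},
        {"Name": "Syrian", "Volume": "48,159", "Unit": "sentences"},
        {"Name": "Lebanese", "Volume": "16,304", "Unit": "sentences"},
    ],
}

# Sum the subset volumes and compare with the dataset-level volume.
subset_total = sum(parse_volume(s["Volume"]) for s in record["Subsets"])
print(subset_total, subset_total == parse_volume(record["Volume"]))
```

For this record the four subset volumes add up to the dataset-level Volume, which makes the comparison a simple consistency check on an entry.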

Running Masader locally with Jekyll

Prerequisites:

Install Ruby.
Install bundle.
Install Jekyll.

Steps:

Open the project in the terminal
Run bundle install to install dependencies.
Run the site locally with bundle exec jekyll serve.
Preview the Masader site in your web browser by navigating to http://127.0.0.1:4000/masader/.

Note: Navigate to the publishing source for MASADER site. For more information about publishing sources, see.

Web Service

Masader depends on a set of end points provided by our web service.

Contribution

The catalogue will be updated regularly. If you want to add a new dataset, use this form.

To contribute to the project development, please see the contributing instructions.

Collaborative Work

Masader was developed in 2021 as part of the BigScience project for open research 🌸, a year-long initiative targeting the study of large language models and datasets. In 2022, Masader was further developed by the arbml team and community.

Citation

If you use Masader in research, please cite the following papers.

@misc{alyafeai2021masader,
      title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources},
      author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
      year={2021},
      eprint={2110.06744},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{altaher2022masader,
      title={Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets},
      author={Altaher, Yousef and Fadel, Ali and Alotaibi, Mazen and Alyazidi, and others},
      journal={arXiv preprint arXiv:2208.00932},
      year={2022}
}
