AlbNLP

This repository is a list of resources for conducting NLP (Natural Language Processing) research in the Albanian language. Each resource linked here has already been published elsewhere. The list is not exhaustive, since new corpora and other resources are released from time to time. Anyone who knows of public NLP resources for Albanian is therefore welcome to add them here with a pull request.

AlbNews

A topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals and one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. The unlabeled samples contain the headline text only.
Related paper and code
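
As a rough illustration of how the four label codes can be handled, here is a minimal Python sketch. It assumes the labeled samples have already been read into (headline, label) pairs; the function name `label_distribution` and the placeholder rows are hypothetical and not part of the corpus or its accompanying code.

```python
from collections import Counter

# Label codes and their meanings, as listed in the corpus description.
LABEL_NAMES = {"pol": "politics", "cul": "culture", "eco": "economy", "spo": "sport"}


def label_distribution(samples):
    """Count samples per topic name; raise on an unknown label code."""
    counts = Counter()
    for _text, code in samples:
        if code not in LABEL_NAMES:
            raise ValueError(f"unknown label code: {code!r}")
        counts[LABEL_NAMES[code]] += 1
    return dict(counts)


# Placeholder rows for illustration only, not taken from the corpus.
toy = [("<headline 1>", "pol"), ("<headline 2>", "spo"), ("<headline 3>", "pol")]
print(label_distribution(toy))
```

Rejecting unknown codes early makes it easier to spot rows where the label field was misread during loading.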

AlbNER

A Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled following the CoNLL-2003 shared task annotation scheme (explained at https://aclanthology.org/W03-0419.pdf), which uses the tags I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O.
Related paper
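
To show how the tags above map to entity spans, here is a small Python sketch that groups tagged tokens into entities. It assumes a sentence is available as two parallel lists of tokens and tags; the function name `extract_entities` and the example sentence are hypothetical (the sentence is hand-written, not taken from the corpus). The sketch starts a new span on any B- tag, and also on an I- tag that does not continue an open span of the same type, so it should tolerate both ways the B- prefix is commonly used.

```python
def extract_entities(tokens, tags):
    """Group tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # A span continues only on an I- tag of the same type as the open span.
        continues = prefix == "I" and etype == current_type
        if current_type is not None and not continues:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            if current_type is None:
                current_type = etype
            current_tokens.append(token)
    # Flush a span that runs to the end of the sentence.
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-written example sentence, not taken from the corpus.
tokens = ["Ismail", "Kadare", "lindi", "në", "Gjirokastër", "."]
tags = ["I-PER", "I-PER", "O", "O", "I-LOC", "O"]
print(extract_entities(tokens, tags))
```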

AlbMoRe

A sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated into Albanian, along with a label of 0 (negative) or 1 (positive). The corpus is fully balanced, consisting of 400 positive and 400 negative reviews about 67 movies of different genres.
Related paper and code
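
A quick way to check the class balance described above is sketched below in Python, using only the standard library. The column names `text` and `label` are assumptions, since the CSV header is not specified here, and the in-memory rows are placeholders rather than real reviews; adjust `label_column` to match the actual file.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, label_column="label"):
    """Count negative (0) and positive (1) records in an open CSV file."""
    counts = Counter(int(row[label_column]) for row in csv.DictReader(csv_file))
    return {"negative": counts[0], "positive": counts[1]}


# Placeholder rows standing in for the real file; column names are assumed.
sample = io.StringIO("text,label\n<review 1>,1\n<review 2>,0\n<review 3>,1\n")
print(class_balance(sample))
```

For the real corpus, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.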

SHAJ

An annotated Albanian dataset for hate speech and offensive speech that has been constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval.
Related paper

Social media comments in Albanian

A dataset of manually annotated social media comments in Albanian for sentiment analysis. It comprises comments collected from the official Facebook page of the National Institute of Public Health of Kosovo (NIPHK). These comments reflect the opinions of Kosovo citizens about the COVID-19 pandemic. The dataset contains a total of 10,132 comments along with 12 attributes.
Related paper

Autogenerated NER corpus + Albanian NE Gazetteer

Manually annotated Albanian NER corpora of news articles

Albanian News Articles Dataset