AlbNLP

This repository is a list of resources for conducting NLP (Natural Language Processing) research in the Albanian language. Each resource linked here has already been published elsewhere. The list is not exhaustive, since new corpora and other resources are released from time to time. For this reason, anyone who knows of public NLP resources for the Albanian language is welcome to add them here with a pull request.


A topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample includes a headline text retrieved from Albanian online news portals and one of the four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. The unlabeled samples contain the headline text only.
Related paper and code


A Named Entity Recognition corpus of Wikipedia sentences in Albanian, consisting of 900 records. The sentence tokens are manually labeled following the CoNLL-2003 shared task annotation scheme (explained at https://aclanthology.org/W03-0419.pdf), which uses the I-ORG, B-ORG, I-PER, B-PER, I-LOC, B-LOC, I-MISC, B-MISC and O tags.
Related paper
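
To illustrate how the tags listed above mark entities, here is a minimal sketch that groups tokens into entity spans. It assumes the tokens and tags of a sentence have already been read into two parallel lists; the function name `extract_entities` and the example sentence are illustrative and are not taken from the corpus or its code.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (entity_text, entity_type) pairs from I-/B-/O tags."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, ent_type = "O", None
        else:
            prefix, ent_type = tag.split("-", 1)

        # A span starts at B-X, or at I-X when no span of type X is open.
        starts_new = ent_type is not None and (prefix == "B" or ent_type != current_type)

        # Close the open span when we leave it or when a new one begins.
        if current_tokens and (ent_type is None or starts_new):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []

        if ent_type is None:
            current_type = None
        else:
            current_tokens.append(token)
            current_type = ent_type

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative sentence, not a record from the corpus.
tokens = ["Ismail", "Kadare", "lindi", "në", "Gjirokastër", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
entities = extract_entities(tokens, tags)
```

The sketch accepts both conventions in which a span may open with either an I- or a B- tag, so it does not depend on which variant the corpus files use.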


A sentiment analysis corpus of movie reviews in Albanian, consisting of 800 records in CSV format. Each record includes a text review retrieved from IMDb and translated into Albanian, together with a label of 0 (negative) or 1 (positive). The corpus is fully balanced, consisting of 400 positive and 400 negative reviews about 67 movies of different genres.
Related paper and code
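
As a rough sketch of how a CSV corpus with 0/1 labels could be read and checked for class balance, the snippet below uses only the Python standard library. The column names `text` and `label` and the two sample rows are assumptions made for illustration; check the header of the actual file before use.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="label"):
    """Count how many records carry each integer label."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


# Placeholder rows standing in for the real file; the column names are assumed.
sample = io.StringIO(
    "text,label\n"
    "Film i mrekullueshëm,1\n"
    "Film i mërzitshëm,0\n"
)

counts = label_counts(sample)
is_balanced = counts[0] == counts[1]
```

For the real corpus, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.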


An annotated Albanian dataset for hate speech and offensive speech that has been constructed from user-generated content on various social media platforms. Its annotation follows the hierarchical schema introduced in OffensEval.
Related paper


A dataset of manually annotated social media comments in the Albanian language for sentiment analysis. It comprises comments collected from the official Facebook page of the National Institute of Public Health of Kosovo (NIPHK). These comments reflect the opinions of Kosovo citizens about the COVID-19 pandemic. The dataset contains a total of 10132 comments along with 12 attributes.
Related paper


An Albanian named entity annotation corpus generated automatically (silver-standard) from Wikipedia and WikiData. It is offered in Apache OpenNLP annotation format. An NE gazetteer generated from Wikipedia is also included.
Related paper
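
In the Apache OpenNLP name finder format, each sentence sits on one line with whitespace-separated tokens, and entities are wrapped in `<START:type> ... <END>` markers. The following minimal sketch pulls the entities out of such a line with a regular expression. The function name, the example sentence, and the entity type names `person` and `location` are illustrative assumptions; the type names used in the corpus may differ.

```python
import re
from typing import List, Tuple

# Matches "<START:type> some tokens <END>" and captures the type and the tokens.
SPAN_PATTERN = re.compile(r"<START:([^>\s]+)>\s+(.*?)\s+<END>")


def parse_opennlp_line(line: str) -> List[Tuple[str, str]]:
    """Return (entity_text, entity_type) pairs found in one annotated line."""
    return [(m.group(2), m.group(1)) for m in SPAN_PATTERN.finditer(line)]


# Illustrative line, not a record from the corpus.
example = "<START:person> Ismail Kadare <END> lindi në <START:location> Gjirokastër <END> ."
entities = parse_opennlp_line(example)
```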


This is a small, manually annotated NER corpus of Albanian news articles. The generation approach and its evaluation are described in the related published article. The corpus is provided in Apache OpenNLP annotation format.
Related paper


This corpus contains more than 3 million news articles from various Albanian news sources.