
Some Faroese language statistics taken from the fo.wikipedia.org content dump.

Primary language: Python. License: MIT.

faroese-corpus

Faroese corpus taken from Wikipedia dumps.

This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.

pipenv

This project uses pipenv. See the pipenv documentation for how to install it.

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:

pipenv install
sudo apt install libarchive-dev


Scripts

Run pipenv shell before running any of the scripts.

words_from_dump.py

Shows the longest words taken from the dump:

1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
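
The ranking above can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the actual code of words_from_dump.py: the names longest_words and format_ranking are hypothetical, the input is assumed to be already-extracted plain text rather than the raw dump, and the alphabetical tie-breaking is an assumption (the script's own tie order is evidently different).

```python
import re
from typing import Iterable, List, Tuple

# Runs of Unicode letters only: word characters minus digits and underscore.
# Assumes precomposed (NFC) text, so letters like "ð" and "ø" stay inside words.
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(texts: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to `limit` unique lowercase words, longest first, with lengths."""
    unique = set()
    for text in texts:
        unique.update(match.group(0).lower() for match in WORD_RE.finditer(text))
    # Sort by descending length; break ties alphabetically (an assumption).
    ranked = sorted(unique, key=lambda word: (-len(word), word))
    return [(word, len(word)) for word in ranked[:limit]]


def format_ranking(ranking: List[Tuple[str, int]]) -> List[str]:
    """Format entries as 'rank word - length', matching the listing above."""
    return [f"{rank} {word} - {length}" for rank, (word, length) in enumerate(ranking, start=1)]


if __name__ == "__main__":
    sample = ["Samvinnufelagið er eitt felag.", "Eitt felag 2024!"]
    for line in format_ranking(longest_words(sample, limit=2)):
        print(line)
```

Collecting words into a set first keeps each word in the ranking only once, however often it occurs in the dump. As the first entry of the listing shows, foreign place names and album titles pass through unfiltered, so a real corpus would likely need an extra language filter.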