faroese-corpus

Faroese corpus taken from Wikipedia dumps.

This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.

`pipenv`

This project uses pipenv. See the pipenv documentation for installation instructions.

Dependencies

To read 7zip archives (used by Wikia's XML dumps), you need to install libarchive in addition to the Python dependencies:

pipenv install
sudo apt install libarchive-dev

Scripts

Run `pipenv shell` before running the scripts below.

`words_from_dump.py`

Shows the longest words taken from the dump:

1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
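
As a rough illustration of how such a ranking can be produced, here is a minimal sketch. It is not the actual code of `words_from_dump.py`: the function name `longest_words`, the tokenization rule (a word is any run of Unicode letters), and the lowercasing are all assumptions made for this example.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a "word" is a run of Unicode letters (no digits or underscores).
# The real script may tokenize the dump differently.
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(lines: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to `limit` unique lowercased words with their lengths, longest first."""
    words = set()
    for line in lines:
        words.update(word.lower() for word in WORD_RE.findall(line))
    # Sort by descending length; ties are broken alphabetically for stable output.
    ranked = sorted(words, key=lambda word: (-len(word), word))
    return [(word, len(word)) for word in ranked[:limit]]


if __name__ == "__main__":
    # Hypothetical sample input; in practice the lines would come from the dump.
    sample = [
        "Norðurlandameistarakappingini er ein kapping.",
        "Ein bók um Føroyar.",
    ]
    for position, (word, length) in enumerate(longest_words(sample, limit=3), start=1):
        print(f"{position} {word} - {length}")
```

Collecting words into a set keeps each word only once, so a frequently repeated long word does not crowd out the rest of the list.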
