faroese-corpus
Faroese corpus taken from Wikipedia dumps.
This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.
pipenv
This project uses pipenv. See the pipenv documentation for how to install it.
Dependencies
In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:
pipenv install
sudo apt install libarchive-dev
Links
Scripts
Run pipenv shell before running them.
words_from_dump.py
Shows the longest words taken from the dump:
1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
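
A ranking like the one above could be produced roughly as follows. This is only a minimal sketch, not the actual contents of words_from_dump.py: the function names longest_words and format_ranking are hypothetical, the input is assumed to be plain text already extracted from the dump, and the alphabetical tie-break is an assumption (the list above is not ordered that way within equal lengths).

```python
import re

# Runs of Unicode letters only: digits and underscores are excluded,
# while Faroese letters such as ð, ø and í are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(text, limit=10):
    """Return up to `limit` distinct lowercased words, longest first."""
    words = {match.group(0).lower() for match in WORD_RE.finditer(text)}
    # Sort by length (descending); break ties alphabetically so the
    # result is deterministic. The tie-break is an assumption.
    return sorted(words, key=lambda word: (-len(word), word))[:limit]


def format_ranking(words):
    """Format words as 'rank word - length' lines, like the listing above."""
    return [
        f"{rank} {word} - {len(word)}"
        for rank, word in enumerate(words, start=1)
    ]


if __name__ == "__main__":
    sample = "Føroyar eru oyggjar. Krabbameinsgranskingarstovnurin er í Havn, Havn."
    for line in format_ranking(longest_words(sample, limit=3)):
        print(line)
```

Collecting the words into a set before sorting means each word is ranked once, however often it occurs in the text.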