Tools regarding the morphology of the Finnish language.
I have used the Finnish wordlist by Kotus when creating the programs.
- File formats
- Programs interesting to the end user
- Programs less interesting to the end user
- Programs even less interesting to the end user
- Text files
All files in this project use UTF-8 character encoding and Unix newlines.
CSV file format used in this project:
- field separator: comma (,)
- no fields are quoted
- no empty fields
- types of fields: words (strings), declensions/conjugations (integers)
- lines in plurals.csv: two words (e.g. häät,hää)
- lines in all other CSV files: one word and zero or more declensions/conjugations (e.g. ahtaus,39,40)
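Because no field is quoted or empty, a word line can be parsed with a plain split. The following is a minimal sketch of that idea, not code from this project; the function name parse_word_line is hypothetical.

```python
def parse_word_line(line):
    """Parse one word line such as 'ahtaus,39,40' into (word, [39, 40])."""
    fields = line.rstrip("\n").split(",")
    word = fields[0]
    # Any remaining fields are declension/conjugation numbers (zero or more).
    classes = [int(field) for field in fields[1:]]
    return word, classes


print(parse_word_line("ahtaus,39,40"))  # ('ahtaus', [39, 40])
```

Lines of plurals.csv hold two words instead, so they would need a separate two-field split rather than this helper.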
Conjugate a Finnish verb. Arguments: VERB [MOOD TENSE VOICE [NUMBER [PERSON]]].
Moods: ind/con/pot/imp. Tenses: pre/pst/per. Voices: act/pss. Numbers: sg/pl.
Persons: 1/2/3. If only VERB is given, print all supported combinations.
Note: perfect tense and passive voice are not supported yet; only these combinations of mood/tense/voice are supported:
- indicative present active
- indicative past active
- conditional present active
- potential present active
- imperative present active
Examples:
$ python3 conjugate_verb.py "keritä" ind pre act sg 1
ind-pre-act-sg-1: kerkiän, keritsen
$ python3 conjugate_verb.py "soutaa"
ind-pre-act-sg-1: soudan
ind-pre-act-sg-2: soudat
ind-pre-act-sg-3: soutaa
ind-pre-act-pl-1: soudamme
ind-pre-act-pl-2: soudatte
ind-pre-act-pl-3: soutavat
ind-pst-act-sg-1: soudin, sousin
ind-pst-act-sg-2: soudit, sousit
ind-pst-act-sg-3: sousi, souti
ind-pst-act-pl-1: soudimme, sousimme
ind-pst-act-pl-2: souditte, sousitte
ind-pst-act-pl-3: sousivat, soutivat
con-pre-act-sg-1: soutaisin
con-pre-act-sg-2: soutaisit
con-pre-act-sg-3: soutaisi
con-pre-act-pl-1: soutaisimme
con-pre-act-pl-2: soutaisitte
con-pre-act-pl-3: soutaisivat
pot-pre-act-sg-1: soutanen
pot-pre-act-sg-2: soutanet
pot-pre-act-sg-3: soutanee
pot-pre-act-pl-1: soutanemme
pot-pre-act-pl-2: soutanette
pot-pre-act-pl-3: soutanevat
imp-pre-act-sg-2: souda
imp-pre-act-sg-3: soutakoon
imp-pre-act-pl-1: soutakaamme
imp-pre-act-pl-2: soutakaa
imp-pre-act-pl-3: soutakoot
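The keys in the output above follow a regular pattern. As a rough sketch (not taken from conjugate_verb.py; the names SUPPORTED and supported_keys are hypothetical), the supported combinations can be enumerated like this, assuming the imperative has no first person singular, as the sample output suggests:

```python
from itertools import product

# Mood/tense pairs listed as supported above; the voice is always active.
SUPPORTED = [("ind", "pre"), ("ind", "pst"), ("con", "pre"),
             ("pot", "pre"), ("imp", "pre")]


def supported_keys():
    """Return keys such as 'ind-pre-act-sg-1' in the order of the sample output."""
    keys = []
    for mood, tense in SUPPORTED:
        for number, person in product(("sg", "pl"), ("1", "2", "3")):
            # The sample output has no first person singular imperative.
            if mood == "imp" and (number, person) == ("sg", "1"):
                continue
            keys.append(f"{mood}-{tense}-act-{number}-{person}")
    return keys


print(len(supported_keys()))  # 29
```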
Decline a Finnish noun. Arguments: NOUN [CASE NUMBER]. Cases: nom, gen, par,
ess, tra, ine, ela, ill, ade, abl, all, abe, ins. Numbers: sg, pl. If case and
number are omitted, print all supported combinations.
Note: instructive singular and comitative are not supported.
Examples:
$ python3 decline_noun.py "kuusi" gen sg
GenSg: kuuden, kuusen
$ python3 decline_noun.py "kuusi"
NomSg: kuusi
NomPl: kuudet, kuuset
GenSg: kuuden, kuusen
GenPl: kuusien, kuusten, kuutten
ParSg: kuusta, kuutta
ParPl: kuusia
EssSg: kuusena, kuutena
EssPl: kuusina
TraSg: kuudeksi, kuuseksi
TraPl: kuusiksi
IneSg: kuudessa, kuusessa
InePl: kuusissa
ElaSg: kuudesta, kuusesta
ElaPl: kuusista
IllSg: kuuseen, kuuteen
IllPl: kuusiin
AdeSg: kuudella, kuusella
AdePl: kuusilla
AblSg: kuudelta, kuuselta
AblPl: kuusilta
AllSg: kuudelle, kuuselle
AllPl: kuusille
AbeSg: kuudetta, kuusetta
AbePl: kuusitta
InsPl: kuusin
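The output format shown above (a key, a colon, then comma-separated forms) is simple to consume from another script. Here is a small sketch, assuming the format stays exactly as in the examples; parse_forms is a hypothetical helper, not part of this project.

```python
def parse_forms(output):
    """Map keys such as 'GenSg' to lists of inflected forms."""
    forms = {}
    for line in output.splitlines():
        if ": " not in line:
            continue  # skip blank or unexpected lines
        key, _, values = line.partition(": ")
        forms[key] = values.split(", ")
    return forms


sample = "GenSg: kuuden, kuusen\nParPl: kuusia"
print(parse_forms(sample))
# {'GenSg': ['kuuden', 'kuusen'], 'ParPl': ['kuusia']}
```

The same helper should also work on the output of conjugate_verb.py, since its lines have the same shape.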
Find partially homonymous inflected nouns and verbs. Slow. partial-homonyms.txt was generated with this program.
Print lemma and inflected forms of nouns and verbs with specified length. Arguments: minimumLength maximumLength
Example:
$ python3 inflected_by_length.py 22 22
kansalaisuudettomuudet
suunnittelemattomuudet
...
Argument: a Finnish noun (including adjectives/pronouns/numerals, excluding compounds) in nominative singular. Print the Kotus declension(s) (1-49) and whether consonant gradation applies.
Example:
$ python3 noun_consgrad.py "kuusi"
Declension 24 (like "un|i, -en, -ien/-ten, -ta, -ia, -een, -iin") without
consonant gradation
Declension 27 (like "kä|si, -den, -sien/-tten, -ttä, -siä, -teen, -siin")
without consonant gradation
Needs noundecl.py and countsyll.py. Can be tested with test-nounverb.py.
Argument: a Finnish noun (including adjectives/pronouns/numerals, excluding compounds) in nominative singular. Print the Kotus declension(s) (1-49).
Example:
$ python3 noundecl.py "kuusi"
Declension 24 (like "un|i, -en, -ien/-ten, -ta, -ia, -een, -iin")
Declension 27 (like "kä|si, -den, -sien/-tten, -ttä, -siä, -teen, -siin")
Needs countsyll.py. Can be tested with test-nounverb.py.
Argument: a Finnish verb (not a compound) in the infinitive. Print the Kotus conjugation(s) (52-76) and whether consonant gradation applies.
Example:
$ python3 verb_consgrad.py "keritä"
Conjugation 69 (like "vali|ta, -tsen, -tsi, -tsisi, -tkoon, -nnut, -ttiin")
without consonant gradation
Conjugation 75 (like "selvi|tä, -än, -si, -äisi, -tköön, -nnyt, -ttiin") with
consonant gradation
Needs verbconj.py. Can be tested with test-nounverb.py.
Argument: a Finnish verb (not a compound) in the infinitive. Print the Kotus conjugation(s) (52-76).
Example:
$ python3 verbconj.py "keritä"
Conjugation 69 (like "vali|ta, -tsen, -tsi, -tsisi, -tkoon, -nnut, -ttiin")
Conjugation 75 (like "selvi|tä, -än, -si, -äisi, -tköön, -nnyt, -ttiin")
Needs countsyll.py. Can be tested with test-nounverb.py.
Count the number of syllables in a Finnish word. Argument: word
Example:
$ python3 countsyll.py "liioitella"
Syllables: 4 or more, or the word is unknown
Split a Finnish compound. Argument: compound to split.
Example:
$ python3 splitcomp.py "ylivoimamaali"
yli_voima_maali
Needs generated-lists/nonfinals.txt and generated-lists/finals.csv, which can be generated with extract.sh.
TODO: make the program more space efficient (those word lists are more than a hundred kilobytes together).
Converts the Kotus XML file (link above) into CSV files that are needed by the other programs. Warning: overwrites the files listed below.
Note: before running this script, extract the Kotus XML file to the same directory as this project.
Creates the subdirectory generated-lists/ and generates these files under it:
- words-orig.csv: the original words (no leading/trailing apostrophes/hyphens/spaces) (~94,000 words)
- words.csv: words without plurals or compounds but with singular forms of plurals and finals of compounds (~41,000 words)
- nouns.csv: nouns (Kotus declensions 1–49) from words.csv (~26,000 words)
  - nouns-1syll.csv: monosyllabic nouns
  - nouns-2syll.csv: disyllabic nouns
  - nouns-3syll.csv: trisyllabic nouns
  - nouns-4syll.csv: quadrisyllabic and longer nouns
- verbs.csv: verbs (Kotus conjugations 52–76) from words.csv (~9,400 words)
  - verbs-1syll.csv: monosyllabic verbs
  - verbs-2syll.csv: disyllabic verbs
  - verbs-3syll.csv: trisyllabic verbs
  - verbs-4syll.csv: quadrisyllabic and longer verbs
- words-consgrad.csv: like words.csv but only the words to which consonant gradation applies (~11,000 words)
- finals.csv: words that occur as final parts of compounds (and possibly non-finally or alone) (~8,400 words)
- nonfinals.txt: words that occur as non-final parts of compounds (not finally but possibly alone) (~5,300 words)
- compositives.txt: words that occur as non-final parts of compounds (not finally or alone) (~2,900 words)
Also generates stats-nounverb.txt under the current directory (see text files).
Test conjugate_verb.py. No arguments.
Test decline_noun.py. No arguments.
Argument: which program to test ('n'=noundecl.py, 'v'=verbconj.py, 'ng'=noun_consgrad.py, 'vg'=verb_consgrad.py). Needs files created by extract.sh.
Test splitcomp.py against known single words and compounds. Requires generated-lists/words.csv, which can be generated with extract.sh.
These are only meant to be used by extract.sh.
Read Kotus XML file, print distinct words and their declensions/conjugations (0-2) in CSV format. Arguments: XML file, which words ('a' = all, 'g' = only those that consonant gradation applies to).
Get words that occur as finals of compounds. Print them and their declensions/conjugations in CSV format. Arguments: wordCsvFile compoundListFile
Arguments: one or more CSV files. For each distinct word, print a CSV line with all declensions/conjugations occurring with that word in the files.
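The merging step described above could look roughly like the sketch below. It is not the project's script: merge_lines is a hypothetical name, and sorting the words and numbers in the output is an assumption.

```python
def merge_lines(lines):
    """Merge CSV lines so each distinct word appears once with all its numbers."""
    merged = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, *classes = line.strip().split(",")
        merged.setdefault(word, set()).update(int(c) for c in classes)
    # Assumption: output is sorted by word, with numbers in ascending order.
    return [",".join([word] + [str(c) for c in sorted(merged[word])])
            for word in sorted(merged)]


print(merge_lines(["kuusi,24", "kuusi,27", "ahtaus,39,40"]))
# ['ahtaus,39,40', 'kuusi,24,27']
```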
Arguments: CSV file with words and declensions/conjugations, CSV file with plurals and singulars. Print words and declensions/conjugations in CSV format, with plurals replaced with singulars.
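As an illustration of the replacement described above, here is a short sketch that works on lists of lines instead of files. The name replace_plurals is hypothetical, and the real script may treat duplicates or ordering differently.

```python
def replace_plurals(word_lines, plural_lines):
    """Replace plural words with their singulars, keeping the numbers as they are."""
    # Each plurals line holds exactly two words: plural,singular.
    singular_of = dict(line.strip().split(",") for line in plural_lines)
    result = []
    for line in word_lines:
        word, *classes = line.strip().split(",")
        result.append(",".join([singular_of.get(word, word)] + classes))
    return result


print(replace_plurals(["häät", "ahtaus,39,40"], ["häät,hää", "sakset,saksi"]))
# ['hää', 'ahtaus,39,40']
```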
Arguments: CSV file with words and declensions/conjugations, list file with compounds. Print CSV lines without those that contain a compound.
Arguments: CSV file with words and declensions/conjugations, first declension/conjugation, last declension/conjugation. Print lines that contain declensions/conjugations within that range.
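A range filter of this kind can be sketched as follows. This is only an illustration: filter_by_class is a hypothetical name, and it assumes that both ends of the range are inclusive and that a matching line is kept whole.

```python
def filter_by_class(lines, first, last):
    """Keep lines with at least one declension/conjugation in first..last."""
    kept = []
    for line in lines:
        _, *classes = line.strip().split(",")
        if any(first <= int(c) <= last for c in classes):
            kept.append(line.strip())
    return kept


lines = ["kuusi,24,27", "keritä,69,75", "ahtaus,39,40"]
print(filter_by_class(lines, 1, 49))   # ['kuusi,24,27', 'ahtaus,39,40']
print(filter_by_class(lines, 52, 76))  # ['keritä,69,75']
```

The two calls use the noun range (1-49) and the verb range (52-76) mentioned elsewhere in this document.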
Arguments: CSV file, syllable count (1-4; 4=4 or more). Print lines containing a word with that many syllables.
Print words that only occur as non-final parts of compounds (not final). Argument: compound list file
Print words that only occur as non-final parts of compounds (not final or alone). Arguments: compound list file, word CSV file
Print a table of noun/verb counts by declension/conjugation, syllable count and ending. Argument: CSV file with words (no compounds).
A list of compounds on the Kotus word list. Creating this list involved a lot of manual work.
Notes:
- One compound per line.
- The individual words of each compound have been separated by underscores (_), e.g. yli_oppilas_tutkinto_lauta_kunta.
- No other character denotes a word boundary inside a compound; for example, these compounds are only two individual words each:
jok'_ainoa
valo-_oppi
suomen _kieli
vaa'an_kieli
tax-free-_myynti
all stars -_joukkue
- Includes "plural only" words (e.g. hopea_häät).
- Examples of words/prefixes/suffixes of Latin/Greek origin I didn't consider separate words:
- prefixes: a-, di-, dis-, in-, inter-, iso-, multi-, poly-, post-, pre-, re-, sub-, syn-, tri-
- suffixes: -grafi(nen/a), -grammi, -kroninen, -logi(nen/a), -metri(nen/a) ("metri" as a device, not as a unit), -paatti(nen)/-patia, -skooppi(nen)/-skopia
- Tip: to restore a compound to its original form, simply delete all underscores.
- Tip: to split a compound properly:
- first split by underscores…
- …then strip leading/trailing apostrophes/hyphens/spaces ('- ) from each individual word
- e.g. all stars -_joukkue becomes "all stars" and "joukkue"
- The GPLv3 license does not apply to this file (I think) because it is largely based on the Kotus wordlist.
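The two tips above translate almost directly into Python. This is a minimal sketch; the function names split_compound and restore_compound are hypothetical and do not come from this project.

```python
def split_compound(marked):
    """Split at underscores, then strip apostrophes, hyphens and spaces from each part."""
    return [part.strip("'- ") for part in marked.split("_")]


def restore_compound(marked):
    """Recover the original spelling by deleting all underscores."""
    return marked.replace("_", "")


print(split_compound("all stars -_joukkue"))    # ['all stars', 'joukkue']
print(split_compound("vaa'an_kieli"))           # ["vaa'an", 'kieli']
print(restore_compound("all stars -_joukkue"))  # all stars -joukkue
```

Only leading and trailing characters are stripped, so the apostrophe inside vaa'an is kept.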
A list of partially homonymous inflected nouns and verbs. Automatically generated with find-partial-homonyms.py.
A list of "plural only" words on the Kotus list.
Notes:
- Two fields on each line: a word in plural and its singular form (e.g. sakset,saksi).
- No compounds (e.g. seppeleensitojaiset).
- Includes words that only occur as the final part of a compound, not alone (e.g. sitojaiset).
A table of noun/verb counts by declension/conjugation, syllable count and ending. Automatically generated.