magyarlanc

magyarlanc is a toolkit for the basic linguistic processing of Hungarian texts. It consists solely of Java modules (there are no wrappers for other programming languages), which makes it platform independent and allows it to be integrated into larger systems (e.g. web servers).


How to use from the command line

Parameters:

    mode

    It defines the process(es) to be executed. Possible values are:
        morphparse (segmentation and POS-tagging)
        depparse (segmentation, POS-tagging and dependency parsing)
        constparse (segmentation, POS-tagging and constituency parsing)
        morana (possible morphological analyses of a given word)
        gui (graphical user interface)

    input

    It defines the input file on which the process will be executed. The input file must be a plain-text (.txt) file containing running (raw) text.

    output

    It defines the output file in which the analysis will be saved.
        In the case of morphparse, one line corresponds to one token and sentences are separated by an empty line. The columns are, in order: the wordform, the lemma and the MSD code.
        In the case of depparse, one line corresponds to one token and sentences are separated by an empty line. The columns are, in order: the identifier of the word within the sentence, the wordform, the lemma, the MSD code, the part of speech, the morphological features, the identifier of the parent node and the dependency label.
        In the case of constparse, one line corresponds to one token and sentences are separated by an empty line. The columns are, in order: the identifier of the word within the sentence, the wordform, the lemma, the MSD code, the part of speech, the morphological features and the syntactic label.
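
The depparse layout described above can be read back into sentences with a few lines of Java. The following is only a minimal sketch: the class and field names (`DepParseReader`, `Token`) are hypothetical and not part of magyarlanc, and it assumes that columns are separated by a tab character, which this description does not state. All columns are kept as strings so that no assumption is made about the form of the identifiers.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: groups depparse output lines into sentences. */
class DepParseReader {

    /** One token row; the fields follow the eight columns in order. */
    static final class Token {
        final String id, form, lemma, msd, pos, feats, head, label;

        Token(String[] c) {
            id = c[0];
            form = c[1];
            lemma = c[2];
            msd = c[3];
            pos = c[4];
            feats = c[5];
            head = c[6];
            label = c[7];
        }
    }

    /** Splits the lines into sentences at empty lines. */
    static List<List<Token>> read(List<String> lines) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // An empty line closes the current sentence.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            // ASSUMPTION: columns are tab-separated.
            String[] cols = line.split("\t", -1);
            if (cols.length < 8) {
                throw new IllegalArgumentException("Expected 8 columns: " + line);
            }
            current.add(new Token(cols));
        }
        // The last sentence may not be followed by an empty line.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

The same loop works for morphparse and constparse output if the expected column count and the field list are adjusted to three and seven columns respectively.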

    encoding

    This is an optional parameter, with which the character encoding of the input and output files can be defined. By default, UTF-8 is used.
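
Because the same encoding applies to both the input and the output file, a program that reads the output afterwards should decode it with the charset that was passed on the command line. A minimal sketch, assuming the encoding name is one that the Java runtime recognizes; `OutputLoader` and `readLines` are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Sketch only: reads an output file with the encoding given on the command line. */
class OutputLoader {

    /** Reads all lines with the named charset, or UTF-8 when encodingName is null. */
    static List<String> readLines(Path file, String encodingName) throws IOException {
        Charset cs = (encodingName == null)
                ? StandardCharsets.UTF_8          // the documented default
                : Charset.forName(encodingName);  // e.g. "ISO-8859-2"
        return Files.readAllLines(file, cs);
    }
}
```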

    spelling

    In the case of morana, this defines the word to be analyzed by the morphological analyzer.

Examples:

    java -Xmx1G -jar magyarlanc-3.0.jar -mode morphparse -input in.txt -output out.txt

    java -Xmx2G -jar magyarlanc-3.0.jar -mode depparse -input in.txt -output out.txt -encoding ISO-8859-2

    java -Xmx2G -jar magyarlanc-3.0.jar -mode gui

    java -Xmx2G -jar magyarlanc-3.0.jar -mode morana -spelling almáknak
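
The same commands can also be launched from a Java program as a child process. The sketch below only wraps the command line shown above and does not use any internal magyarlanc API; `MagyarlancRunner` and its methods are hypothetical names, and the `-Xmx2G` heap setting is simply copied from the examples.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: builds and runs the command line shown in the examples. */
class MagyarlancRunner {

    /** Builds the argument list; pass null as encoding to keep the default (UTF-8). */
    static List<String> buildCommand(String jarPath, String mode,
                                     String input, String output, String encoding) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx2G");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-mode");
        cmd.add(mode);
        cmd.add("-input");
        cmd.add(input);
        cmd.add("-output");
        cmd.add(output);
        if (encoding != null) {
            cmd.add("-encoding");
            cmd.add(encoding);
        }
        return cmd;
    }

    /** Starts the process, forwards its console output, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }
}
```

For instance, `MagyarlancRunner.run(MagyarlancRunner.buildCommand("magyarlanc-3.0.jar", "depparse", "in.txt", "out.txt", "ISO-8859-2"))` would start the second example command, provided the jar file is in the working directory.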
