/thcg

Derive (minimally labeled) dependency trees from the Thai CG Bank

Primary language: Ruby. License: zlib.

THCG2DEP implements a method for deriving (minimally labeled) dependency trees from the Thai CG Bank (Ruangrajitpakorn & Supnithi 2010). It uses a lexical dictionary to assign dependency directions to the CG types associated with the grammatical entities in the CG Bank, and falls back to a generic CG->CDG mapping for out-of-dictionary words.

It can optionally use distributional (Brown) clusters, obtained through unsupervised processing of large amounts of word-segmented raw text, in place of POS tags.
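The dictionary-first lookup with a generic fallback can be illustrated with a minimal sketch. This is not the converter's actual code: the `LEXICON` and `GENERIC` tables and the `cdg_for` helper are hypothetical names, the in-memory layout is an assumption (the file formats of the dictionary and map files are not described here), and the directed types chosen for each entry are arbitrary picks for illustration.

```ruby
# Hypothetical in-memory tables (assumed layout, not the real file formats).
# Keys of LEXICON are [word form, CG type]; values are directed CDG types.
LEXICON = {
  ['กับ', 'np\np/np'] => 'np\<np/>np'
}.freeze

# Generic CG -> CDG mapping, used only when the word is not in the lexicon.
GENERIC = {
  's/s' => 's/>s',
  's\s' => 's\<s'
}.freeze

# Look the (form, CG type) pair up in the lexicon first; fall back to the
# generic map, and return nil when neither table has an entry.
def cdg_for(form, cg, lexicon = LEXICON, generic = GENERIC)
  lexicon.fetch([form, cg]) { generic[cg] }
end

puts cdg_for('กับ', 'np\np/np')   # found in the lexicon
puts cdg_for('unseen', 's/s')     # resolved by the generic fallback
```

Keeping the two tables separate mirrors the two input files given to the converter: word-specific entries take priority, and the generic map only covers what the dictionary misses.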

Requirements

A recent Ruby interpreter is required, as well as the following Ruby packages:

* commander (for command-line interface)
* conll (for generating dependency treebank files)

Usage

Treebank conversion

To extract dependency trees from a CG treebank, use the following command:

% ruby lib/thcg_converter.rb data/dict/merge.CDG data/map/CDG-CG.txt data/cg/sample.txt > data/conll/sample.conll

Explanation:

* data/dict/merge.CDG      - merged CDG lexical dictionary 
* data/map/CDG-CG.txt      - generic CG->CDG mapping
* data/cg/sample.txt       - source CG bank
* data/conll/sample.conll  - target file for dependency treebank (CONLL format)

Sample output:

Dictionary file: data/dict/merge.CDG
Building dictionary from data/dict/merge.CDG
38250 forms, 2 CDGS/form on average
Building CG->CDG map
Ambiguous CG->CDG mapping for น่า: (s\np)/(s\np) could be either (s\<np)/>(s\<np) or (s\<np)/<(s\<np)
Ambiguous CG->CDG mapping for ตอนนั้น: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for กับ: np\np/np could be either np\<np/>np or np\>np/>np
Ambiguous CG->CDG mapping for ตอนนี้: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for ขณะนี้: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping for ขณะนั้น: s/s could be either s/>s or s/<s
Map file: data/map/CDG-CG.txt
Ambiguous CG->CDG mapping: np\np/np could be either np\>np/>np or np\<np/>np
Ambiguous CG->CDG mapping: s/s could be either s/>s or s/<s
Ambiguous CG->CDG mapping: s\s could be either s\>s or s\<s
Ambiguous CG->CDG mapping: (s\np)/(s\np) could be either (s\<np)/>(s\<np) or (s\<np)/<(s\<np)
Ambiguous CG->CDG mapping: s/(s\np) could be either s/>(s\<np) or s/<(s\<np)
Ambiguous CG->CDG mapping: s/(s\np)/np could be either s/>(s\<np)/>np or s/<(s\<np)/>np
Ambiguous CG->CDG mapping: s\(s\np) could be either s\<(s\<np) or s\>(s\<np)
90 CG->CDG mapping(s)
Treebank file: data/cg/sample.txt
Parsing 10 lines...
done
Processing 5 sentences...
No mapping of ((s\np)\(s\np))/np for 'ใน' - falling back on CD->CDG map
No mapping of (np\np)\(np\np) for 'ๆ' - falling back on CD->CDG map
done

Please note that CG types remain in the resulting CONLL file. Strip them out before using the dependency treebank in experiments or real applications, because CG types cannot be reliably obtained automatically for unseen text.

Word cluster augmentation

To add unsupervised Brown clusters to a dependency treebank, use the command:

% ruby lib/cluster_path_map.rb data/conll/sample.conll data/paths/best-all-spaces-c32-p1.out/paths > data/conll/sample-c32.conll

Explanation:

* data/conll/sample.conll      - input dependency treebank
* data/paths/best-all-spaces-c32-p1.out/paths  - output file from the wcluster utility (http://www.cs.berkeley.edu/~pliang/software/brown-cluster-1.2.zip)
* data/conll/sample-c32.conll  - output file for augmented treebank

A second paths file may be specified, to include clusters of different granularity as a second POS tag.
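For orientation, the paths file written by wcluster has one word per line in the form bit-string path, word and count, separated by tabs. The sketch below shows how such a file could be read into a word-to-cluster table. It is not the code in lib/cluster_path_map.rb; the helper name `load_cluster_paths` is hypothetical.

```ruby
require 'stringio'

# Read a wcluster "paths" file (path<TAB>word<TAB>count per line) from any
# IO-like object and return a Hash mapping each word to its cluster path.
def load_cluster_paths(io)
  clusters = {}
  io.each_line do |line|
    path, word, _count = line.chomp.split("\t")
    next if word.nil? # skip blank or malformed lines
    clusters[word] = path
  end
  clusters
end

# Usage with a small in-memory sample instead of a real paths file.
sample = StringIO.new("0110\tcat\t12\n0111\tdog\t7\n")
clusters = load_cluster_paths(sample)
puts clusters['cat']
```

With a real file, the same helper can be called as `File.open(path) { |f| load_cluster_paths(f) }`.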

Dependency parser training

To train a dependency parser, the output from the above steps should first be stripped of the CG types, shuffled (to compensate for clustering of similar sentence types) and then split into training and test parts, e.g. in a 9:1 ratio.
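The shuffle and split steps could be sketched as follows. This is an illustration rather than part of the repository: the helper names, the fixed seed and the default ratio are assumptions, and sentences are taken to be separated by blank lines as is usual for CONLL files. Stripping the CG types is left out because it depends on which column the converter writes them to, which is not stated here.

```ruby
# Split CONLL text into sentences (blocks of lines separated by blank lines).
def read_sentences(text)
  text.split(/\n{2,}/).map(&:chomp).reject(&:empty?)
end

# Shuffle sentences reproducibly and split them into [train, test] parts.
def shuffle_split(sentences, train_ratio: 0.9, seed: 42)
  shuffled = sentences.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_ratio).floor
  [shuffled[0...cut], shuffled[cut..-1]]
end

# Write sentences back out, each followed by a blank line.
def write_sentences(path, sentences)
  File.write(path, sentences.map { |s| s + "\n\n" }.join)
end

if __FILE__ == $PROGRAM_NAME && ARGV.size == 3
  # Hypothetical usage: ruby split.rb input.conll train.conll test.conll
  input, train_path, test_path = ARGV
  train, test = shuffle_split(read_sentences(File.read(input)))
  write_sentences(train_path, train)
  write_sentences(test_path, test)
end
```

A fixed seed keeps the split reproducible across runs, which makes parser scores comparable between experiments.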

Given a train part and a test part, the standard MaltParser may be trained and tested as follows:

% java -jar ~/Tools/malt-1.4.1/malt.jar -c malt_test -i data/train.conll -m learn
% java -jar ~/Tools/malt-1.4.1/malt.jar -c malt_test -i data/test.conll -o malt_test.out.conll -m parse
% eval07.pl -q -g data/test.conll -s malt_test.out.conll

eval07.pl is the standard evaluation script for the task, available for download at e.g. nextens.uvt.nl/depparse-wiki/SoftwarePage#eval07.pl