tshatrov/ichiran

JSON returned by ichiran-cli

I'm using ichiran-cli with the -f argument to give my scripts the full segmentation of a sentence. I'm writing my own little parser for the returned JSON, but I'm having trouble because it is deeply structured, and I don't know where to find everything or what exactly I should expect from it. Here is an example:

['itsuni', {'reading': '一に 【いつに】', 'text': '一に', 'kana': 'いつに', 'score': 128, 'seq': 1160930, 'gloss': [{'pos': '[adv]', 'gloss': 'solely; entirely; only; or'}], 'conj': []}, []]

Sometimes the 'gloss' list is empty, sometimes the 'conj' list is, and sometimes both are, e.g. for suffixes. I also don't know where to find the tense of a verb.
Where can I find an explanation of the structure of the JSON that is returned by ichiran-cli?

The gloss is available for root words only. It is a list of definitions; each definition is itself a dictionary with a part of speech (pos) and the definition text (gloss). Non-root words carry no such information of their own, but they have a conj property that lists the conjugations that led to the word. For example:

>ichiran-cli -f 食べた
[[[[["tabeta",{"reading":"\u98DF\u3079\u305F \u3010\u305F\u3079\u305F\u3011","text":"\u98DF\u3079\u305F","kana":"\u305F\u3079\u305F","score":336,"seq":10092434,"conj":[{"prop":[{"pos":"v1","type":"Past (~ta)"}],"reading":"\u98DF\u3079\u308B \u3010\u305F\u3079\u308B\u3011","gloss":[{"pos":"[v1,vt]","gloss":"to eat"},{"pos":"[vt,v1]","gloss":"to live on (e.g. a salary); to live off; to subsist on"}],"readok":true}]},[]]],336]]]

Each conjugation is itself a dictionary with a prop key describing the properties of the conjugation; in this case it is a v1 verb with the 'Past (~ta)' conjugation. The gloss of the root word from which the conjugation is derived is provided as well. (gloss is a property of the conjugation because, technically, the same word can sometimes be obtained by conjugating two different root words in different ways.)
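As a rough illustration of the description above, here is a minimal Python sketch that pulls the conjugation type and root gloss out of the output. The function names iter_words and conjugations are made up for this example and are not part of ichiran. Because the exact nesting depth of the outer lists isn't documented in this thread, the sketch searches recursively for [romanized, info, list] triples instead of assuming a fixed depth. The sample string is a trimmed copy of the 食べた output shown earlier.

```python
import json


def iter_words(node):
    """Yield (romanized, info) pairs found anywhere in the nested lists.

    Assumption: a word entry is a 3-element list whose first item is a
    string and whose second item is a dict, as in the sample output.
    """
    if isinstance(node, list):
        if len(node) == 3 and isinstance(node[0], str) and isinstance(node[1], dict):
            yield node[0], node[1]
        else:
            for child in node:
                yield from iter_words(child)


def conjugations(info):
    """Return one (root reading, conjugation types, root glosses) tuple per conj entry."""
    result = []
    for conj in info.get("conj", []):
        types = [prop.get("type") for prop in conj.get("prop", [])]
        glosses = [g.get("gloss") for g in conj.get("gloss", [])]
        result.append((conj.get("reading"), types, glosses))
    return result


# Trimmed copy of the `ichiran-cli -f 食べた` output from this thread.
sample = '[[[[["tabeta",{"text":"食べた","kana":"たべた","score":336,"conj":[{"prop":[{"pos":"v1","type":"Past (~ta)"}],"reading":"食べる 【たべる】","gloss":[{"pos":"[v1,vt]","gloss":"to eat"}],"readok":true}]},[]]],336]]]'

words = list(iter_words(json.loads(sample)))
romaji, info = words[0]
print(romaji, conjugations(info))
```

Using .get() with defaults everywhere means a root word with no conj key, or a conjugation with no gloss, simply produces empty lists instead of raising KeyError.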

Compound words are another type. They have a components attribute, which contains all the parts that make up the word. Most suffixes have a suffix attribute, which explains the meaning of the suffix in place of a gloss.

>ichiran-cli -f 食べたかった
[[[[["tabetakatta",{"reading":"\u98DF\u3079\u305F\u304B\u3063\u305F \u3010\u305F\u3079\u305F\u304B\u3063\u305F\u3011","text":"\u98DF\u3079\u305F\u304B\u3063\u305F","kana":"\u305F\u3079\u305F\u304B\u3063\u305F","score":546,"compound":["\u98DF\u3079","\u305F\u304B\u3063\u305F"],"components":[{"reading":"\u98DF\u3079 \u3010\u305F\u3079\u3011","text":"\u98DF\u3079","kana":"\u305F\u3079","score":0,"seq":10092474,"conj":[{"prop":[{"pos":"v1","type":"Continuative (~i)"}],"reading":"\u98DF\u3079\u308B \u3010\u305F\u3079\u308B\u3011","gloss":[{"pos":"[v1,vt]","gloss":"to eat"},{"pos":"[vt,v1]","gloss":"to live on (e.g. a salary); to live off; to subsist on"}],"readok":true}]},{"reading":"\u305F\u304B\u3063\u305F","text":"\u305F\u304B\u3063\u305F","kana":"\u305F\u304B\u3063\u305F","score":0,"seq":10445556,"suffix":"want to... / would like to...","conj":[{"prop":[{"pos":"adj-i","type":"Past (~ta)"}],"reading":"\u305F\u3044","gloss":[{"pos":"[suf,adj-i]","gloss":"very ...","info":"after a noun or the -masu stem of a verb; also \u3063\u305F\u3044"}],"readok":true}]}]},[]]],546]]]

Each part of a compound word might itself be a conjugation. For example, this one consists of 食べ, which is a conjugation of 食べる, and たかった, which is a conjugation of the suffix たい.
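A parser can treat compound and plain words uniformly by looking at components first and falling back to the word itself. The following is only a sketch under that assumption; leaf_parts and summarize are hypothetical helper names, and the recursion into nested components is a precaution, since the thread doesn't say whether components can nest. The compound dictionary is a trimmed version of the 食べたかった output above.

```python
def leaf_parts(info):
    """Yield the components of a compound word, or the word itself if it has none."""
    components = info.get("components")
    if components:
        for part in components:
            # Recurse in case a component is itself a compound (unconfirmed).
            yield from leaf_parts(part)
    else:
        yield info


def summarize(info):
    """Collect text, suffix meaning (if any) and conjugation types for each part."""
    rows = []
    for part in leaf_parts(info):
        types = [
            prop.get("type")
            for conj in part.get("conj", [])
            for prop in conj.get("prop", [])
        ]
        rows.append({"text": part.get("text"), "suffix": part.get("suffix"), "types": types})
    return rows


# Trimmed from the `ichiran-cli -f 食べたかった` output in this thread.
compound = {
    "text": "食べたかった",
    "compound": ["食べ", "たかった"],
    "components": [
        {"text": "食べ",
         "conj": [{"prop": [{"pos": "v1", "type": "Continuative (~i)"}],
                   "reading": "食べる 【たべる】"}]},
        {"text": "たかった",
         "suffix": "want to... / would like to...",
         "conj": [{"prop": [{"pos": "adj-i", "type": "Past (~ta)"}],
                   "reading": "たい"}]},
    ],
}

for row in summarize(compound):
    print(row)
```

For a suffix part, the suffix value is probably the more useful meaning to display, since the gloss attached to its conj entry describes the root entry and not the suffix usage.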

Thank you very much. That should cover most words one encounters regularly. Is there no reference where I can find every type of word and structure that could be returned? If there is none, consider the issue solved.

I've just tried to run ichiran-cli -f 食べた and the type value is empty:
['tabeta', {'reading': '食べた 【たべた】', 'text': '食べた', 'kana': 'たべた', 'score': 336, 'seq': 10172358, 'conj': [{'prop': [{'pos': 'v1', 'type': []}], 'reading': '食べる 【たべる】', 'gloss': [{'pos': '[v1,vt]', 'gloss': 'to eat'}, {'pos': '[vt,v1]', 'gloss': 'to live on (e.g. a salary); to live off; to subsist on'}], 'readok': True}]}, []]
Have I set something up wrong?

@blunderedbishop you're probably missing the conj.csv file from here, or the *jmdict-data* variable in settings.lisp is pointing to the wrong path.
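A script that consumes this output could guard against the same setup problem by checking whether any type value is something other than a non-empty string. This is a sketch only: props_missing_type is a hypothetical helper, and it assumes that an empty list in type always signals the missing conjugation data described here, which this thread suggests but doesn't guarantee.

```python
def props_missing_type(info):
    """Return every prop dict whose 'type' is not a non-empty string.

    Looks at the word itself, or at its components if it is a compound.
    """
    parts = info.get("components") or [info]
    bad = []
    for part in parts:
        for conj in part.get("conj", []):
            for prop in conj.get("prop", []):
                value = prop.get("type")
                if not (isinstance(value, str) and value):
                    bad.append(prop)
    return bad


# Shaped like the output reported above, where 'type' came back as [].
broken = {"text": "食べた",
          "conj": [{"prop": [{"pos": "v1", "type": []}],
                    "reading": "食べる 【たべる】"}]}
# Shaped like the expected output.
healthy = {"text": "食べた",
           "conj": [{"prop": [{"pos": "v1", "type": "Past (~ta)"}],
                     "reading": "食べる 【たべる】"}]}

if props_missing_type(broken):
    print("conjugation types are empty; check conj.csv and *jmdict-data*")
```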