We query a set of Greek texts, hand-encoded for morphology and syntax (as treebanks) by Vanessa Gorman, to explore the complexity of Greek sentences. The treebanks and queries in this repository are published under a CC-BY license.
- The encoded texts (Alpheios dependency scheme), cloned from the Greek-Dependency-Trees repository, are in the data directory
- Various XQuery scripts to transform and analyze the files are in the scripts directory
- Reports produced by the scripts are in the info directory
Download the files or clone the repository. Install the BaseX XML database.
In BaseX, run the script create-grccomp-db.xq to create the grc-com database. Query the database by running the other scripts in the scripts/xq directory, adapting them as needed.
- Create the grc-com database: create-grccomp-db.xq
- Get basic information about the database (how many words, sentences, documents): db-basic-info.xq
- Get stats on sentence length: db-stats-sentence.xq
- Get stats on relations: db-stats-relations
- Which POS have role of PRED (and similar): list-pred-types.xq
- Which POS have role of COORD (and similar): list-coord-types.xq
- For a subset of sentences (based on number of elements, words etc), list lemmata: list-lemmata.xq
- For a lemma in a subset of sentences (based on number of elements), list its syntactic relations: lemma-list-functions.xq
- For a specific syntactic relation of lemma in a subset, list all sentences: relation-lemma-12-18-words.xq
- Find sentences with all basic roles (PRED, SBJ, OBJ, ADV): find-sentences-all-basic-roles.xq
- Find sentences with ellipsis (a role is missing and is artificially added during annotation), exactly 6 sentence elements: find-ellipsis.xq
- Find sentences with 12 words or less where PRED is adjective: find-sentences-with-pred-adj.xq
- Find sentences with 12 words or less where PRED is conjunction: find-sentences-with-pred-conj.xq
- Find sentences with 15 words or less without PRED: find-sentences-no-pred.xq
- Sentences with PRED and COORD dependent on sentence root: find-pred-coord-0.xq
- Find sentences with 12 words or less where the article is not ATR (or its variations): find-article-not-atr.xq
- Find sentences with COORD by asyndeton (u): find-coord-sentences-asyndeton.xq
- Find sentences with PRED_CO: find-coord-pred-co.xq
- Find sentences with some number of words where some word has some _CO function: find-suffix-co.xq
- Find infinitive used as PRED: find-pred-inf.xq
- Find sentences without AuxY: find-sentences-no-auxy.xq
- Find sentences with many AuxY: find-sentences-with-many-auxy.xq
- Find sentences without OBJ, PNOM, SBJ (and combinations): find-no-sbj-obj-pnom.xq
- Find sentences without nouns or adjectives: find-no-nouns.xq
- List syntactic roles of participles with frequencies of occurrences: find-participles-roles.xq
- Find substantivated participles: find-participles-substantivated.xq
- Find substantivated infinitives: find-infinitives-substantivated.xq
- Find sentences where article is head: find-sentences-with-subst-expr.xq
- Find sentences with transitive verbs as PRED without OBJ: find-sentences-no-obj.xq; the list of transitive verbs was compiled with find-verbs-obj.xq
- Find verbs ruling PNOM which appear without PNOM as well: find-sentences-no-pnom.xq; the list of verbs ruling PNOM was compiled with find-pnom-pred.xq
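
To illustrate the kind of filter these scripts apply, here is a minimal sketch in Python rather than XQuery. It is not a port of any script in this repository. It assumes the usual treebank layout (`sentence` elements containing `word` elements with a `relation` attribute), counts every `word` element including punctuation, and matches the relation label exactly, so variants such as PRED_CO are not treated as PRED. The sample data and the function name `sentences_without_relation` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented two-sentence sample in the assumed treebank layout (not corpus data).
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="w1" lemma="l1" postag="n-s---mn-" relation="SBJ" head="2"/>
    <word id="2" form="w2" lemma="l2" postag="v3spia---" relation="PRED" head="0"/>
    <word id="3" form="." lemma="." postag="u--------" relation="AuxK" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="w1" lemma="l1" postag="n-s---mv-" relation="ExD" head="0"/>
    <word id="2" form="!" lemma="!" postag="u--------" relation="AuxK" head="0"/>
  </sentence>
</treebank>"""


def sentences_without_relation(xml_text, relation="PRED", max_words=15):
    """Return (sentence id, word count) for short sentences lacking `relation`."""
    root = ET.fromstring(xml_text)
    hits = []
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        # Skip sentences longer than the word limit.
        if len(words) > max_words:
            continue
        # Collect the relation labels used in this sentence (exact labels only).
        relations = {word.get("relation") for word in words}
        if relation not in relations:
            hits.append((sentence.get("id"), len(words)))
    return hits


print(sentences_without_relation(SAMPLE))  # [('2', 2)]
```

The same function can look for other missing roles, for example `sentences_without_relation(SAMPLE, relation="SBJ")`.
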
- Database: grc-com
- Date: 2022-06-02+02:00
- Documents: 153
- Sentences: 26781
- Words: 633763
- Stats on relations: relations-stats.md
- Stats on PRED: pred-stats.md
- Stats on COORD: coord-stats.md
- Sentences with all basic roles (PRED, SBJ, OBJ, ADV) expressed: sentences-basic-roles.md
- Sentences with ellipsis (artificially added elements), 6 sentence elements: sentences-ellipsis-6.md
- Sentences with PRED adjective: sentences-pred-adj.md
- Sentences with PRED conjunction: sentences-pred-c.md
- Sentences without PRED relation: sentences-no-pred.md
- Sentences where the article is not ATR: sentences-article-not-atr.md
- Sentences with COORD performed by punctuation (asyndeton): sentences-coord-asyndeton.md
- Sentences with PRED_CO: sentences-pred-co.md
- Sentences with infinitives used as PRED: sentences-inf-pred.md
- Sentences without AuxY (particles): sentences-no-auxy.md
- Sentences with many AuxY: sentences-many-auxy.md
- Sentences without OBJ, PNOM, SBJ (and combinations): no-sbj-obj-pnom.md
- Sentences without nouns or adjectives: no-nouns-adj.md
- Sentences with transitive verbs (active) as PRED, no OBJ: sentences-trans-no-obj.md
- Syntactic roles of participles: roles-participles.md
- Sentences with substantivated participles: subst-participles.md
- Sentences with substantivated infinitives: subst-inf.md
- Sentences where article is head: article-head.md
- Sentences with verbs taking PNOM in which the verbs are PRED but have no PNOM: pnom-no-pnom.md
- Landing page with list of functions
- Basic information on treebanks
- Retrieve a subset of sentences based on word count (default: 12 to 18 elements)
- List lemmata in a subset of sentences (default: 12 to 18 elements)
- List relations (sentence functions) for a lemma (default: καί, 12 to 18 elements)
- For relation of lemma, list sentences in subset (default: καί as PRED, 12 to 18 elements)
- Retrieve a subset of sentences without participles
- Retrieve a subset of sentences without participles and subordinate conjunctions
- Retrieve a subset of sentences without participles, infinitives, and subordinate conjunctions
- Retrieve a subset based on number of words, with PRED and COORD dependent on sentence root
- Modules (xqm, directory /scripts/webapp/repo/)
  - Functions for analysing treebanks (in general): grccom-analysis.xqm
  - Functions for displaying HTML (in general): grccom.xqm
- Functions for individual pages (xq, directory /scripts/webapp/app/grccom)
  - Landing page: grccom-home.xq
  - Basic information on database: grccom-basic-ana.xq
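
The subset functions listed above (a word-count range, optionally excluding participles) can be sketched outside BaseX as follows. This is a rough Python illustration, not the code of the XQuery modules. It assumes that a participle is marked either by `t` in the first postag position or by `p` in the fifth (mood) position, and that every `word` element counts toward sentence length. The sample data and the names `is_participle` and `subset_without_participles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample (not corpus data): sentence 2 contains a participle.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="w1" postag="v3spia---" relation="PRED" head="0"/>
    <word id="2" form="w2" postag="n-s---ma-" relation="OBJ" head="1"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="w1" postag="v-sppamn-" relation="ADV" head="2"/>
    <word id="2" form="w2" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="3">
    <word id="1" form="w1" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
</treebank>"""


def is_participle(postag):
    """Assumption: participles have 't' in position 1 or 'p' in position 5."""
    padded = (postag or "").ljust(9, "-")
    return padded[0] == "t" or padded[4] == "p"


def subset_without_participles(xml_text, min_words=12, max_words=18):
    """Return ids of sentences within the word range that contain no participle."""
    root = ET.fromstring(xml_text)
    ids = []
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        if not (min_words <= len(words) <= max_words):
            continue
        if any(is_participle(word.get("postag")) for word in words):
            continue
        ids.append(sentence.get("id"))
    return ids


# The sample sentences are short, so the lower bound is reduced for the demo.
print(subset_without_participles(SAMPLE, min_words=2))  # ['1']
```
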
For syntactic roles, see the description by Giuseppe G. A. Celano, Guidelines for the Ancient Greek Dependency Treebank 2.0.
Data Format
The data in this treebank is provided as an XML document. Each
word contains the following attributes:
id: This is a unique identifier, and corresponds to the word's linear
position in the sentence. The first word in a sentence is given
id 1.
cid: This is a canonical identifier for the word within the larger corpus.
form: The token form of the word.
lemma: The base lemma from which the word is derived, in Beta Code.
head: The id of the word's parent. If a word depends on the sentence
root, its head is 0.
relation: The syntactic relation between the word and its parent. A
catalogue of syntactic tags can be found in the syntactic guidelines
mentioned above.
postag: The morphological analysis for the word. This field is 9
characters long, and corresponds to the following morphological
features:
1: part of speech
n noun
v verb
t participle
a adjective
d adverb
l article
g particle
c conjunction
r preposition
p pronoun
m numeral
i interjection
e exclamation
u punctuation
2: person
1 first person
2 second person
3 third person
3: number
s singular
p plural
d dual
4: tense
p present
i imperfect
r perfect
l pluperfect
t future perfect
f future
a aorist
5: mood
i indicative
s subjunctive
o optative
n infinitive
m imperative
p participle
6: voice
a active
p passive
m middle
e medio-passive
7: gender
m masculine
f feminine
n neuter
8: case
n nominative
g genitive
d dative
a accusative
v vocative
l locative
9: degree
c comparative
s superlative
---
For example, the postag for the noun "a)/ndra" is "n-s---ma-",
which corresponds to the following features:
1: n noun
2: -
3: s singular
4: -
5: -
6: -
7: m masculine
8: a accusative
9: -
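
The positional scheme above lends itself to a table-driven decoder. The following Python sketch transcribes the value tables from this section and decodes a nine-character postag into named features. The names `POSTAG_FIELDS` and `decode_postag` are chosen for the example only, and codes not listed above are reported as unknown rather than guessed.

```python
# One (feature name, code table) pair per postag position, in order.
POSTAG_FIELDS = [
    ("part of speech", {
        "n": "noun", "v": "verb", "t": "participle", "a": "adjective",
        "d": "adverb", "l": "article", "g": "particle", "c": "conjunction",
        "r": "preposition", "p": "pronoun", "m": "numeral",
        "i": "interjection", "e": "exclamation", "u": "punctuation",
    }),
    ("person", {"1": "first person", "2": "second person", "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {
        "p": "present", "i": "imperfect", "r": "perfect", "l": "pluperfect",
        "t": "future perfect", "f": "future", "a": "aorist",
    }),
    ("mood", {
        "i": "indicative", "s": "subjunctive", "o": "optative",
        "n": "infinitive", "m": "imperative", "p": "participle",
    }),
    ("voice", {"a": "active", "p": "passive", "m": "middle", "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
    }),
    ("degree", {"c": "comparative", "s": "superlative"}),
]


def decode_postag(postag):
    """Map a 9-character postag to a dict of the features that are set."""
    if len(postag) != len(POSTAG_FIELDS):
        raise ValueError(f"expected 9 characters, got {len(postag)}: {postag!r}")
    decoded = {}
    for (name, codes), code in zip(POSTAG_FIELDS, postag):
        if code == "-":
            continue  # feature not applicable to this word
        decoded[name] = codes.get(code, f"unknown ({code})")
    return decoded


print(decode_postag("n-s---ma-"))
# {'part of speech': 'noun', 'number': 'singular', 'gender': 'masculine', 'case': 'accusative'}
```
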
- Neven Jovanović (nevenjovanovic), Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb; orcid.org/0000-0002-9119-399X