macula-hebrew (םכלה)

Syntax trees, morphology, and linguistic annotations for the Hebrew Bible

This repository contains linguistic datasets for Hebrew, including data from:

MACULA Hebrew Linguistic Datasets © 2022 by Clear Bible, Inc., released under a Creative Commons CC BY 4.0 license.
The text of the Westminster Leningrad Codex, released into the public domain by the Groves Center, and available at tanach.us.
Morphology from the Open Scriptures Hebrew Bible, available on GitHub.
Syntax trees developed by Clear Bible, Inc. together with the Groves Center. (Note: Clear was formerly known as Global Bible Initiative from 2014-2020 and Asia Bible Society before that.) Recently, the Groves Center graciously released Westminster Hebrew Syntax without Morphology under a Creative Commons CC BY 4.0 license.
Word sense data from the United Bible Societies MARBLE project.
Cherith Glosses for the Hebrew Old Testament, by Andi Wu, Copyright (C) 2022 by Cherith Analytics, is licensed under a Creative Commons Attribution 4.0 International License ("CC BY 4.0").

During 2022, we intend to add further datasets, which are under development:

Synonyms: Which Hebrew words are related in meaning?
Semantic roles: Who does what to whom? (Agent, Verb, Patient …)
Participant referents: Who is “he,” “she,” or “it” in this sentence?
Semantic similarity: Which phrases and clauses are semantically similar to texts found elsewhere?

This data has been combined into a single set of trees. There are two variants of these trees, found in the following directories:

nodes contains this data in a set of nested Node elements suitable for many NLP systems and other systems that use recursive algorithms.
lowfat contains the same data in a form more suitable for some kinds of query systems and some kinds of display.
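
As a rough illustration of the recursive processing that the nodes variant is meant to support, the sketch below walks a tree of nested Node elements with Python's standard library. Only the element name Node is taken from the description above; the Cat attribute, the placeholder words, and the helper names leaves and depth are assumptions made up for this example and may not match the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Only the element name "Node" comes from the README;
# the "Cat" attribute and the placeholder words are invented for illustration.
SAMPLE = """
<Node Cat="S">
  <Node Cat="V">word1</Node>
  <Node Cat="NP">
    <Node Cat="N">word2</Node>
    <Node Cat="N">word3</Node>
  </Node>
</Node>
"""


def leaves(node):
    """Yield terminal Node elements (no Node children) in document order."""
    children = node.findall("Node")
    if not children:
        yield node
        return
    for child in children:
        yield from leaves(child)


def depth(node):
    """Return the number of Node levels from this node down to its deepest leaf."""
    children = node.findall("Node")
    return 1 + max((depth(child) for child in children), default=0)


root = ET.fromstring(SAMPLE.strip())
leaf_texts = [leaf.text for leaf in leaves(root)]
tree_depth = depth(root)
print(leaf_texts, tree_depth)
```

To try this on a real file from the nodes directory, ET.parse(path).getroot() could replace the inline sample, though the root element and attribute names would need to be checked against the data first.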