macula-hebrew (םכלה)
Syntax trees, morphology, and linguistic annotations for the Hebrew Bible
This repository contains linguistic datasets for Hebrew, including data from:
- MACULA Hebrew Linguistic Datasets © 2022 by Clear Bible, Inc, released under a Creative Commons CC BY 4.0 license.
- The text of the Westminster Leningrad Codex, released into the public domain by the Groves Center, and available at tanach.us.
- Morphology from the Open Scriptures Hebrew Bible, available on GitHub.
- Syntax trees developed by Clear Bible, Inc. together with the Groves Center. (Note: Clear was formerly known as Global Bible Initiative from 2014-2020 and Asia Bible Society before that.) Recently, the Groves Center graciously released Westminster Hebrew Syntax without Morphology under a Creative Commons CC BY 4.0 license.
- Word sense data from the United Bible Societies MARBLE project.
- Cherith Glosses for the Hebrew Old Testament, by Andi Wu, Copyright (C) 2022 by Cherith Analytics, is licensed under a Creative Commons Attribution 4.0 International License ("CC BY 4.0").
During 2022, we intend to add further datasets, which are under development:
- Synonyms: Which Hebrew words are related in meaning?
- Semantic roles: Who does what to whom? (Agent, Verb, Patient …)
- Participant referents: Who is “he,” “she,” or “it” in this sentence?
- Semantic similarity: Which phrases and clauses are semantically similar to texts found elsewhere?
This data has been combined into a single set of trees. There are two variants of these trees, found in the following directories:
- nodes: contains this data in a set of nested Node elements suitable for many NLP systems and other systems that use recursive algorithms.
- lowfat: contains the same data in a form more suitable for some kinds of query systems and some kinds of display.
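To illustrate why nested Node elements suit recursive algorithms, here is a minimal Python sketch that walks such a tree with the standard library. The sample XML, the `Cat` attribute, the placeholder words, and the helper names `leaves` and `depth` are all assumptions made for illustration. The real files in the nodes directory may use different attributes and carry far more annotation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT copied from the repository: only the idea of
# nested Node elements comes from the description above.
SAMPLE = """
<Node Cat="S">
  <Node Cat="V">
    <Node Cat="verb">word1</Node>
  </Node>
  <Node Cat="NP">
    <Node Cat="noun">word2</Node>
    <Node Cat="noun">word3</Node>
  </Node>
</Node>
"""


def leaves(node):
    """Yield terminal Node elements (those with no Node children) in document order."""
    children = node.findall("Node")
    if not children:
        yield node
        return
    for child in children:
        yield from leaves(child)


def depth(node):
    """Return the number of Node levels from this node down to its deepest leaf."""
    children = node.findall("Node")
    return 1 + max((depth(child) for child in children), default=0)


root = ET.fromstring(SAMPLE.strip())
words = [leaf.text for leaf in leaves(root)]
tree_depth = depth(root)
```

Both helpers recurse on the direct Node children only, so the same pattern should extend to other per-node computations, such as collecting every node of a given category.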
Copyright statements for the individual sources can be found in the MACULA Hebrew license.