CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020 focusing on Oceanic languages
If you use these data please cite
- the original source
Greenhill, S.J., Blust, R., & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.
- the derived dataset using the DOI of the particular released version you were using
This dataset is licensed under a CC-BY-4.0 license.
Available online at https://abvd.shh.mpg.de/austronesian/
Conceptlists in Concepticon:
You will need to have the lexibank dataset installed. It is probably best to set this up outside the dataset directory:
# set up and install a virtual environment
python -m venv env
source ./env/bin/activate
# clone git repository
git clone https://github.com/lexibank/abvdoceanic
# or update repository
cd abvdoceanic
git checkout main
git pull
cd ..
# install dataset
cd abvdoceanic
pip install -e .
cd ..
To make a nexus file, use the custom abvdoceanic.nexus command in cldfbench. The parameters are:
- --output=/path/to/filename.nex: the output file to write.
- --ascertainment={token}: add BEAST's ascertainment correction if you want. The token is one of:
  - overall: one ascertainment character is added for an overall correction.
  - word: one ascertainment character is added per word.
- --removecombined={int}: set the level at which to filter combined cognates.
# make a nexus file, with combined cognates removed above level 2:
cldfbench abvdoceanic.nexus --removecombined 2 --output abvdoceanic.nex
# ...with per-word ascertainment correction:
cldfbench abvdoceanic.nexus --ascertainment=word --removecombined 2 --output abvdoceanic.nex
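The difference between the two ascertainment modes is easier to see on a toy matrix. The sketch below is not the code behind abvdoceanic.nexus. It is a minimal illustration that assumes binary cognate characters grouped by word and a dummy all-absent character as the correction. The function name, the labels, the toy data and the handling of missing data (a per-word dummy becomes "?" when a variety has no data for that word) are all assumptions made for this example.

```python
def add_ascertainment(matrix, words, mode="overall"):
    """Add dummy all-absent characters to a binary cognate matrix.

    matrix: {taxon: {character_label: state}}, states are "0", "1" or "?".
    words:  {word: [character_label, ...]}, characters grouped by word.
    mode:   "overall" (one dummy column) or "word" (one dummy per word).
    Returns (ordered column labels, {taxon: row string}).
    """
    if mode not in ("overall", "word"):
        raise ValueError("mode must be 'overall' or 'word'")

    # Build the column order as (label, word, is_dummy) triples.
    columns = []
    if mode == "overall":
        columns.append(("_ascertainment", None, True))
    for word in sorted(words):
        if mode == "word":
            columns.append((f"{word}_ascertainment", word, True))
        for label in words[word]:
            columns.append((label, word, False))

    rows = {}
    for taxon, states in matrix.items():
        row = []
        for label, word, is_dummy in columns:
            if not is_dummy:
                row.append(states.get(label, "?"))
            elif word is not None and all(
                states.get(c, "?") == "?" for c in words[word]
            ):
                # Assumption: no data for this word, so the dummy is missing too.
                row.append("?")
            else:
                row.append("0")
        rows[taxon] = "".join(row)
    return [label for label, _, _ in columns], rows


# Hypothetical toy data: two varieties, two words, three cognate sets.
toy_matrix = {
    "LangA": {"hand_1": "1", "hand_2": "0", "eye_1": "1"},
    "LangB": {"hand_1": "0", "hand_2": "1", "eye_1": "?"},
}
toy_words = {"hand": ["hand_1", "hand_2"], "eye": ["eye_1"]}

for mode in ("overall", "word"):
    labels, rows = add_ascertainment(toy_matrix, toy_words, mode)
    print(mode, labels)
    for taxon, row in rows.items():
        print(" ", taxon, row)
```

With overall, the matrix gains a single extra column. With word, it gains one extra column per word, so the analysis can condition on each word's characters separately.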
- Varieties: 418 (linked to 411 different Glottocodes)
- Concepts: 191 (linked to 191 different Concepticon concept sets)
- Lexemes: 78,515
- Sources: 0
- Synonymy: 1.14
- Cognacy: 74,236 cognates in 9,490 cognate sets (2,308 singletons)
- Cognate Diversity: 0.12
- Invalid lexemes: 0
- Tokens: 392,135
- Segments: 432 (0 BIPA errors, 0 CLTS sound class errors, 431 CLTS modified)
- Inventory size (avg): 30.64
- Entries missing sources: 78515/78515 (100.00%)
Name | GitHub user | Description | Role |
---|---|---|---|
Simon J. Greenhill | @SimonGreenhill | maintainer | Author |
Mary Walworth | @maryewal | maintainer | Author |
Isaac Stead | @antipodite | maintainer | Author |
Tihomir Rangelov | @tihomirrangelov | maintainer | Author |
Johann-Mattis List | @lingulist | orthography profiles | Editor |
Frederic Blum | @FredericBlum | orthography profiles | Editor |
The following CLDF datasets are available in cldf:
- CLDF Wordlist at cldf/cldf-metadata.json
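
The wordlist can be inspected without special tooling, since CLDF tables are plain CSV files. The sketch below uses only the Python standard library and assumes the default CLDF column names Language_ID and Parameter_ID. The helper summarise_forms and the sample rows are made up for illustration, and the file name cldf/forms.csv in the comment is an assumption: cldf/cldf-metadata.json is the authoritative description of the tables and their columns.

```python
import csv
import io
from collections import Counter


def summarise_forms(rows):
    """Count varieties, concepts and lexemes in CLDF FormTable rows.

    rows: iterable of dicts, e.g. from csv.DictReader, with at least
    the columns Language_ID and Parameter_ID.
    """
    per_language = Counter()
    concepts = set()
    lexemes = 0
    for row in rows:
        per_language[row["Language_ID"]] += 1
        concepts.add(row["Parameter_ID"])
        lexemes += 1
    return {
        "varieties": len(per_language),
        "concepts": len(concepts),
        "lexemes": lexemes,
        "per_language": per_language,
    }


# Made-up sample rows in the shape of a CLDF FormTable.
sample = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form\n"
    "1,langA,hand,lima\n"
    "2,langA,eye,mata\n"
    "3,langB,hand,rima\n"
)
summary = summarise_forms(csv.DictReader(sample))
print(summary["varieties"], summary["concepts"], summary["lexemes"])

# For the real data, assuming the forms table is stored as cldf/forms.csv:
# with open("cldf/forms.csv", encoding="utf-8", newline="") as f:
#     print(summarise_forms(csv.DictReader(f)))
```

For anything beyond quick counts, a CLDF-aware library that reads cldf/cldf-metadata.json is the safer choice, because it resolves column names and data types from the metadata rather than assuming them.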