CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020 focusing on Oceanic languages

How to cite

If you use these data, please cite:

- the original source:
  Greenhill, S.J., Blust, R., & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.
- the derived dataset, using the DOI of the particular released version you used.

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at https://abvd.shh.mpg.de/austronesian/

Conceptlists in Concepticon:

Blust-2008-210

Notes

Making a Nexus File:

You will need to have the lexibank dataset installed. It is probably best to do this outside the dataset directory:

# set up and install a virtual environment
python -m venv env
source ./env/bin/activate

# clone git repository
git clone https://github.com/lexibank/abvdoceanic

# or update an existing clone (git clone above creates the directory abvdoceanic)
cd abvdoceanic
git checkout main
git pull
cd ..

# install dataset in editable mode
cd abvdoceanic
pip install -e .
cd ..

To make a Nexus file, use the custom abvdoceanic.nexus command in cldfbench. The parameters are:

--output=/path/to/filename.nex: the output file to write.
--ascertainment={token}: optionally add BEAST's ascertainment correction. The token is one of:
- overall: one ascertainment character added for overall correction.
- word: per-word ascertainment correction.
--removecombined={int}: the level at which to filter combined cognates.

# make a nexus file, with combined cognates removed above level 2:
cldfbench abvdoceanic.nexus --removecombined 2 --output abvdoceanic.nex

# ...with per-word ascertainment correction:
cldfbench abvdoceanic.nexus --ascertainment=word --removecombined 2 --output abvdoceanic.nex
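After running one of the commands above, a quick sanity check on the output can catch an empty or truncated file before it goes into a phylogenetic analysis. The sketch below is not part of cldfbench: check_nexus is a hypothetical helper, and it assumes only that the generated file begins with the conventional #NEXUS header line.

```shell
# Hypothetical helper (not part of cldfbench): succeed only if the file
# exists and its first line starts with the conventional "#NEXUS" header.
check_nexus() {
    [ -f "$1" ] && head -n 1 "$1" | grep -qi '^#NEXUS'
}

# Usage, assuming abvdoceanic.nex was written by one of the commands above.
if [ -f abvdoceanic.nex ]; then
    if check_nexus abvdoceanic.nex; then
        echo "abvdoceanic.nex starts with a #NEXUS header"
    else
        echo "abvdoceanic.nex does not start with a #NEXUS header" >&2
    fi
fi
```

This checks the header only; it does not validate the character matrix or the ascertainment settings.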

Statistics

Varieties: 418 (linked to 411 different Glottocodes)
Concepts: 191 (linked to 191 different Concepticon concept sets)
Lexemes: 78,515
Sources: 0
Synonymy: 1.14
Cognacy: 74,236 cognates in 9,490 cognate sets (2,308 singletons)
Cognate Diversity: 0.12
Invalid lexemes: 0
Tokens: 392,135
Segments: 432 (0 BIPA errors, 0 CLTS sound class errors, 431 CLTS modified)
Inventory size (avg): 30.64

Possible Improvements:

Entries missing sources: 78515/78515 (100.00%)

Contributors

Name	GitHub user	Description	Role
Simon J. Greenhill	@SimonGreenhill	maintainer	Author
Mary Walworth	@maryewal	maintainer	Author
Isaac Stead	@antipodite	maintainer	Author
Tihomir Rangelov	@tihomirrangelov	maintainer	Author
Johann-Mattis List	@lingulist	orthography profiles	Editor
Frederic Blum	@FredericBlum	orthography profiles	Editor

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF Wordlist at cldf/cldf-metadata.json
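
To see which data files the Wordlist is made of, one option is to list the "url" entries in the metadata file. This is a minimal sketch, assuming the metadata follows the usual CSVW layout in which each table description carries a "url" property; list_cldf_tables is a hypothetical helper, and the grep pattern will also pick up any other "url" key that happens to appear in the file.

```shell
# Hypothetical helper: print the value of every "url" property found in a
# CLDF metadata file, one per line. Assumes each table is declared with a
# "url" key, as in CSVW metadata.
list_cldf_tables() {
    grep -o '"url": *"[^"]*"' "$1" | sed 's/.*"url": *"\([^"]*\)"/\1/'
}

# Usage, run from the root of the repository.
if [ -f cldf/cldf-metadata.json ]; then
    list_cldf_tables cldf/cldf-metadata.json
fi
```

For anything beyond a quick look, a proper JSON or CLDF reader is the safer choice, since this pattern matching does not parse the JSON structure.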