/audio-cmn

Chinese (zh-cmn) open-data audio files for 8,596 HSK words and 1,707 syllables.


Audio-cmn aims to provide high-quality, easy-to-use audio recordings of Chinese words for modern web and mobile applications. Audio-cmn is at once:

  • an original work: the recording of Chinese syllables
  • a curation work, reusing pre-existing audio from SWAC Recorder
  • a post-processing work, providing light, optimised .mp3 files rather than the huge original .flac files. These audios are thus suitable for mobile application development.

Voices

| # of items | Naming | Set's specifics | Authorship |
|---|---|---|---|
| 1707 | cmn-zi4.mp3 | syllables v.2 | Chen Wang, CC-by-sa |
| +8000 | cmn-名字.mp3 | HSK_2000 list (words, zi) | Yue Tan, CC-by-sa |

Structure

Types of data:

  • .../syllabs/cmn-{tonedPinyin}.mp3 : 1,707 Chinese syllables (all)
  • .../hsk/cmn-{hanzi}.mp3 : 8,596 HSK_2000 words and characters

Qualities

  • /96k/ - best audio quality; the improvement over 64k is not perceptible.
    • no syllabs folder
    • /96k/hsk/
  • /64k/ - optimal audio quality for voice recording.
    • /64k/syllabs/
    • /64k/hsk/
  • /24k-abr/ - aggressively optimized: ~2 times lighter, for 80% of the audio quality.
    • /24k-abr/syllabs/
    • /24k-abr/hsk/
  • /18k-abr/ - aggressively optimized: ~3 times lighter, for 60% of the audio quality.
    • /18k-abr/syllabs/
    • /18k-abr/hsk/
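
The layout above maps a quality tier, a set and an item to exactly one file path. As a minimal sketch of that mapping, the function below builds the relative path of a recording. `audio_path` is a hypothetical helper written for illustration, not something shipped in the repository, and it assumes the folder names listed above are exact.

```shell
# Sketch only: audio_path is a hypothetical helper, not part of the repository.
# It builds the relative path of one recording from the documented layout.
audio_path() {
  local quality="$1"   # 96k, 64k, 24k-abr or 18k-abr
  local subset="$2"    # syllabs or hsk
  local item="$3"      # toned pinyin (e.g. zi4) or hanzi (e.g. 名字)

  case "$quality" in
    96k|64k|24k-abr|18k-abr) ;;
    *) echo "unknown quality: $quality" >&2; return 1 ;;
  esac
  case "$subset" in
    syllabs|hsk) ;;
    *) echo "unknown set: $subset" >&2; return 1 ;;
  esac
  # The 96k tier is listed without a syllabs folder.
  if [ "$quality" = "96k" ] && [ "$subset" = "syllabs" ]; then
    echo "no syllabs folder at 96k" >&2
    return 1
  fi
  printf '%s/%s/cmn-%s.mp3\n' "$quality" "$subset" "$item"
}
```

For instance, `audio_path 64k hsk 名字` prints `64k/hsk/cmn-名字.mp3`, while `audio_path 96k syllabs zi4` fails with a message on stderr. Prefix the result with wherever the files are hosted or checked out.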

Dependencies

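sudo apt install libav-tools    # provides avconv; on recent Debian/Ubuntu releases ffmpeg is the usual replacement
sudo apt install lame           # MP3 encoder
curl -L -C - 'http://download.shtooka.net/cmn-caen-tan_flac.tar' -o ./cmn-caen-tan_flac.tar   # resumable download of the original recordings
mkdir -p ./flac
tar -xf './cmn-caen-tan_flac.tar' -C ./flac/    # extract the archive into ./flac/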
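
To show how the two dependencies fit together, here is a minimal sketch of one .flac to .mp3 conversion: avconv decodes to WAV on stdout and lame encodes it at an average bitrate. This is an assumption about a plausible pipeline, not a copy of the repository's own conversion script. `mp3_target` and `flac_to_mp3` are hypothetical helpers, the `DECODER` variable is invented here so that ffmpeg can be swapped in, and the sketch assumes the source file already carries the final `cmn-…` base name.

```shell
# Sketch only: hypothetical helpers, not the repository's conversion script.

# Map a source .flac and an output folder to the target .mp3 path,
# e.g. ./flac/cmn-你好.flac + ./24k-abr/hsk -> ./24k-abr/hsk/cmn-你好.mp3
mp3_target() {
  local src="$1" out_dir="$2"
  local base
  base="$(basename "$src" .flac)"
  printf '%s/%s.mp3\n' "$out_dir" "$base"
}

# Decode one .flac to WAV on stdout, then encode it with lame
# at the requested average bitrate (ABR, in kbps).
flac_to_mp3() {
  local src="$1" out_dir="$2" bitrate="$3"
  local decoder="${DECODER:-avconv}"   # assumption: set DECODER=ffmpeg where avconv is unavailable
  mkdir -p "$out_dir"
  "$decoder" -loglevel error -i "$src" -f wav - \
    | lame --quiet --abr "$bitrate" - "$(mp3_target "$src" "$out_dir")"
}
```

A call such as `flac_to_mp3 ./flac/cmn-你好.flac ./24k-abr/hsk 24` would then write `./24k-abr/hsk/cmn-你好.mp3`. Whether the 64k and 96k tiers use average or constant bitrate is not stated here, so adapt the lame options accordingly.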

Missing audios?

The current HSK audio database was built on the official HSK 2000 word list, published in 2000, which it covers almost fully (at least 8,596 out of ~8,800 items). A comparison with the more recent HSK 2012 word list is available and can be run with:

bash ./hsk-missing-audios.bash  HSK2012_all.txt    # List missing audios, compared to input list of words
bash ./missing-audios.bash --help              # Tiny manual

Current difference: 582 HSK 2012 words still lack human-recorded audio. See the files in ./lists/ .
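
The comparison itself amounts to checking, for each word in a list, whether a matching `cmn-{hanzi}.mp3` file exists. The sketch below illustrates that idea only; it is not the repository's script, `list_missing` is a hypothetical name, and it assumes the input list holds one word per line.

```shell
# Sketch only: list_missing is a hypothetical helper; the repository's
# own script may work differently.
# Prints every word of the list that has no cmn-{word}.mp3 in the folder.
list_missing() {
  local list="$1"   # text file, assumed to hold one word per line
  local dir="$2"    # folder holding cmn-{hanzi}.mp3 files
  local word
  # The extra test keeps a last line that lacks a trailing newline.
  while IFS= read -r word || [ -n "$word" ]; do
    [ -z "$word" ] && continue                       # skip blank lines
    [ -f "$dir/cmn-$word.mp3" ] || printf '%s\n' "$word"
  done < "$list"
}
```

Running `list_missing HSK2012_all.txt ./64k/hsk | wc -l` would give a count of missing recordings, provided the list file really follows the one-word-per-line assumption.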

Credits

  • Speakers -- see the table above
  • Hugo Lopez, PLIDAM, INALCO -- Project management, repository, audio compression, file renaming
  • Nicolas Vion -- recording software & technical support

Log

  • v.0.1.0: clean up data by deleting the cmn-*5.ext items, which were copies of cmn-*1.mp3
  • v.0.1.1: add ./18k-abr (<40MB), an optimized version of ./64kb with understandable sound quality
  • v.0.1.2: improve README.md ; add ./lists/ and a script for comparison with the HSK 2012 list.
  • v.2.0.0: [BREAKING CHANGE] Merge the former /hskzi/ and /hsk/ back together. [Others]: fix a critical bug on some audios ; add 24k and 96k ; share the conversion commands via compress-raw.bash

License

  • CC-by-sa. See the table above for authors.