Neural network auditory processing code in Go focused on filtering speech wav files via mel filters

Primary language: Go. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

auditory

Auditory is our repository for auditory processing code in Go (golang), focused on filtering speech wav files with mel filters. A further step using Gabor filters prepares the mel output as input to neural networks. The processing code is split into four packages (sound, mel, dft and agabor) that can be used independently. Example code is in examples/processspeech.

Packages

dft

  • The 'dft' package performs a Fourier transform on the sound samples passed in and computes their power spectrum.

mel

  • The 'mel' package creates a set of mel filter banks and applies them to the power data to create a spectrogram.

agabor

  • The 'agabor' package produces an edge detector that detects oriented contrast transitions between light and dark, which can be convolved with the output of the mel processing.
  • There are two structs, FilterSet and Filter. You must create a FilterSet even if you are adding only one Gabor Filter.

sound

  • sound.go contains code for loading a wav file into a buffer and then converting to a floating point tensor. There are functions for trimming and padding.
  • sndenv.go is a higher-level API that processes a sound in segments, calling the sound, mel and gabor code.
  • playwav.go can be called to play a wav file.
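
The buffer handling described above can be sketched without the package. The code below is not the 'sound' package's API. It assumes 16-bit PCM input, takes "trimming" to mean dropping near-silent samples at both ends (the package may mean something else), and uses hypothetical helper names throughout.

```go
package main

import "fmt"

// pcm16ToFloat scales signed 16-bit PCM samples to float32 values in [-1, 1).
func pcm16ToFloat(pcm []int16) []float32 {
	out := make([]float32, len(pcm))
	for i, s := range pcm {
		out[i] = float32(s) / 32768
	}
	return out
}

func abs32(v float32) float32 {
	if v < 0 {
		return -v
	}
	return v
}

// trimSilence drops leading and trailing samples quieter than threshold.
func trimSilence(samples []float32, threshold float32) []float32 {
	start, end := 0, len(samples)
	for start < end && abs32(samples[start]) < threshold {
		start++
	}
	for end > start && abs32(samples[end-1]) < threshold {
		end--
	}
	return samples[start:end]
}

// padTo returns samples extended with trailing zeros to length n.
// Input that is already long enough is returned unchanged.
func padTo(samples []float32, n int) []float32 {
	if len(samples) >= n {
		return samples
	}
	out := make([]float32, n)
	copy(out, samples)
	return out
}

// frames splits samples into windows of winLen samples, advancing by step.
// A trailing partial window is dropped.
func frames(samples []float32, winLen, step int) [][]float32 {
	if winLen <= 0 || step <= 0 {
		return nil
	}
	var out [][]float32
	for start := 0; start+winLen <= len(samples); start += step {
		out = append(out, samples[start:start+winLen])
	}
	return out
}

func main() {
	pcm := []int16{0, 0, 16384, -8192, 0} // assumed toy input
	samples := padTo(trimSilence(pcm16ToFloat(pcm), 0.1), 8)
	fmt.Println(len(frames(samples, 4, 2)), "segments of 4 samples")
}
```

Each segment would then go through the power spectrum, mel and Gabor steps in turn, which is the kind of chaining that sndenv.go is described as doing.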

speech

  • The speech package has structs for Sequence and Unit.
  • Packages for specific sound sets (corpora) include code to load these sound files with timing information, along with lookup code.
    • Package timit covers the phones of the TIMIT database. See "Speaker-Independent Phone Recognition Using Hidden Markov Models", Kai-Fu Lee and Hsiao-Wuen Hon, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, 1989.
    • Package grafestes contains the consonant-vowel names and timing information for the sound sequences used in the research reported in "Listening Through Voices: Infant Statistical Word Segmentation Across Multiple Speakers", Katherine Graf Estes & Lew-Williams, 2015.
    • Package synthcvs contains consonant-vowel names and timing information for the synthesized speech generated with gnuspeech. These sounds are similar to the ones used by Saffran, Aslin & Newport, "Statistical Learning by 8-Month-Old Infants", 1996.