
Use phoneme-based regular expressions to find words in the Carnegie-Mellon Pronouncing Dictionary.

Primary language: Python. License: MIT.

Sylvia

Search pronunciations in the CMU Pronouncing Dictionary using a regular-expression-like syntax. Input-format regular expressions are lightly preprocessed into Python-format regular expressions, and then mapped over an encoded version of cmudict. Results are sorted by popularity using Peter Norvig’s list of word popularities derived from Google’s N-Gram dataset.

Here for the Emacs library?

You can skip this and jump directly into the sylvia-mode README!

Installation

brandon@brandon-babypad-linux ~> pip2 install sylvia

Usage

Interactive Sylvia prompt:

brandon@brandon-babypad-linux ~> python2 -m sylvia

    Type 'help' for options, press enter to quit.

sylvia> 

Run one-off command:

brandon@brandon-babypad-linux ~> python2 -m sylvia -c "regex G #* AE #* IH #* %"
Gravity             Graphical           Grandchildren       Garrison            Graphically         
Gallegos            Gravitate           Garretson           Gastineau           Gallimard           
Galligan            Grandison           Gallivan            Glatfelter          Garibay             
Garelick            Garrigan            Garriga             Gravitates          Galipeau            
Gavigan             Gamelin             Gateley             Grandillo           Galipault           
Garringer           Gradison            Grandchildren's     Glastetter          Garity              
Galliher            Gantenbein

Commands

Sylvia’s functionality is divided into subcommands. Each can be run from the interactive prompt or as a one-off command from your system shell.

regex

This is the most powerful feature of Sylvia. It allows searches of cmudict based on phoneme patterns.

Sylvia’s query format is nearly identical to traditional Python 2 regular expressions, with the exception that it is intended not to match against patterns of characters, but rather patterns of phonemes. To construct a regular expression query for Sylvia, remember the following rules:

  1. Whitespace must be used to delimit consecutive phoneme literals. It may also be used anywhere else in the regular expression, as whitespace is meaningless in the context of a phoneme sequence, and will be stripped during preprocessing.
  2. `#` is a shortcut for “any consonant sound”
  3. `@` is a shortcut for “any vowel sound”
  4. `%` is a shortcut for “any syllable”, and is equivalent to `#*@#*`
  5. Otherwise, whatever flies with Python’s regular expression format will work in Sylvia. Just use some common sense, as some things (such as character classes) will be wholly inapplicable to searches in phoneme-space.
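The rules above can be sketched in a few lines of Python. This is not Sylvia’s actual preprocessor, only a minimal illustration under stated assumptions: every phoneme is encoded as a single character (the `CODE` table is hypothetical), the query must match the whole pronunciation, and the helper names `encode`, `translate` and `matches` are made up for this example. Regex escapes that contain capital letters (such as `\S`) are not handled.

```python
import re

# The 39 cmudict phonemes, split into vowels and consonants.
VOWELS = "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split()
CONSONANTS = "B CH D DH F G HH JH K L M N NG P R S SH T TH V W Y Z ZH".split()

# Assumed encoding (not Sylvia's): one character per phoneme, so an ordinary
# character regex can walk a pronunciation. Vowels -> 'a'..'o', consonants -> 'A'..'X'.
CODE = {p: chr(ord("a") + i) for i, p in enumerate(VOWELS)}
CODE.update({p: chr(ord("A") + i) for i, p in enumerate(CONSONANTS)})

ANY_VOWEL = "[a-o]"
ANY_CONSONANT = "[A-X]"
SHORTCUTS = {
    "#": ANY_CONSONANT,
    "@": ANY_VOWEL,
    "%": ANY_CONSONANT + "*" + ANY_VOWEL + ANY_CONSONANT + "*",  # same as #*@#*
}

# A query token is a phoneme literal, a shortcut symbol, or a run of whitespace.
TOKEN = re.compile(r"[A-Z]+|[#@%]|\s+")


def encode(pronunciation):
    """Encode a space-separated phoneme string such as 'R IY D IH NG'."""
    return "".join(CODE[p] for p in pronunciation.split())


def translate(query):
    """Turn a Sylvia-style query into a Python regex over encoded phonemes."""
    def replace(match):
        token = match.group(0)
        if token.isspace():
            return ""                # rule 1: whitespace is stripped
        if token in SHORTCUTS:
            return SHORTCUTS[token]  # rules 2-4: '#', '@' and '%'
        return CODE[token]           # phoneme literal; KeyError if unknown
    return TOKEN.sub(replace, query)


def matches(query, pronunciation):
    # Assumption: the query has to match the whole pronunciation.
    pattern = "(?:%s)\\Z" % translate(query)
    return re.match(pattern, encode(pronunciation)) is not None


print(translate("R@D@NG"))               # O[a-o]C[a-o]M
print(matches("R@D@NG", "R IY D IH NG")) # True
```

Everything the sketch does not recognize (parentheses, `|`, `*` and so on) passes through untouched, which is why ordinary Python regex syntax keeps working around the phoneme tokens.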

Use of this command is as follows:

sylvia> regex {regex tokens}

Consult Carnegie Mellon’s cmudict documentation to learn more about the phoneme set.

Consult the Python docs to learn more about Python’s regex format.

Examples

Find words starting with zero or more consonant sounds, followed by the “long E” sound (phoneme IY), followed by zero or more consonant sounds, followed by the “ed” sound (the phoneme sequence EH D):

sylvia> regex #* IY #* EH D
Steelhead     Seabed        Beachhead     Retread       Behead 

Find all six-syllable words whose first syllable uses the “short i” sound (phoneme IH) and which end in either the D or the P phoneme:

sylvia> regex #*IH%%%%%(D|P)
Differentiated        Individualized        Deteriorated          Institutionalized     
Incapacitated         Internationalized     Interrelationship     Misappropriated       
Disassociated         Discombobulated       Insubstantiated       

Note here that only five % symbols are needed, as a single vowel sound constitutes a single syllable, and we explicitly call out the first vowel sound via IH.
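Because one vowel sound makes one syllable, counting syllables comes down to counting vowel phonemes. The sketch below shows that idea; `count_syllables` is a hypothetical helper, not part of Sylvia, and the two pronunciations are the ones printed in the `lookup` examples of this README.

```python
# The 15 vowel phonemes used by cmudict.
VOWELS = set("AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split())


def count_syllables(pronunciation):
    """Count the vowel phonemes in a space-separated phoneme string."""
    return sum(1 for phoneme in pronunciation.split() if phoneme in VOWELS)


print(count_syllables("T ER K M EH N IH S T AE N"))  # 4 (turkmenistan)
print(count_syllables("T AH M EY T OW"))             # 3 (tomato)
```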

Find all words that start with the R sound, followed by some vowel, followed by the D sound, followed by another vowel, followed by the NG phoneme:

sylvia> regex R@D@NG
Reading     Riding      Redding     Raiding     Ridding     Reding      Rodding     Ruding      Rawding

lookup

If you just want to look up the pronunciations for a word, you can do that too. This can be a good way to quickly learn the phonemes for a particular sound when constructing queries. Due to cultural and geographic variations in pronunciation, this command can return multiple sequences.

Use of this command is as follows:

sylvia> lookup {word}

Examples

sylvia> lookup turkmenistan
T ER K M EH N IH S T AE N     
sylvia> lookup capture
K AE P CH ER     
sylvia> lookup tomato
T AH M EY T OW     T AH M AA T OW     
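To show where multiple sequences can come from, here is a rough sketch of reading cmudict-style entries, where alternate pronunciations are listed as `WORD(1)`, `WORD(2)` and so on. It is not Sylvia’s loader: `load` and `lookup` are hypothetical names, the sample lines are hand-typed in the cmudict layout for illustration, and dropping the stress digits is an assumption based on the output shown above.

```python
import re
from collections import defaultdict

# Hand-typed sample in the cmudict layout (illustrative, not the real file).
SAMPLE = """\
;;; comment lines start with three semicolons
TOMATO  T AH0 M EY1 T OW2
TOMATO(1)  T AH0 M AA1 T OW2
CAPTURE  K AE1 P CH ER0
"""

# WORD or WORD(n), some whitespace, then the phoneme sequence.
ENTRY = re.compile(r"^(?P<word>[^\s(]+)(?:\(\d+\))?\s+(?P<phonemes>.+)$")


def load(lines):
    """Map each lower-cased word to a list of its pronunciations."""
    pronunciations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        match = ENTRY.match(line)
        if match is None:
            continue
        # Assumption: stress markers (0, 1, 2) are dropped.
        phonemes = [p.rstrip("012") for p in match.group("phonemes").split()]
        pronunciations[match.group("word").lower()].append(" ".join(phonemes))
    return pronunciations


def lookup(pronunciations, word):
    return pronunciations.get(word.lower(), [])


table = load(SAMPLE.splitlines())
print(lookup(table, "tomato"))  # ['T AH M EY T OW', 'T AH M AA T OW']
```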

rhyme

Sylvia can act as a rhyming dictionary, returning words which rhyme with a given word. There are three “rhyme levels”, which define how rhymes are determined.

  1. `perfect` lists all words which contain the same sequence of phonemes as the given word, including and following the first vowel in the given pronunciation. Before that vowel, the matched words can contain any sounds.
  2. `default` is the same as `perfect`, except that additional consonant sounds can be interspersed between the matched sequence phonemes.
  3. `loose` is similar, except it ignores consonant sounds entirely.

Use of this command is as follows:

sylvia> rhyme {rhyme-level} {word}

rhyme-level can be omitted if default behavior is desired.
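One way to read the three rhyme levels is as three regular expressions built from the target word’s pronunciation. The sketch below follows that reading; it is an interpretation of the descriptions above, not Sylvia’s code. The names `rhyme_pattern` and `rhymes` are hypothetical, and allowing extra consonants after the last matched phoneme is an assumption inferred from the example output (for instance “Matters” rhyming with “chatter”).

```python
import re

VOWELS = "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split()
CONSONANTS = "B CH D DH F G HH JH K L M N NG P R S SH T TH V W Y Z ZH".split()

# Pronunciations are written as "PH PH PH " (each phoneme followed by a space)
# so that phoneme boundaries are unambiguous inside the regex.
ANY_PHONEMES = r"(?:\S+ )*"
ANY_CONSONANTS = "(?:(?:%s) )*" % "|".join(CONSONANTS)


def rhyme_pattern(level, phonemes):
    """Build a regex for the given rhyme level from a list of phonemes."""
    # Raises StopIteration if the pronunciation has no vowel at all.
    first_vowel = next(i for i, p in enumerate(phonemes) if p in VOWELS)
    tail = phonemes[first_vowel:]
    if level == "perfect":
        body = "".join(p + " " for p in tail)
    elif level == "default":
        body = "".join(p + " " + ANY_CONSONANTS for p in tail)
    elif level == "loose":
        body = "".join(p + " " + ANY_CONSONANTS for p in tail if p in VOWELS)
    else:
        raise ValueError("unknown rhyme level: %r" % level)
    # Anything may precede the first vowel of the rhyming part.
    return re.compile(ANY_PHONEMES + body + r"\Z")


def rhymes(level, target, candidate):
    text = "".join(p + " " for p in candidate.split())
    return rhyme_pattern(level, target.split()).match(text) is not None


# Hand-typed pronunciations for illustration.
CHATTER = "CH AE T ER"
print(rhymes("perfect", CHATTER, "M AE T ER"))        # matter   -> True
print(rhymes("perfect", CHATTER, "AE F T ER"))        # after    -> False
print(rhymes("default", CHATTER, "AE F T ER"))        # after    -> True
print(rhymes("loose", CHATTER, "S T AE N D ER D"))    # standard -> True
```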

There are plans to improve these models by matching phonemes based on their vocal characteristics. For example, all nasal phonemes may be considered matches by default, or all plosive sounds, etc. The behavior documented above is subject to change at any time.

Examples

List words which rhyme with “chatter”, using the perfect algorithm.

sylvia> rhyme perfect chatter
Matter             Latter             Batter             Mater              Platter            
Scatter            Flatter            Shatter            Hatter             Splatter           
Fatter             Patter             Antimatter         Clatter            Spatter            
Schlatter          Blatter            Natter             Sater              Satter             
Slatter            Tatter             Mcphatter          Chitterchatter     Smatter            
Vanatter           Vannater           Vatter             Vannatter          Mcfatter           
Wildcatter         

…using the default algorithm…

sylvia> rhyme chatter        
After                  Chapter                Matter                 Master                 
Factors                Factor                 Pattern                Faster                 
Matters                Webmaster              Patterns               Adapter                
Contractor             Contractors            Disaster               Actor                  
Masters                Latter                 Chapters               Actors                 
Adapters               Lancaster              Saturn                 Adaptor                
Pastor                 Thereafter             Tractor                Scattered              
Disasters              Ticketmaster           Napster                Laughter               
Reactor                Adaptors               Baxter                 Stratford              
Blaster                Lantern                Bastard                Maxtor                 
Tractors               Shattered              Plaster                Hereafter              
Subchapter             Batter                 Broadcasters           Antwerp                
Raptor                 Mater                  Platter                Scatter                
Hamster                Raster                 Subcontractor          Reactors               
Pastors                Subcontractors         Broadcaster            Mastered
... many more...

…and using the loose algorithm.

sylvia> rhyme loose chatter
After                  Standard               Password               Chapter                
Standards              Rather                 Matter                 Cancer                 
Answer                 Master                 Transfer               Answers                
Factors                Factor                 Pattern                Faster                 
Matters                Manner                 Webmaster              Patterns               
Hampshire              Adapter                Contractor             Banner                 
Contractors            Alexander              Capture                Disaster               
Actor                  Masters                Traveler               Latter                 
Albert                 Chapters               Packard                Answered               
Scanner                Bachelor               Actors                 Transfers              
Adverse                Amber                  Tracker                Transferred            
Planner                Hacker                 Commander              Adapters               
Scanners               Manufactured           Stanford               Manufacture            
Anchor                 Gathered               Travelers              Captured               
Grammar                Hazard                 Anger                  Gather                 
Lancaster              Hammer                 Manor                  Programmer             
Hazards                Bradford               Madagascar             Saturn                 
Banners                Passwords              Adaptor                Pastor                 
Hamburg                Ladder                 Flashers               Programmers            
Planners               Thereafter             Chancellor             Frankfurt              
Tractor                Wagner                 Hackers                Scattered              
Ballard                Disasters              Handler                Chandler               
Sanders                Ticketmaster           Napster                Banker                 
Dancer                 Dancers                Jasper                 Laughter               
Backward               Panthers               Captures               Bladder                
Sampler                Panther                Reactor                Stafford               
Backwards              Adaptors               Manufactures           Glamour                
Baxter                 Stratford              Blackburn              Amherst                
Blaster                Tavern                 Lambert                Fracture 
...many, many more...

infer

Sylvia can infer the pronunciation of unknown words using its own rule-based text-to-phoneme engine. Don’t expect great performance, though: written English is only ostensibly phonetic, and rule-based approaches are not fantastic. Any deep-learning-based solution to this problem is likely to beat the snot out of Sylvia’s engine.

Use of this command is as follows:

sylvia> infer {word}

Examples

Infer a pronunciation for the word “rooster”, then compare to the value from lookup.

sylvia> infer rooster
R UW S T ER     

sylvia> lookup rooster
R UW S T ER 

Infer pronunciations for some made-up words.

sylvia> infer rafloy
R AE F L OY     

sylvia> infer rabbilt
R AE B IH L T     

sylvia> infer fliberdoodle
F L IH B ER D UW D AH L   

lregex

Sylvia can look up words based on normal regular expressions. This command doesn’t touch anything phonetic, but may be useful in the same use cases as Sylvia itself.

Use of this command is as follows:

sylvia> lregex {regex tokens}

Examples

Find all words which are spelled with a C at the start, a P at the end, and which contain either a T or a D.

sylvia> lregex c.*(t|d).*p
Citizenship         Craftsmanship       Countertop          Courtship           Catnip              
Citicorp            Conservatorship     Catsup              Crudup              Catchup             
Colstrip            Catnap              Cutlip              Coltharp            
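The same kind of search can be approximated with Python’s `re` module over any word list. This is a sketch, not Sylvia’s implementation: the `lregex` function below is a hypothetical stand-in, and both the whole-word match and the case-insensitive comparison are assumptions based on the example output.

```python
import re


def lregex(pattern, words):
    """Return the words whose full spelling matches the pattern."""
    # Assumptions: the pattern covers the whole word and ignores case.
    compiled = re.compile(r"(?:%s)\Z" % pattern, re.IGNORECASE)
    return [word for word in words if compiled.match(word)]


# A tiny hand-picked word list for illustration.
words = ["Catnip", "Courtship", "Capture", "Cup", "Tulip"]
print(lregex("c.*(t|d).*p", words))  # ['Catnip', 'Courtship']
```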

popularity

You can ask Sylvia for the popularity of a word. This value depends on the data-source used when compiling the dictionary, but by default, it is the value in Peter Norvig’s word popularity list. Larger values indicate higher popularity (think occurrences, not rank).

Use of this command is as follows:

sylvia> popularity {word}

Examples

Find the popularity of a popular word, a typical word, and a rare word.

sylvia> popularity I
3086225277

sylvia> popularity green
108287905

sylvia> popularity teutonic
301907
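Since larger counts mean more popular words, ordering search results by popularity is a descending sort on these counts. The sketch below uses the three counts shown above. The tab-separated word and count layout, the function names, and the choice to treat unknown words as having a count of 0 are all assumptions made for this example.

```python
# Counts copied from the examples above, in an assumed "word<TAB>count" layout.
COUNTS_TEXT = "i\t3086225277\ngreen\t108287905\nteutonic\t301907\n"


def parse_counts(text):
    """Parse 'word<TAB>count' lines into a dict of lower-cased word -> int."""
    counts = {}
    for line in text.splitlines():
        word, _, count = line.partition("\t")
        if count:
            counts[word.lower()] = int(count)
    return counts


def by_popularity(words, counts):
    # Assumption: words missing from the list sort last, with a count of 0.
    return sorted(words, key=lambda w: counts.get(w.lower(), 0), reverse=True)


counts = parse_counts(COUNTS_TEXT)
print(by_popularity(["Teutonic", "Sylvia", "Green", "I"], counts))
# ['I', 'Green', 'Teutonic', 'Sylvia']
```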

Contributing and Other General Notes

For a list of known issues, feature ideas, and links to relevant research and documentation, check out the development notes!