Search pronunciations in the CMU Pronouncing Dictionary using a reglular-expression like syntax. Input-format regular expressions are lightly preprocessed into Python-format regular expressions, and then mapped over an encoded version of cmudict. Results are sorted by popularity using Peter Norvig’s list of word popularities derived from Google’s N-Gram dataset.
You can skip this and jump directly into the sylvia-mode README!
brandon@brandon-babypad-linux ~> pip2 install sylvia
Interactive Sylvia prompt:
brandon@brandon-babypad-linux ~> python2 -m sylvia Type 'help' for options, press enter to quit. sylvia>
Run one-off command:
brandon@brandon-babypad-linux ~> python2 -m sylvia -c "regex G #* AE #* IH #* %" Gravity Graphical Grandchildren Garrison Graphically Gallegos Gravitate Garretson Gastineau Gallimard Galligan Grandison Gallivan Glatfelter Garibay Garelick Garrigan Garriga Gravitates Galipeau Gavigan Gamelin Gateley Grandillo Galipault Garringer Gradison Grandchildren's Glastetter Garity Galliher Gantenbein
Sylvia’s functionality is broken down into various subcommands. These commands can be run from the interactive prompt, or as single-lines directly from your system shell.
This is the most powerful feature of Sylvia. It allows searches of cmudict based on phoneme patterns.
Sylvia’s query format is nearly identical to traditional Python 2 regular expressions, with the exception that it is intended not to match against patterns of characters, but rather patterns of phonemes. To construct a regular expression query for Sylvia, remember the following rules:
- Whitespace must be used to delimit consecutive phoneme literals. It may also be used anywhere else in the regular expression, as whitespace is meaningless in the context of a phoneme sequence, and will be stripped during preprocessing.
- `#` is a shortcut for “any consonant sound”
- `@` is a shortcut for “any vowel sound”
- `%` is a shortcut for “any syllable”, and is equivalant to `#*@#*`
- Otherwise, whatever flies with Python’s regular expression format will work in Sylvia. Just use some common sense, as some things (such as character classes) will be wholly inapplicable to searches in phoneme-space.
Use of this command is as follows:
sylvia> regex {regex tokens}
Consult Carnegie Mellon’s cmudict documentation to learn more about the phoneme set.
Consult the Python docs to learn more about Python’s regex format.
Find words starting with zero or more consonant sounds, followed by the “long E” sound (phoneme IY), followed by zero or more consonant sounds, followed by the “ed” sound (the phoneme sequence EH D):
sylvia> regex #* IY #* EH D Steelhead Seabed Beachhead Retread Behead
Find all six syllable words where the first syllable uses the “short i” sound (phoneme IH), and ends in either the D or P phonemes.
sylvia> regex #*IH%%%%%(D|P) Differentiated Individualized Deteriorated Institutionalized Incapacitated Internationalized Interrelationship Misappropriated Disassociated Discombobulated Insubstantiated
Note here that only five % symbols are needed, as a single vowel sound constitutes a single syllable, and we explicitly call out the first vowel sound via IH.
Find all words that start with the R sound, followed by some vowel, followed by the D sound, followed by another vowel, followed by the NG phoneme:
sylvia> regex R@D@NG Reading Riding Redding Raiding Ridding Reding Rodding Ruding Rawding
If you just want to lookup the pronunciations for a word, you can do that too. This can be a good way to quickly learn the phonemes for a particular sound when constructing queries. Due to cultural and geographic variations in pronunciation, this command can return multiple sequences.
Use of this command is as follows:
sylvia> lookup {word}
sylvia> lookup turkmenistan T ER K M EH N IH S T AE N
sylvia> lookup capture K AE P CH ER
sylvia> lookup tomato T AH M EY T OW T AH M AA T OW
Sylvia can act as a rhyming dictionary, returning words which rhyme with a given word. There are three “rhyme levels”, which define how rhymes are determined.
- perfect lists all words which contain the same sequence of phonemes as the given word, including and following the first vowel in the given pronunciation. Before that vowel, the matched words can contain any sounds.
- default is the same as perfect, except that additional consonant sounds can be interspersed between the matched sequence phonemes.
- loose is similar, except it ignores consonant sounds entirely.
Use of this command is as follows:
sylvia> rhyme {rhyme-level} {word}
rhyme-level can be omitted if default behavior is desired.
There are plans to improve these models by matching phonemes based on their vocal characteristics. For example, all nasal phonemes may be considered matches by default, or all plosive sounds, etc. The behavior documented above is subject to change at any time.
List words which rhyme with “chatter”, using the perfect algorithm.
sylvia> rhyme perfect chatter Matter Latter Batter Mater Platter Scatter Flatter Shatter Hatter Splatter Fatter Patter Antimatter Clatter Spatter Schlatter Blatter Natter Sater Satter Slatter Tatter Mcphatter Chitterchatter Smatter Vanatter Vannater Vatter Vannatter Mcfatter Wildcatter
…using the default algorithm…
sylvia> rhyme chatter After Chapter Matter Master Factors Factor Pattern Faster Matters Webmaster Patterns Adapter Contractor Contractors Disaster Actor Masters Latter Chapters Actors Adapters Lancaster Saturn Adaptor Pastor Thereafter Tractor Scattered Disasters Ticketmaster Napster Laughter Reactor Adaptors Baxter Stratford Blaster Lantern Bastard Maxtor Tractors Shattered Plaster Hereafter Subchapter Batter Broadcasters Antwerp Raptor Mater Platter Scatter Hamster Raster Subcontractor Reactors Pastors Subcontractors Broadcaster Mastered ... many more...
…and using the loose algorithm.
sylvia> rhyme loose chatter After Standard Password Chapter Standards Rather Matter Cancer Answer Master Transfer Answers Factors Factor Pattern Faster Matters Manner Webmaster Patterns Hampshire Adapter Contractor Banner Contractors Alexander Capture Disaster Actor Masters Traveler Latter Albert Chapters Packard Answered Scanner Bachelor Actors Transfers Adverse Amber Tracker Transferred Planner Hacker Commander Adapters Scanners Manufactured Stanford Manufacture Anchor Gathered Travelers Captured Grammar Hazard Anger Gather Lancaster Hammer Manor Programmer Hazards Bradford Madagascar Saturn Banners Passwords Adaptor Pastor Hamburg Ladder Flashers Programmers Planners Thereafter Chancellor Frankfurt Tractor Wagner Hackers Scattered Ballard Disasters Handler Chandler Sanders Ticketmaster Napster Banker Dancer Dancers Jasper Laughter Backward Panthers Captures Bladder Sampler Panther Reactor Stafford Backwards Adaptors Manufactures Glamour Baxter Stratford Blackburn Amherst Blaster Tavern Lambert Fracture ...many, many more...
Sylvia can infer the pronunciation of unknown words using it’s own rule-based text-to-phoneme engine. Don’t expect great performance though – written English is only ostensibly phonetic, and rules-based approaches are not fantastic. Any deep-learning based solution to this problem is likely to beat the snot out of Sylvia’s engine.
Use of this command is as follows:
sylvia> infer {word}
Infer a pronunciation for the word “rooster”, then compare to the value from lookup.
sylvia> infer rooster R UW S T ER sylvia> lookup rooster R UW S T ER
Infer pronunciations for some made-up words.
sylvia> infer rafloy R AE F L OY sylvia> infer rabbilt R AE B IH L T sylvia> infer fliberdoodle F L IH B ER D UW D AH L
Sylvia can lookup words based on normal regular expressions. This command doesn’t touch on anything phonetic, but may be useful in the same use-cases as Sylvia itself.
Use of this command is as follows:
sylvia> lregex {regex tokens}
Find all words which are spelled with a C at the start, a P at the end, and which contain either a T or a D.
sylvia> lregex c.*(t|d).*p Citizenship Craftsmanship Countertop Courtship Catnip Citicorp Conservatorship Catsup Crudup Catchup Colstrip Catnap Cutlip Coltharp
You can ask Sylvia for the popularity of a word. This value depends on the data-source used when compiling the dictionary, but by default, it is the value in Peter Norvig’s word popularity list. Larger values indicate higher popularity (think occurrences, not rank).
Use of this command is as follows:
sylvia> popularity {word}
Find the popularity of a popular, typical, and rare word.
sylvia> popularity I 3086225277 sylvia> popularity green 108287905 sylvia> popularity teutonic 301907
For a list of known issues feature ideas, and links to relevant research and documentation, check out the development notes!