A phonetic pattern matching library for Common Lisp.
This library is intended to replace the Sylvia library for Python.
(let
((dict (from-cmudict "./cmudict")))
(pronounce-word dict "tomato")
;; (#<PRONUNCIATION T AH M AA T OW> #<PRONUNCIATION T AH M EY T OW>)
(regex-search dict "P .* EY T OW #+")
;; Words starting with the P sound, followed by anything, and ending with the
;; EY T OW sequence, plus at least one more consonant.
;; (("potatoes" #<PRONUNCIATION P AH T EY T OW Z>)
;; ("plato's" #<PRONUNCIATION P L EY T OW Z>))
(regex-search dict "#<v,,f> %% NG")
;; Two syllable words beginning with a voiced fricative and ending with NG
;; (("zooming" #<PRONUNCIATION Z UW M IH NG>)
;; ("zapping" #<PRONUNCIATION Z AE P IH NG>)
;; ("voicing" #<PRONUNCIATION V OY S IH NG>)
;; ("vesting" #<PRONUNCIATION V EH S T IH NG>)
;; ...
(generate-regex 'rhyme (first (pronounce-word dict "roses")))
;; ".* OW Z IH Z "
(generate-regex 'rhyme (first (pronounce-word dict "roses")) :loose t)
;; ".* OW #* Z #* IH #* Z #* "
(generate-regex 'rhyme (first (pronounce-word dict "roses")) :near t)
;; ".* OW Z [ AE EH IH IY ] Z "
(the-words (find-metapattern dict 'rhyme "roses"))
;; Words that perfectly rhyme with "roses"
;; ("supposes" "roses" "rose's" "proposes" "primroses" "presupposes" "poses"
;; "overexposes" "opposes" "noses" "juxtaposes" "imposes" "hoses" "forecloses"
;; "exposes" "dozes" "disposes" "discloses" "decomposes" "composes" "closes"
;; "bulldozes")
(the-words (find-metapattern dict 'rhyme "roses" :loose t))
;; Words that loosely or perfectly rhyme with "roses"
;; ("supposes" "roses" "rose's" "rolls's" "proposes" "primroses" "presupposes"
;; "poses" "overexposes" "opposes" "noses" "juxtaposes" "joneses" "jones's"
;; "imposes" "hoses" "holmes's" "forecloses" "exposes" "dozes" "doles's"
;; "disposes" "discloses" "decomposes" "composes" "closings" "closes" "bulldozes")
(the-words (find-metapattern dict 'rhyme "roses" :loose t :near t))
;; ("supposes" "rosie's" "roses" "rosecrans" "roseanne's" "rose's" "rolls's"
;; "proposes" "primroses" "presupposes" "poses" "overexposes" "opposes" "noses"
;; "juxtaposes" "joneses" "jones's" "imposes" "hoses" "holmes's" "grozny's"
;; "forecloses" "exposes" "dozes" "doles's" "disposes" "discloses" "decomposes"
;; "composes" "closings" "closes" "bulldozes" "bozell's")
(the-words (find-metapattern dict 'consonance "babble"))
;; Words with consonance against "babble"
;; ("burble" "bubel" "bubbly" "bubble" "bobble" "biebel" "bibler" "bible" "bauble"
;; "babula" "babler" "babel" "babbler" "babble")
(the-words (find-metapattern dict 'consonance "babble" :loose t))
;; Words with a loose consonance against "babble"
;; ("obtainable" "observable" "objectionable" "butterball" "burnable" "burble"
;; "bumbly" "bumble" "bumbalough" "buildable" "bubel" "bubbly" "bubble"
;; "brumbelow" "brightbill" "brechbill" "breakable" "bramble" "brambila"
;; "brakebill" "brackbill" "botshabelo" "bookmobile" "bonnibelle" "bonnibel"
;; "bobble" "bobadilla" "bluebottle" "bluebell" "blankenbeckler" "blackball"
;; "biodegradable" "billable" "biebel" "biddable" "biblical" "bibler" "bible"
;; "berkebile" "believable" "bearably" "bearable" "beachball" "bauble"
;; "basketball" "baseball" "barboursville" "barbella" "barbell" "barbanel"
;; "barbagallo" "bankable" "babula" "babler" "babel" "babbler" "babble"
;; "abominable")
(the-words (find-metapattern dict 'assonance "ivanhoe"))
;; Words with assonance against "ivanhoe"
;; ("zydeco" "xylophone" "virazole" "styrofoam" "microscopes" "microscope"
;; "microphone" "kayapo" "ivanhoe" "ivaco" "isotopes" "isotope" "isentrope"
;; "idaho's" "idaho" "gyroscopes" "gyroscope" "dynamo" "dialtone" "diagnosed"
;; "diagnose" "cyclostomes" "cyclostome")
(the-words (find-metapattern dict 'alliteration "phone"))
;; Words with alliteration agaisnt "phone"
;; ( ... "foam" "phoning" "fairway" "philistine" ...)
(pronounce-utterance dict "Eat it!")
;; #<PRONUNCIATION IY T IH T>
(the-words (find-metapattern dict 'rhyme (pronounce-utterance dict "Eat it!") :loose t))
;; ("restricts" "restrict" "elitists" "elitist" "defeatist" "betwixt")
)
This library is in-progress, and each feature is at varying degrees of readiness.
Status Symbol | Meaning |
---|---|
💡 | Planning stage. |
⛏ | Initial groundwork started. Usually, this just means that it’s already done in Sylvia’s codebase. |
🚧 | Currently under active implementation. |
✅ | Done |
cl-phonetic can search a phonetic dictionary for words whose pronunciations match a phonetic regular expression. These regex are similar to Perl in syntax, but, have nothing to do with ASCII or Unicode character sets. Instead, the alphabet of the language tested by these regex consists only of phonemes.
Phoneme Literals
In a phonetic regex, phoneme literals are defined according to the ARPABET, as it was used by cmudict. A full list of ARPABET phoneme encodings from that link is reproduced here.
Phoneme | Example | Translation |
---|---|---|
AA | odd | AA D |
AE | at | AE T |
AH | hut | HH AH T |
AO | ought | AO T |
AW | cow | K AW |
AY | hide | HH AY D |
B | be | B IY |
CH | cheese | CH IY Z |
D | dee | D IY |
DH | thee | DH IY |
EH | Ed | EH D |
ER | hurt | HH ER T |
EY | ate | EY T |
F | fee | F IY |
G | green | G R IY N |
HH | he | HH IY |
IH | it | IH T |
IY | eat | IY T |
JH | gee | JH IY |
K | key | K IY |
L | lee | L IY |
M | me | M IY |
N | knee | N IY |
NG | ping | P IH NG |
OW | oat | OW T |
OY | toy | T OY |
P | pee | P IY |
R | read | R IY D |
S | sea | S IY |
SH | she | SH IY |
T | tea | T IY |
TH | theta | TH EY T AH |
UH | hood | HH UH D |
UW | two | T UW |
V | vee | V IY |
W | we | W IY |
Y | yield | Y IY L D |
Z | zee | Z IY |
ZH | seizure | S IY ZH ER |
When they occur in a phonetic regex, these phoneme literals should be space delimited. For example, K AE T
is a phonetic regex which matches the English word “cat”.
Since these regex are Perl-like, K AE .*
is also a valid phonetic regex, and matches words like “cat”, “Canberra”, “cathode”, etc.
Phoneme Class Expressions
cl-phonetic
further extends Perl syntax by introducing a new facility for defining classes and sequences of phonemes. To start;
#
matches any single consonant phoneme@
matches any single vowel phoneme%
matches any single syllable
Both the #
and @
class symbols may optionally accept arguments which further constrain matches. These arguments consist of comma delimited characters within angle brackets. For example, #<v,,f>
which matches only voiced, fricative consonants.
You need only supply as many arguments as desired, and can leave fields empty as needed. For example, the following class definitions are all valid, and all compile to the same phoneme sets; @
, @<>
, @<,>
, and @<,,>
.
Consonant Class Options
For consonant classes (the #<,,>
pattern), up to three arguments can be specified;
- First, a single character which can restrict matches based on voicing.
- Second, sequence of characters which restricts matches based on place of articulation.
- Third, a sequence of characters which restricts matches based on method of articulation.
When multiple characters are supplied for a single parameter, the resulting matches are a union over those characters. That is, there’s an implicit OR
over your arguments.
Consonant voicing arguments:
Character | Restricts Matches To |
---|---|
v | Voiced |
u | Unvoiced |
Consonant place-of-articulation arguments
Character | Restricts Matches To |
---|---|
a | Alveolar |
b | Bilabial |
d | Dental |
g | Glottal |
l | Labio-dental |
p | Post-alveolar |
t | Palatal |
v | Velar |
Consonant method-of-articulation arguments
Character | Restricts Matches To |
---|---|
a | Affricate |
f | Fricative |
l | Lateral |
n | Nasal |
p | Plosive |
x | Approximant |
Examples:
Phoneme Class Definition | What It Matches |
---|---|
# | All consonants |
#<,,> | All consonants |
#<v> | All voiced consonants |
#<v,,> | All voiced consonants |
#<,,p> | All plosive consonants |
#<v,,p> | All consonants which are both voiced and plosive |
#<,bd,> | All consonants which are either bilabial or dental |
#<,,fa> | All consonants which are either fricative or affricate |
#<u,bd,fa> | All consonants which are unvoiced, and also either bilabial or dental, and also either fricative or affricate |
Vowel Class Options
For vowel classes (the @<,,>
pattern), three parameters may also be specified;
- First, height
- Second, backness
- Third, roundedness
The first two of these categories are fairly fluid, and so are encoded as numbers. As with consonants, when multiple characters are supplied for a single parameter, the resulting matches are a union over those characters. That is, there’s an implicit OR
over your arguments.
Vowel height arguments:
Character | Restricts Matches To |
---|---|
1 | Open |
2 | Near Open |
3 | Open Mid |
4 | Mid |
5 | Close Mid |
6 | Near Close |
7 | Close |
Vowel backness arguments:
Character | Restricts Matches To |
---|---|
1 | Front |
2 | Central |
3 | Back |
Vowel roundedness arguments:
Character | Restricts Matches To |
---|---|
r | Rounded |
u | Unrounded |
Examples:
Phoneme Class Definition | What it Matches |
---|---|
@ | All vowels |
@<,,> | All vowels |
@<,,r> | All rounded vowels |
@<12,,u> | All vowels which are unrounded and either open or near open height. |
@<,23> | All vowels with either a central or back backness |
Diphthongs and the r-colored phoneme, for now, are excluded whenever any restrictions are applied. They will only match a plain @
, or, their associated phoneme literals.
cl-phonetic can function as a rhyming dictionary by way of phonetic metapatterns. Other literary devices, like assonance, consonance, and alliteration, can also be queried.
A phonetic metapattern is a function which transforms a pronunciation (the phoneme sequence associated with a word) into a regular expression. This resulting regular expression implements the given metapattern over the given word.
rhyme
The 'rhyme
metapattern applied to a word word
produces a regular expression which matches words that rhyme with word
. A rhyming word is defined here as any phoneme sequence whose phonemes match exactly after the first vowel phoneme. With the :loose
option, additional consonant phonemes may be interspersed.
consonance
The 'consonance
metapattern produces a regular expression which matches all words containing the same sequence of consonant phonemes as the target word. Vowel phonemes are ignored. With the :loose
option, additional consonants may be interspersed.
assonance
The 'assonance
metapattern produces a regular expression which matches all words containing the same sequence of vowel phonemes as the target word. Consonant phonemes are ignored. With the :loose
option, additional vowels may occur before or after the matched sequence.
alliteration
The 'alliteration
metapattern produces a regular expression which matches all words which begin with the same phoneme as the target word.
Arbitrary character sequence to phoneme sequence mapping. Sylvia has a quirky ruleset for this, which works fairly well. But it might be more fun to fit a transducer instead.
Allow searches to be applied in order of word popularity, and limit by either popularity threshold or total match count. Helps to prevent obscure words cluttering results.
Calculating phoneme N-grams, at the bare minimum. Basically a quick-path for processing large corpus.
Currently, only cmudict-like text files are supported.
(defparameter *dict* (from-cmudict #P"cmudict"))
*DICT*
pronounce-word
produces a list of pronunciation
objects.
Sometimes, there’s just one pronunciation in it:
(pronounce-word *dict* "creepy")
(#<PRONUNCIATION (K R IY P IY)>) T
Sometimes, there’s more:
(pronounce-word *dict* "tomato")
(#<PRONUNCIATION (T AH M AA T OW)> #<PRONUNCIATION (T AH M EY T OW)>) T
regex-search
returns an alist of words (strings) and pronunciation lists.
(regex-search *dict* "K AE T")
(("katt" #<PRONUNCIATION (K AE T)>) ("kat" #<PRONUNCIATION (K AE T)>) ("catt" #<PRONUNCIATION (K AE T)>) ("cat" #<PRONUNCIATION (K AE T)>))
the-words
takes an alist of that form and returns list a list of words.
(the-words (regex-search *dict* "K AE T"))
("katt" "kat" "catt" "cat")
The regex are generally Perl-like. Searching is done as “matches”, meaning that the word’s pronunciation must match the entire regex. Add .*
to both ends if you want a scanning behavior.
(the-words (regex-search *dict* ".* K AE T .*"))
("yekaterinburg" "wildcatting" "wildcatters" "wildcatter" "wildcats" "wildcat" "wicat" "tomcat" "thundercats" "thundercat" "scattershot" "scattering" "scattergory" "scattergories" "scattergood" "scattered" "scatter" "scatology" "scatological" "scat" "pussycats" "pussycat" "polecats" "polecat" "piscataway" "muscat" "metlakatla" "mchatton" "mcatee" "kotsonis's" "kotsonis'" "kotsonis" "kitcat" "kikatte" "katzman" "katzin" "katzer" "katzenstein" "katzenberger" "katzenberg's" "katzenberg" "katzenbach" "katzen" "katz" "kattner" "katt" "katsushi" "katsaros" "katsanos" "kats" "katmandu" "katashiba" "kat" "copycatting" "copycats" "copycat" "concatenation" "concatenating" "concatenates" "concatenated" "concatenate" "catwoman" "catwalk" "catty" "catton" "catto" "cattlemen's" "cattlemen" "cattle" "catterton" "catterson" "catterall" "cattanach" "catt" "catskills" "catskill" "cats" "catron" "catrett" "catrambone" "caton" "catoe" "catnip" "catnap" "catlin" "catlike" "catlett" "catledge" "catkins" "catfish" "caterwaul" "caterpiller's" "caterpiller" "caterpillars" "caterpillar's" "caterpillar" "category" "categorizing" "categorizes" "categorized" "categorize" "categorization" "categories" "categorically" "categorical" "catechism" "catcalls" "catcall" "catbird" "catatonic" "catastrophic" "cataracts" "cataract" "catapults" "catapulting" "catapulted" "catapult" "catamount" "catalyzed" "catalyze" "catalytic" "catalysts" "catalyst's" "catalyst" "catalonian" "catalonia" "cataloguing" "catalogues" "catalogued" "catalogue" "catalogs" "cataloging" "catalogers" "cataloger" "cataloged" "catalog" "catalina" "catalans" "catalan" "catala" "catain" "catacombs" "catacomb" "cataclysmic" "cataclysm" "cat-o-nine-tails" "cat-6" "cat-4" "cat-3" "cat-2" "cat-1" "cat's" "cat" "bobcats" "bobcat" "bacot")
Again, anything that works with Perl should work here. .?
translates to “optionally, a single phoneme of any kind”.
(the-words (regex-search *dict* ".? AE T"))
("vat" "that" "tat" "shatt" "schadt" "sat" "ratte" "rat" "patt" "pat" "nat" "matte" "matt" "mat" "lat" "katt" "kat" "jagt" "hatt" "hat" "gnat" "gatt" "gat" "fat" "dat" "chat" "catt" "cat" "bhatt" "batte" "batt" "bat" "at")
And so on.
Consonants are encoded with #
symbols.
(the-words (regex-search *dict* "# AE T"))
("vat" "that" "tat" "shatt" "schadt" "sat" "ratte" "rat" "patt" "pat" "nat" "matte" "matt" "mat" "lat" "katt" "kat" "jagt" "hatt" "hat" "gnat" "gatt" "gat" "fat" "dat" "chat" "catt" "cat" "bhatt" "batte" "batt" "bat")
They can be further restricted by voicing, place of articulation, and manner of articulation.
For example, here are the words ending with “AE T” that begin with a voiced, fricative consonant:
(the-words (regex-search *dict* "#<v,,f> AE T"))
("vat" "that")
And the words ending with “AE T” that begin with a bilabial, plosive consonant:
(the-words (regex-search *dict* "#<,b,p> AE T"))
("patt" "pat" "bhatt" "batte" "batt" "bat")
And the words ending with “AE T” that begin with a bilabial or labio-dental consonant:
(the-words (regex-search *dict* "#<,bl,> AE T"))
("vat" "patt" "pat" "matte" "matt" "mat" "fat" "bhatt" "batte" "batt" "bat")
All single syllable words beginning with a “B” phoneme, a single vowel, and a “D”.
(the-words (regex-search *dict* "B @ D"))
("byrd" "burd" "budde" "budd" "bud" "boyde" "boyd" "bowed" "booed" "bode" "bird" "bide" "bid" "beede" "bede" "bed" "bead" "bayed" "bawd" "baud" "bade" "bad" "baade")
The previous expression, restricted to vowels with a height between open and mid, inclusive.
(the-words (regex-search *dict* "B @<1234,,> D"))
("budde" "budd" "bud" "bed" "bawd" "baud" "bad" "baade")
generate-regex
creates a phonetic regular expression from a predefined metapattern and a word.
(generate-regex 'rhyme (first (pronounce-word *dict* "Candor")))
.* AE N D ER
Searching for this regex yields words that perfectly rhyme with “Candor”.
(the-words (regex-search *dict*
(generate-regex 'rhyme
(first (pronounce-word *dict* "Candor")))))
("zander" "wicklander" "vandevander" "vander" "telander" "swartzlander" "subcommander" "standre" "stander" "stadtlander" "slander" "skenandore" "sjolander" "scalamandre" "santander" "sandor" "sander" "salamander" "rosander" "rander" "philander" "pander" "oleander" "nederlander" "meander" "mcalexander" "mander" "mainlander" "lysander" "leander" "landor" "lander" "highlander" "hander" "grander" "glander" "gerrymander" "gander" "evander" "coriander" "commander" "candor" "calamander" "bystander" "brander" "blander" "bander" "aulander" "ander" "alexander" "aleksandr" "aleksander")
But, if all you’re going to do is search for the generated regex, just use find-metapattern
…
find-metapattern
wraps the process of generating a regular expression & searching it:
(the-words (find-metapattern *dict* 'rhyme "Turkey" :loose t))
("yerkey" "yerkes" "yaworski" "xerxes" "workweeks" "workweek" "worksheets" "worksheet" "tyburski" "twersky" "turski" "turnkey" "turkeys" "turkey's" "turkey" "swirsky" "swiderski" "sturkie" "stachurski" "sircy" "shirkey" "quirky" "purkey" "podgurski" "pirkey" "persky" "perky" "perkey" "pearcy" "murky" "mirsky" "merkley" "merkey" "kuberski" "koperski" "kirksey" "kirkley" "kirkey" "kirkby" "kasperski" "jerky" "hirschfield" "gursky" "gurski" "girsky" "gerski" "gerke" "figurski" "dworsky" "durkee" "burkley" "burkey" "burkeen" "birky" "birkey" "bertke" "berkley" "berklee" "berkey" "berkeley's" "berkeley" "anarchy" "aldercy" "albuquerque")
test-metapattern
just tests whether a metapattern holds over two words.
Here, it does;
(test-metapattern *dict* 'alliteration "Xenon" "Czar")
(("Czar" #<PRONUNCIATION (Z AA R)>))
And here, it does not;
(test-metapattern *dict* 'rhyme "Wallet" "Stanford")
NIL