
Generate pointed Hebrew from ascii characters

Primary language: Python. License: Mozilla Public License 2.0 (MPL-2.0).

lazydots

A library and utility to generate vocalized Hebrew from ascii. It's sort of just a fun demo of what is possible with the deromanize library with minimal effort.

This software tries to be smarty-pants, which really means it is error-prone, so please report bugs!

You can test out a web implementation here.

ascii hebrew ascii hebrew ascii hebrew
' א H ח ` ע
b ב ch ח p פ
v ב T ט f פ
g ג y י q ק
gh ג k כ r ר
d ד kh כ ts צ
dh ד l ל S שׂ
h ה m מ sh שׁ
w ו n נ t ת
z ז s ס th ת

The consonants are pretty straightforward. As for the BeGaD KeFaT letters, the program should do the right thing with dagesh most of the time, but if it gives you a dagesh somewhere you don't want one, you can use an "explicitly aspirated" form: v and f for ב and פ respectively, and an h added to the end of g, d, k and t. The main time you should need this is when you need an aspirated consonant at the beginning of a word, for example when it follows a conjunctive accent or whatever.

If you want the two characters of such a digraph separately, you can always split them with a single quote, e.g. t'h = תְה. This goes for the other digraphs as well, like sh and ts.
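The digraph handling described above amounts to preferring the longest key at each position. Here is a minimal sketch of that idea. It is not lazydots' actual code (which is built on deromanize); the names CONSONANT_KEYS and tokenize are made up for illustration, and it only splits vowel-free input into keys without producing any Hebrew. How lazydots then treats the quote in t'h (as a pure separator rather than an aleph) is not modeled here.

```python
# Consonant keys copied from the table above (illustrative only;
# this is not how lazydots or deromanize store them).
CONSONANT_KEYS = {
    "'", "`", "b", "v", "g", "gh", "d", "dh", "h", "w", "z",
    "H", "ch", "T", "y", "k", "kh", "l", "m", "n", "s",
    "p", "f", "q", "r", "ts", "S", "sh", "t", "th",
}
MAX_KEY_LEN = max(len(key) for key in CONSONANT_KEYS)


def tokenize(word):
    """Split a vowel-free ascii string into consonant keys, longest match first."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible key first so "th" beats "t" + "h".
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in CONSONANT_KEYS:
                tokens.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no key matches at {word[i:]!r}")
    return tokens


print(tokenize("th"))   # ['th']
print(tokenize("t'h"))  # ['t', "'", 'h']
```

The quote breaks up the digraph simply because "t'" is not a key, so the longest match at that position is the single t.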

Conversely, if you're not getting a dagesh where you want it, you can mark it explicitly with a period. You should never really need to do this. Geminated consonants and BeGaD KeFaT letters at the beginnings of words will automatically get their dagesh when it is appropriate. Still, if something goes wonky, you can get what you need (and you will sometimes need to tell it about mappiq).

Vowels

ascii hebrew ascii hebrew
: ְ a ַ or ָ
i ִ or ִי o ָ or וֹ
e ֶ u וּ or ֻ
E ֵ    

In addition to the above, hataph segol, hataph petach and hataph qamats can be generated by adding a colon after /e/, /a/ and /o/ respectively, i.e. e:, a: and o:.

The rules for vowels look a bit strange written out, but they are pretty easy to use and are designed to minimize human error. They are based on norms with open and closed syllables. Also note that silent schwa is not a vowel and will be supplied automatically.

An a will be petach in a closed syllable and qamats in an open syllable. A is the reverse of that. The exception is the ayi cluster, which is always petach-yod-hiriq because of words like מים, בית and anything with a dual ending. This is usually what you want, but you will have to make exceptions for segolates with medial gutturals like פַּחַד, which requires you to write pAchad, or those occupational words like גַּנָּב, which will be gannAv. There are, of course, other places that will need to be amended with a capital A.

For lazydots to do its job, you still have to know where things like vocal schwas and doubled letters are, as with בְּבַקָּשָׁה, which is b:vaqqashah, or קָֽטְלָה, qaT:lah. Note that in this last case the presence of vocal schwa also triggers the addition of meteg, so the qamats gadol can be distinguished from qamats qatan (cf. חָכְמָה, which we spell chokmah).

i, o and u behave similarly: i is hiriq in closed syllables and hiriq-yod in open ones, o is qamats qatan in closed and cholem in open, and u is qibbuts in closed and shureq in open, except in the final syllable of a word. In a final syllable they will always be long vowels unless capitalized (with the exception of ayi, as mentioned above). Again, this is what you want with most words, but there are exceptions, like עִם, `Im; תִּשְׁמֹרְנָה, tishmOrnah; וַיָּ֫קָם, wayy<aqOm; etc.

e doesn't get any special "magical" resolution. It's always segol, and E is always tsere. This is because, on the whole, unmarked segol is much more common than unmarked tsere, even in open syllables. The exception is if e precedes a silent aleph, where it will become a tsere unless you tell it not to (with E).

General rule: if it's giving you the wrong vowel, capitalization should fix it.
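To make the capitalization rule concrete, here is a rough sketch of the decision as described in the prose above. It is my reading of the README, not lazydots' implementation: the function resolve_vowel and its closed/final parameters are hypothetical, it returns vowel names rather than Hebrew points, and it ignores the ayi cluster and the e-before-silent-aleph case. It also assumes the final-syllable rule applies to i, o and u but not to a, which is what the gannAv example suggests.

```python
# (short, long) pairs, named as in the README's prose.
PAIRS = {
    "a": ("petach", "qamats"),
    "i": ("hiriq", "hiriq-yod"),
    "o": ("qamats qatan", "cholem"),
    "u": ("qibbuts", "shureq"),
}


def resolve_vowel(letter, closed, final=False):
    """Pick a vowel name for one ascii vowel in one syllable.

    closed: True if the syllable ends in a consonant.
    final:  True if it is the last syllable of the word.
    """
    # e and E never depend on the syllable.
    if letter == "e":
        return "segol"
    if letter == "E":
        return "tsere"
    short, long_ = PAIRS[letter.lower()]
    # Default: short in closed syllables, long in open ones.
    use_long = not closed
    # i, o and u are also long in a final syllable (assumed not to apply to a).
    if final and letter.lower() != "a":
        use_long = True
    # A capital letter flips whatever the default would have been.
    if letter.isupper():
        use_long = not use_long
    return long_ if use_long else short


print(resolve_vowel("A", closed=False))             # petach  (pAchad)
print(resolve_vowel("A", closed=True, final=True))  # qamats  (gannAv)
print(resolve_vowel("I", closed=True, final=True))  # hiriq   (`Im)
print(resolve_vowel("O", closed=True))              # cholem  (tishmOrnah)
```

Working out which syllables are open or closed is the hard part, and it is left out of this sketch entirely.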

Marked vowels: Marked u and i are determined automatically by vowel length. Likewise, e and o will be marked when they occur at the end of a word. Elsewhere, tsere-yod is marked with ei and holem vav is marked with o. (an o followed by a period). I realize this is unconventional, but I found that I hated looking at it less than I hated looking at ow.

Other marks

< ֫
^ ֑

Support for accents only extends to these two at the moment and is experimental. It may screw up the above-mentioned rules.

lazydots comes with a CLI utility called lzd. You can give it strings you want to convert as args. If you don't do that, it will read from stdin. Output is sent to stdout.

$ lzd "lamah 'attah hitnahagta kmo nudniq"
לָמָה אַתָּה הִתְנָהַגְתָּ כְּמוֹ נֻדְנִיק

Very fancy. I use this in a little script that I bind to a key so I can select text and have it replaced with Hebrew when I hit the binding:

#!/bin/sh
sleep .1
xdotool key --clearmodifiers ctrl+c
xclip -o -selection clipboard | lzd | xclip -selection clipboard
xdotool key --clearmodifiers ctrl+v

This works on Linux with X11. Details may vary on other systems.
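On systems without xdotool and xclip, the same stdin-to-stdout behavior can be driven from a script. The sketch below assumes only what is described above (lzd reads stdin when given no arguments and writes to stdout); the helper name lzd_convert is made up, and it returns None when lzd is not on the PATH.

```python
import shutil
import subprocess


def lzd_convert(text):
    """Pipe text through the lzd command; return None if lzd is not installed."""
    if shutil.which("lzd") is None:
        return None
    result = subprocess.run(
        ["lzd"],
        input=text,
        capture_output=True,
        encoding="utf-8",  # decode the Hebrew output explicitly as UTF-8
        check=True,        # raise if lzd exits with an error
    )
    return result.stdout


converted = lzd_convert("lamah 'attah hitnahagta kmo nudniq")
print(converted if converted is not None else "lzd is not installed")
```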

The lzd command also has one flag: -n/--normalize. This will output the canonical normalized form. At the moment, by default, it outputs the form that looks the best with my fonts.
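For background on why the order of points matters: Unicode canonical ordering sorts combining marks by their combining class, so a dagesh typed before a vowel ends up after it once normalized. The snippet below only illustrates that with Python's standard unicodedata module. The README does not say which normalization form -n produces or what order the default output uses, so the choice of NFD here is an assumption for demonstration.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05d1", "\u05bc", "\u05b7"

# Dagesh first, then the vowel: one possible "typing" order.
typed = BET + DAGESH + PATAH
normalized = unicodedata.normalize("NFD", typed)

# Patah (combining class 17) is sorted before dagesh (combining class 21).
print([unicodedata.name(ch) for ch in normalized])
# ['HEBREW LETTER BET', 'HEBREW POINT PATAH', 'HEBREW POINT DAGESH OR MAPIQ']
print(typed == normalized)  # False
```

Both strings should render the same in a well-behaved font, but they compare unequal, which matters if the output is going to be searched or diffed.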

Please report bugs!

You can also use lazydots as a library for your stupid website or wherever you want it. I may eventually try to build an IBus engine with it (don't hold your breath).

Basically, you do this:

>>> import lazydots
>>> lazydots.make_pointy_text("eizeh TippEsh 'attah")
"אֵיזֶה טִפֵּשׁ אַתָּה"

You can also use make_pointy_line if you want to go line by line, or make_pointy if you want to go word by word. You can always use make_pointy_text, but it might be ever so slightly more efficient to use the other functions in certain cases.
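One case where going line by line might make sense is streaming a large file without holding it all in memory. The sketch below shows the shape of that loop. The helper convert_lines is hypothetical, and the assumption that make_pointy_line takes one string and returns one string is mine, not something stated above, so a stand-in converter is used to keep the example runnable without lazydots installed.

```python
def convert_lines(lines, convert_line):
    """Lazily apply a line converter to an iterable of lines."""
    for line in lines:
        # Strip the trailing newline so the converter sees only the text.
        yield convert_line(line.rstrip("\n"))


# With lazydots installed, convert_line would presumably be
# lazydots.make_pointy_line (signature assumed: str -> str).
# str.upper is only a stand-in so this runs on its own.
for converted in convert_lines(["shalom\n", "lamah\n"], str.upper):
    print(converted)
```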

Please report bugs!