Translation of Numerals
arunvickram opened this issue · 10 comments
This is gonna be a very silly question, but I was looking around this repository as well as the SILE repo and the Fluent standard. I was wondering how I could leverage this library to optionally translate numerals, either by contributing to it or through some other method. For example, what I want to try and do is map 0123456789 to ०१२३४५६७८९ for Devanagari.
From what I could gather, that sort of functionality isn't in the Fluent spec, but there could be some way of baking that functionality in either by tweaking Fluent itself if it isn't there, or adding the custom function feature set into this repository. Let me know what your thoughts are!
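For concreteness, the digit mapping described above can be sketched in a few lines. This is only an illustration in Python, not anything fluent-lua or the Fluent spec provides; the function name to_devanagari_digits and the enabled toggle are hypothetical, and the same lookup-table approach should carry over to a Lua table.

```python
# Hypothetical helper; not part of fluent-lua, SILE, or the Fluent spec.
DEVANAGARI_ZERO = 0x0966  # U+0966 DEVANAGARI DIGIT ZERO

# Translation table: ASCII digit code point -> Devanagari digit character.
# The ten Devanagari digits are contiguous, U+0966 through U+096F.
_TO_DEVANAGARI = {ord("0") + i: chr(DEVANAGARI_ZERO + i) for i in range(10)}


def to_devanagari_digits(text: str, enabled: bool = True) -> str:
    """Replace ASCII digits with Devanagari digits, leaving other text alone."""
    if not enabled:
        # The toggle lets a writer or publisher keep the original numerals.
        return text
    return text.translate(_TO_DEVANAGARI)
```

Note that this only swaps digit glyphs. Locale-specific number formatting such as grouping and decimal separators is a separate concern, and is the kind of thing ICU and CLDR data cover.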
At first blush I have to ask, isn't localized number formatting more a job for ICU? Perhaps I'm confused about your use case. Could you sketch out the actual source, output, and workflow you have in mind?
Also is information about the digit system(s) you want to use in CLDR and are you trying to use them in a language that matches that locale data? c.f. sile-typesetter/sile#1248
Functions are one part of the Fluent spec I have not yet implemented here, but that's mainly because I haven't run into a use case for one, not because they would be hard. Lua is well suited to that part and they shouldn't be too hard to add.
Also just to throw this out there, I've been aware for a while that there is not, but should be, a full set of ICU bindings for Lua. For the SILE project we've created a shim module that surfaces some ICU functions we need to Lua, but we've only added them on an as-needed basis. I've thought before of just spinning up an icu-lua project that provides full bindings from ICU. If that's where this issue goes (and if that's what is needed to bring this Fluent implementation fully up to spec) then I'll probably dust off that idea and just do it. Many projects would stand to benefit.
Edit: I fired up icu-lua to start playing around with various automatic binding generation systems. If there is a way to get this done without hand writing stub functions for the entire API that would be lovely...
The use case is this. Long story short, I'm basically creating a sort of publishing workflow for authors who speak the numerous South Asian languages to be able to more easily publish PDF documents. The unique thing about this is that I specifically focused on the indigenous languages of the subcontinent like the Ho language, Sora language, and Gondi languages, all replete with their own scripts. I basically made it so that they would at least be able to write something in their own language. Ultimately though the goal is to make it easier for people across the world to write and publish in their own language.
This project uses Pandoc and LuaLaTeX (and Babel) to enable people to simply drop in a Markdown file and generate a document in the language of their choice with minimal setup (side note: I also included the fonts in the repository itself so that they don't have to worry about installing them either).
I came across SILE a couple of days ago and I was going through some of the documentation as well as the fluent-lua repository and thought that some of the stuff that I saw would be a good fit because SILE and/or fluent-lua seems to address a couple of problems that LuaLaTeX and Babel have. For example, even though Babel's language system can be extended by placing .ini files in the LaTeX babel package config directory, you can't really handle the grammatical nuances of certain languages (especially non-VO languages) because the ini files are effectively a simple KV map. On top of that SILE seems a bit more accessible w.r.t. the layout engine and I believe it makes it easier to typeset things like newsletters, plays, and other stuff as well, but I could be wrong.
Even if I don't end up necessarily integrating SILE with this project, I still think having fluent-lua packaged as a LuaLaTeX package would be pretty useful nonetheless.
A couple of thoughts in no particular order:
- I have lots of time and code invested in LuaLaTeX (and XeLaTeX). I've written more lines of LaTeX code than the entire SILE project code. In other words you can write an entire layout engine in less code than I've had to write in .sty files trying to retrofit my needs into LaTeX. I say that not so much to boast, just to confirm that I don't hate LaTeX because I don't understand it (which would be understandable because it can be pretty obtuse). I discovered SILE about 7 years ago when it was in a lot rougher shape and it felt like freedom after fighting with LaTeX so much.
- About 95% of my work has been in Turkish and a long tail of other languages, not English. As you note, Babel / Polyglossia are not sufficient localization systems for all languages because they are inherently key/value maps with some string formatting worked in. That's just not sufficient for real localization and never will be. The only thing you can do with Babel / Polyglossia is hack around it by avoiding complex cases and putting the burden on translators to fudge without the tools to get the job done right.
- I have no interest in developing anything for LuaLaTeX at this point. I still have a few projects stuck in it but am very invested in SILE going forward and won't be going back. I think there is still plenty of space for it in the market, I just don't want to invest any more of my time into the TeX ecosystem. I'm perfectly happy for others to do so. I wrote fluent-lua as a stand-alone module rather than baking the functionality into SILE because I figured it would eventually be useful for other projects. I hope it provides a robust implementation of the Fluent spec in an idiomatic way that is easy to use from Lua. If somebody wants to come up with a LuaLaTeX package that wraps fluent-lua and exposes Fluent functionality to the LuaLaTeX internals that would be awesome. I'll support any changes that need to be made to fluent-lua itself to be robustly useful as a stand-alone library. The TeX end of things will be somebody else's job ;-)
- Your pandoc-spiceland project is interesting, and I support the goal of providing good typesetting tools and publishing workflows to minority languages. That being said it honestly looks to me like you are in the early stages of re-inventing CaSILE. Some of what you have there looks a lot like where I started. Of course I've since headed in a bit different direction with different assumptions about inputs and goals, but I do have to wonder whether it might work for your goals to contribute the related localizations to CaSILE and then just have template documents for each language to get people started similarly to what you have done.
- In all of this I still don't see where exactly you need to cast Arabic numerals to Devanagari equivalents or where that would end up in documents or localization data. Where does that need come into play?
So I'll answer your questions in no particular order as well 😄
- I don't necessarily need to convert Arabic numerals to Devanagari or various other numerals, but I'd like to have the option for writers and publishers to toggle it whenever they see fit. At the end of the day, it's ultimately not up to me to decide what writers and publishers of those languages want to do, but I think having the ability to toggle it is important.
- I figured you might have been confronted by some of the same frustrations that I had fighting TeX, and for the record I wasn't expecting you to try and deal with TeX after all of that 😆 The reason I mentioned potentially writing a package as an alternative was in case I decided not to integrate pandoc-spiceland with SILE. For right now, I have a decent setup going for writing articles and larger texts (not necessarily books though) in those particular languages with LaTeX. But I am more than willing to help contribute to the fluent-lua functionality because I ultimately think it's a far more robust solution than Babel. At the moment I haven't fully decided what direction to take pandoc-spiceland in quite yet.
- I did eventually come across CaSILE and I did note the massive similarities between our workflows. As for the actual translations of many of these languages, I only fluently speak one of those South Asian languages, Tamil, and so vis-à-vis contributing back to the translations my capabilities are limited. But I'm definitely going to play around with CaSILE and see how I feel about it. There's also still a lot up in the air about how I want to carry pandoc-spiceland forward.
- I thought about writing my own layout engine just as a thought experiment as an alternative to TeX and I wanted to get your experiences on that.
Also, I'll check out the ICU C/C++ library and play around with the repo that you've set up.
Probably the one downside about CaSILE for you is I have (to date) baked in the assumption that the project sources are tracked in Git and that the main thing in the project is the document/book. It does not currently play nicely with "loose" source files not tracked in a repository. It's been moving away from that assumption and I'd be happy to consider freeing it up so that it could be invoked on a single markdown source with extra arguments for the toolkit to look for (instead of assuming one in the project), but there hasn't been a need for that workflow yet.
I could easily see retooling a little bit so your tooling was (in CaSILE terms) a publisher toolkit with a default.sile instead of your default.latex, and a Lua module providing a default layout in a class plus any relevant utility functions for your use cases. I would stuff all the fonts in a single location and let fontconfig find them (using a path your toolkit injects), then use the Lua module to change the default font based on the language, etc.
How available is Docker vs. "babashka + texlive + pandoc" for your target audience?
Docker is likely to be the more prominent tool if I'm being candid. I have thought about wrapping pandoc-spiceland in a Docker container myself, but I haven't gotten around to it quite yet.
I'm not entirely sure I'd want the font to be controlled 100% by Lua, though. For bigger languages with a wider variety of fonts it might befit the publisher or writer to be able to more easily toggle that, and I'm essentially trying to streamline this process as much as possible.
I'm also currently working on a Markdown editor using Tauri and React.js that hopefully aims to wrap some of this functionality within the editor itself.
I was suggesting you set up a document class in Lua that sets default fonts and handles any unique expectations of your documents. That doesn't mean your end users would have to switch to Lua to control fonts. You could easily make them part of the document metadata in a YAML head block like you are already doing and have the document class look for those keys and change settings as appropriate.
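The lookup described above (an explicit key in the document metadata wins, otherwise fall back to a per-language default) might look roughly like the sketch below. It is written in Python purely to show the logic; an actual SILE document class would be Lua. The metadata keys lang and mainfont, the font names, and the resolve_font function are all assumptions for illustration, not anything CaSILE or pandoc-spiceland is known to define.

```python
# Hypothetical per-language defaults; the font names are illustrative only.
DEFAULT_FONTS = {
    "hi": "Noto Serif Devanagari",
    "ta": "Noto Serif Tamil",
}
FALLBACK_FONT = "Noto Serif"


def resolve_font(metadata: dict) -> str:
    """Pick a font: an explicit metadata key wins, else a language default."""
    explicit = metadata.get("mainfont")  # assumed key in the YAML head block
    if explicit:
        return explicit
    # Reduce a tag such as "hi-IN" to its primary language subtag "hi".
    lang = str(metadata.get("lang", "")).split("-")[0].lower()
    return DEFAULT_FONTS.get(lang, FALLBACK_FONT)
```

With this shape, a writer who sets nothing gets a sensible default for their language, while a publisher can still override the font from the document header without touching the class code.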
Edited: Okay, I've played around with CaSILE for a little bit and I'll open an issue there regarding some of the problems I'm having with it.