Right-to-Left (RTL) support for Hebrew and Arabic
Opened this issue · 60 comments
Please add Right-to-Left (RTL) support for languages like Hebrew and Arabic...
Something like:
doc.rtl(true);
doc.text('...', {rtl: true});
I don't know much about RTL languages, but it seems to me that you could reverse the string and align to the right to get this working (I'm probably wrong here). However, if the text contains a combination of LTR and RTL text, then we'll need an implementation of the Unicode Bidi Algorithm. Those who know more than I do, please fill me in. I'd love to see this implemented so PDFKit is more widely usable.
A separate issue is vertical text support (e.g. Japanese), which I'd also like to see and which has its own challenges.
Arabic also has its own challenges because letters get a different shape depending on their position in a word (beginning, middle, end) so this is anything but easy :)
Interesting, I assume there is some sort of algorithm out there to determine this? Starting to sound like a lot of work.
I'm just parachuting in, but isn't this something like what you need:
https://github.com/mathiasbynens/node-unicode-data
Why re-implement unicode algorithms?
EDIT: Wait a moment, this seems to be quite far from what's needed, my bad. But isn't there a ready implementation?
No, that's just unicode character metadata, not any actual algorithms. RTL support will require an implementation of the Unicode Bidi Algorithm. Shaping of Arabic text with contextual substitutions is a separate problem to solve.
Yeah, this library from Twitter might work but I haven't tried it.
I'll go ahead and fork pdfkit, and see what I can come up with. Any pointers for where to start and how you'd approach it?
I'd try Twitter's library and see if it produces the results you expect. Sorry for being so ignorant on this, but does it work to run the text through that library, then send the result to the PDFKit doc.text method?
I found something that might be even more to the point:
https://github.com/cscott/node-icu-bidi
Yeah, the problem is that node-icu-bidi is a node C++ module, but PDFKit also works in the browser, so everything must be pure JavaScript. If it works for your needs, feel free to use it, but PDFKit won't take on a non-JS dependency.
I understand, so an acceptable solution would be to extract the BIDI algorithm from the twitter library, correct?
On Tue, Aug 5, 2014 at 8:00 PM, Devon Govett notifications@github.com wrote:
> Yeah, the problem is that node-icu-bidi is a node C++ module, but PDFKit also works in the browser, so everything must be pure JavaScript. If it works for your needs, feel free to use it, but PDFKit won't take on a non-JS dependency.
Maybe I'm late to the party; I just wanted to mention that I implemented a similar solution long ago in DOS (with old-fashioned 16x16 bitmap fonts), and I think the same approach can be applied here:
1- reorder the input string using the Bidi algorithm
2- reshape by applying single glyph substitution depending on the context (beginning, middle, end of the word, or standalone glyph)
3- ligatures
4- invert the alignment (possibly using an RTL flag); if this is supported, then a more appropriate naming of the alignment options would be leading/trailing instead of right/left
1 and 4 are the 'easy parts'; 2 and 3 are another story. For OpenType fonts I think there is a GSUB table that can be used for this, but for other font types I think the only option is to implement the specific algorithm for each script (as you said, this is a lot of work).
It seems another solution to Arabic shaping is 'text-based shaping', which transforms the characters at the string level rather than at the glyph level (further details are there). And it seems there is already an implementation of this kind in JavaScript by the ibm-js team. From the sources it appears that the text engine performs a bunch of operations at the character level:
1- Bidi reordering
2- Text shaping (AFAIK applies only to Arabic scripts)
3- Symmetrical swapping (replace [(... with their symmetrical RTL counterparts )]...)
4- Number shaping (replace 'Western Arabic' numerals 0, 1, 2 ... with their Eastern Arabic counterparts ٠, ١, ٢ ...)
This could also be a fallback for non-OpenType fonts, which don't have a GSUB table.
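To make steps 3 and 4 of that list concrete, here is a minimal sketch of number shaping and symmetrical swapping at the string level. The helper names are made up for illustration and are not part of PDFKit or of the ibm-js code; the swap is applied unconditionally, whereas a real bidi implementation only mirrors characters that end up in a right-to-left run.

```javascript
// Hypothetical helpers for illustration only (not from PDFKit or ibm-js).

// Number shaping: map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
function toEasternArabicDigits(text) {
  return text.replace(/[0-9]/g, (d) =>
    String.fromCharCode(0x0660 + (d.charCodeAt(0) - 0x30))
  );
}

// Symmetrical swapping: exchange each bracket for its mirrored partner.
const MIRROR = {
  "(": ")", ")": "(",
  "[": "]", "]": "[",
  "{": "}", "}": "{",
  "<": ">", ">": "<",
};

// Assumption: the caller only passes text that belongs to an RTL run.
function swapMirroredBrackets(text) {
  return text.replace(/[()[\]{}<>]/g, (ch) => MIRROR[ch]);
}
```

Both functions only touch the characters they target, so they could in principle be chained before or after whatever reordering step is used.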
If you prioritize Bidi reordering, and symmetrical swapping, it's enough for Hebrew support.
While technically Hebrew has characters that look different when they're in the end of the word, you shouldn't care about it because unicode defines them as separate characters.
Text & number shaping can be added later for Arabic support.
I found the following infos related to this topic. Python Arabic Reshaper is a library which can be used in cases when native Arabic support is not available. The readme contains a good explanation of the issue and the solution. This library has been ported to Javascript.
On the BIDI topic I found this test program written in Javascript.
There are GSUB (Glyph substitution) tables in font files for Complex languages.
This link explains those tables with example.
https://www.microsoft.com/typography/otfntdev/arabicot/features.aspx
PDFKIT still has a problem with RTL
any updates? @devongovett
Hi! @devongovett any update on RTL support? Question 2, is this project dead?
Please don't be dead :(
pdfkit has more functionality than jsPDF. jsPDF doesn't have full unicode support but pdfkit does. The project and its committers deserve the praise. For RTL, right-aligned text works very well. However, when we want to use columns, things change. The need is just to start from right-most column through the left most column. @devongovett we don't need anything except this I think because the RTL text has its RTL way, no need to reverse the strings. (same for LTR inside RTL)
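A rough sketch of the column idea above: compute the x origin of each column so that the first column sits at the right edge and later columns move leftwards. The function name and its parameters are hypothetical; it assumes you lay out each column yourself rather than relying on PDFKit's built-in column flow.

```javascript
// Hypothetical helper: returns the width of one column and the x origin of
// each column in reading order. With rtl = true the first column is rightmost.
function columnOrigins(left, totalWidth, columns, gap, rtl) {
  const colWidth = (totalWidth - gap * (columns - 1)) / columns;
  const xs = [];
  for (let i = 0; i < columns; i++) {
    // Slot 0 is the leftmost position on the page.
    const slot = rtl ? columns - 1 - i : i;
    xs.push(left + slot * (colWidth + gap));
  }
  return { colWidth, xs };
}
```

Each column's text could then be drawn with something like doc.text(columnText, xs[i], y, { width: colWidth, align: "right" }), assuming the text has already been split per column.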
I was able to manually handle Hebrew with code like:
npm install twitter_cldr
import * as TwitterCldrLoader from "twitter_cldr";
const TwitterCldr = TwitterCldrLoader.load("en");
class ... {
  private isHebrew(text: string) {
    var position = text.search(/[\u0590-\u05FF]/);
    return position >= 0;
  }
  private maybeRtlize(text: string) {
    if (this.isHebrew(text)) {
      var bidiText = TwitterCldr.Bidi.from_string(text, { direction: "RTL" });
      bidiText.reorder_visually();
      return bidiText.toString();
    } else {
      return text;
    }
  }
}
Just pass all text that may be in Hebrew through the maybeRtlize function.
It's not perfect and I only tested it for Hebrew, but it seems to work pretty well. If you also need right alignment, use something like isHebrew(myText) ? { align: "right" } : null for alignment.
The problem is that if the text wraps onto multiple lines, the first word of the text will be on the last line, which is wrong. There needs to be more logic added to handle line breaks.
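One way to deal with the wrapping problem described above is to break the text into lines while it is still in logical order, and only then reorder each line on its own. The sketch below assumes the caller supplies both a measure function and a reorderLine function (for example a wrapper around a bidi library); wrapThenReorder itself is a made-up name, not a PDFKit API.

```javascript
// Hypothetical helper: greedy word wrap in logical order, then per-line reordering.
// measure(str) returns the rendered width of str; reorderLine(str) returns the
// visually ordered version of one line. Both are provided by the caller.
function wrapThenReorder(text, maxWidth, measure, reorderLine) {
  const lines = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? current + " " + word : word;
    if (current && measure(candidate) > maxWidth) {
      // The candidate no longer fits: close the current line, start a new one.
      lines.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  // Reorder only after line breaks are fixed, so the first logical word
  // stays on the first line.
  return lines.map(reorderLine);
}
```

With PDFKit, measure could be (s) => doc.widthOfString(s), and each returned line would then be drawn separately.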
is there any solution to support urdu and arabic in pdfkit till now.
Simply reversing the text before it goes to pdfkit seems to work for both Hebrew and Arabic (I'm just eyeballing the text however since I speak neither)
const isHebrew = (text) => {
  return text.search(/[\u0590-\u05FF]/) >= 0;
};
const isArabic = (text) => {
  return text.search(/[\u0600-\u06FF]/) >= 0;
};
const rightToLeftText = (text) => {
  if (isHebrew(text) || isArabic(text)) {
    return text.split(' ').reverse().join(' ');
  } else {
    return text;
  }
};
rightToLeftText('أنا أتحدث اللغة العربية');
rightToLeftText('אני מדברת עברית');
You have to pay attention that Arabic script has symbols that combine with neighboring symbols based on what they are and where they are.
return text.split(' ').reverse().join(' ');
What of non-RTL text? Like a mix between Hebrew and numbers. Or English and Arabic. Consider the following example:
יש לי 500 tokenים של globus2000.
By the way, looks like GitHub is treating this wrong xD
Fair point but it doesn't actually apply in my use case. Perhaps splitting it into RTL and LTR chunks and then only reversing the RTL chunks would work? Worth a shot, especially since none of the other solution in here worked for me.
That's pretty much what the Unicode bidi algorithm does: http://www.unicode.org/reports/tr9/
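For illustration, the "split into RTL and LTR chunks" idea from the comment above could look roughly like the sketch below. This is not the Unicode bidi algorithm: it classifies whole space-separated words, assumes an RTL paragraph, and will mishandle mixed words, combining marks, and mirrored punctuation. The name naiveVisualOrder is made up.

```javascript
// Hebrew and Arabic blocks only; other RTL scripts are ignored in this sketch.
const RTL_CHARS = /[\u0590-\u05FF\u0600-\u06FF]/;

// Hypothetical helper: produce a left-to-right "visual" string for an RTL paragraph.
function naiveVisualOrder(text) {
  // Group consecutive words that share a direction into runs.
  const runs = [];
  for (const word of text.split(" ")) {
    const rtl = RTL_CHARS.test(word);
    const last = runs[runs.length - 1];
    if (last && last.rtl === rtl) last.words.push(word);
    else runs.push({ rtl, words: [word] });
  }
  // In an RTL paragraph the runs appear right to left, so reverse their order.
  // Inside RTL runs, reverse both word order and characters; LTR runs stay as-is.
  return runs
    .reverse()
    .map((run) =>
      run.rtl
        ? run.words.reverse().map((w) => [...w].reverse().join("")).join(" ")
        : run.words.join(" ")
    )
    .join(" ");
}
```

Numbers and Latin words keep their internal order here, which is the property plain reversal lacks, but a full tr9 implementation is still the only way to get every case right.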
Will the bidi algorithm be embedded in pdfkit?
Sure, if someone wants to implement it.
There's a JS implementation of tr9
https://github.com/bbc/unicode-bidirectional
not sure how accurate it is.
The word-reversal snippet above (text.split(' ').reverse().join(' ')) is exactly what I am looking for. Just a small improvement:
For RTL languages like persian (as I use it), add a space to the end of the string:
text.split(' ').reverse().join(' ') + ' ';
This will work like a charm!!!
Remember that if your string has special characters (e.g. ":") at the end, put them before the added white space.
Just a note for whoever is still stuck on this that reversing the text is not a good idea. It will reverse things like numbers and various other things that should not be reversed. 123456 might result in being reversed to 654321.
Use a library meant for this, like TwitterCldr, see #219 (comment)
Note: We are reversing array of words, not array of characters!!!
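The two operations being argued about here behave differently, which a tiny sketch makes visible. Both function names are made up for this illustration.

```javascript
// Reverses the order of space-separated words; each word is left intact.
const reverseWords = (text) => text.split(" ").reverse().join(" ");

// Reverses every character (by code point), including digits.
const reverseChars = (text) => [...text].reverse().join("");
```

Word reversal leaves 123456 alone but also leaves the letters inside each word in their original order, while character reversal flips the letters and turns 123456 into 654321. Neither matches what the bidi algorithm does for mixed text.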
I am trying twitterCLDR and problem still persists. In my case, problem isn't about character ordering, it is about white spaces. If you are using linux, as I, just install suitable language package, this will resolve character ordering and it will not be a problem anymore. TwitterCLDR is good for white space ordering but it operates character ordering simultaneously, and it is not good. The best manipulation is reverse() for me.
@weera-tech the actual letters need to be reversed too. Not just the word order is supposed to be reversed in rtl writing.
You are right, but I said that first install suitable language package, in RTL direction, you have to set align to right. Therefore it will have conflict with TCLDR character ordering. simple: -1 * -1 = 1 :)
I'm not sure what sort of mechanism would actually reverse characters for you, but not words, considering pdfkit has no rtl support whatsoever. Perhaps something weird is happening on Linux. I'm using pdfkit in the browser with webpack.
In my experience, and I have a production app using this approach with TwitterCLDR and pdfkit, simply reversing words resulted in support tickets being issued for exactly this problem. Words were in the correct order, but letters were in the wrong order.
Ooops!!!
You are using it in client-side? I am using server-side. Probably this is our difference.
The only correct implementation will be the Unicode bidi algorithm. Anything else, especially reverse(), will be incorrect.
There is a recent WASM build of the HarfBuzz engine, which is a text shaping engine used by Firefox, Chrome, and others.
https://github.com/harfbuzz/harfbuzzjs
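It handles shaping for RTL scripts among other things (HarfBuzz itself does not run the Unicode bidi algorithm, so reordering would still need a separate bidi implementation). I believe it could be integrated with pdfkit to help solve RTL once and for all.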
There is a demo here: https://harfbuzz.github.io/harfbuzzjs/
Some discussion about it being used to solve RTL issues for Photopea, which is a very popular online image editor: harfbuzz/harfbuzzjs#10
Unfortunately I'm not familiar at all with pdfkit's text rendering, but perhaps someone could look into it.
Hey,
Any news with RTL support?
@devongovett from my limited understanding of fontkit it seems that it does indeed support rtl.
I found this site and I was able to see rtl text being rendered properly.
https://fontkit-demo.now.sh/
Also from what I understand, pdfkit is based on fontkit so what is stopping this from working?
@andreialecu because RTL support is more than glyph rendering
The only proper way to render rtl language is
- determine flow of the paragraph (rtl or ltr)
- run text through unicode bidi
- render text, start position is determined by is paragraph rtl or ltr
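The first step in that list, determining the paragraph flow, is commonly done by looking at the first strongly directional character. Below is a simplified sketch of that heuristic: it treats Hebrew and Arabic letters as RTL and every other letter as LTR, which is an approximation of the real bidi character classes (other RTL scripts are not covered). paragraphDirection is a made-up name.

```javascript
// Any letter counts as a "strong" character in this simplified sketch.
const LETTER = /\p{L}/u;
// Assumption: only Hebrew and Arabic script letters are treated as RTL.
const RTL_LETTER = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;

// Hypothetical helper: returns "rtl" or "ltr" based on the first letter found.
// Digits, spaces, and punctuation are skipped because they are not strong.
function paragraphDirection(text, fallback = "ltr") {
  for (const ch of text) {
    if (LETTER.test(ch)) return RTL_LETTER.test(ch) ? "rtl" : "ltr";
  }
  return fallback;
}
```

The result could then drive the remaining steps, for example choosing { align: "right" } and an RTL base direction for the bidi pass when the answer is "rtl".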
I too would love to have an RTL support (Hebrew).
+1 for rtl support
Think out of the box
use puppeteer
I was able to use Persian font like this, I used this link
http://pdfkit.org/docs/text.html#fonts
doc.font("your language font here")
.text("text");
in my case, I used a Persian font you can use the font you need
How is this still not supported?
Wow, 7 years and still no full RTL support out of the box?…
So I tried pretty much everything but nothing works.
I tried twitter-cldr-js like this:
const bidiText = TwitterCldr.Bidi.from_string('hello שלום world', { direction: "RTL" });
bidiText.reorder_visually();
return bidiText.toString();
but it gets rendered like this: world םולשhello.
Trying icu-bidi results in:
PS C:\Users\...> npm i icu-bidi
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: 'salt@0.5.5',
npm WARN EBADENGINE required: { node: '>=0.6.x <=0.11.x' },
npm WARN EBADENGINE current: { node: 'v14.17.0', npm: '7.20.6' }
npm WARN EBADENGINE }
npm ERR! code 1
npm ERR! path ...\icu-bidi
npm ERR! command failed
npm ERR! command ...k-to-build
npm ERR! 'node-pre-gyp' is not recognized as an internal or external command,
npm ERR! operable program or batch file.
npm ERR! A complete log of this run can be found in:
npm ERR! C:\Users\...ebug.log
The "solution":
const textWithDoubleSpaces = '!world ,שלום'.replace(' ', '  ');
return textWithDoubleSpaces.split(' ').reverse().join(' ');
will handle Hebrew but not a combination of RTL and LTR (it results in world! ,שלום).
unicode-bidirectional gives me an error as well.
Any working suggestions?
How come this superior library isn't supporting RTL languages?!!
That's ridiculous:)
Though the package has implemented dozens of great functionalities, it's utterly incapable of supporting RTL text.
7 years and still no support:| That's a complete shame for the core developers!
For me I get all the arabic letters parsed correctly on { rtl: true }, but only the numbers are in reverse direction. So I wrote a function, pass the string into it before adding it to the text() function of PdfKit
Before
ู
ุฑูุญุฉ (002 - 001 ู
)
Code
// Reverse each run of consecutive ASCII digits in place; all other characters stay untouched.
const revNumsInString = (s) =>
  s.replace(/\d+/g, (run) => run.split("").reverse().join(""));
Result
ู
ุฑูุญุฉ (200 - 100 ู
)
@AmirABody Kinda wondering, why would say this is a superior library, then?
It requires a higher level layout algorithm than what pdfkit offers, for example https://github.com/foliojs/textkit. React PDF uses it under the hood: https://github.com/diegomura/react-pdf. Not sure if it supports bidi yet but the architecture is there to support it. Personally I think pdfkit is too low level for advanced text layout, and that it belongs in a higher level library like React PDF or pdfmake, but I also don't work on pdfkit much anymore.
still an issue 9 years later.
to my understanding there are 2 challenges:
- bi-directional text rendering (to support RTL, LTR, and mixed) --> the words must be in the right order
- layout of the document
  - PDF with locale: e.g. ar (arabic) shall be rendered from right to left
  - PDF with locale: e.g. en (english) shall be rendered from left to right
Regarding point 1, which was discussed above: I think the solution might be to use the rtla feature from the OpenType specification:
https://learn.microsoft.com/en-us/typography/opentype/spec/featurelist
This has worked with pdfkit for a long time already...
Please test something like:
const fs = require('fs')
const PDFDocument = require('pdfkit')
var doc = new PDFDocument({})
const customFont = fs.readFileSync('./NotoSansArabic-Regular.ttf')
doc.registerFont('Regular', customFont)
doc.fontSize(15)
// default features
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك")
// with the rtla (right-to-left alternates) feature requested
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك", {features: ['rtla']})
// with an empty feature tag
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك", {features: ['']})
additionally you can mix arabic and non arabic texts and it shall render correctly
or am i wrong ?