Right-to-Left (RTL) support for Hebrew and Arabic

Question

Opened this issue 11 years ago · 60 comments

Please add Right-to-Left (RTL) support for languages like Hebrew and Arabic...

Something like:

doc.rtl(true);

doc.text('...', {rtl: true});

Answer 1 · 2014-04-05T20:38:31.000Z

I don't know much about RTL languages, but it seems to me that you could reverse the string and align to the right to get this working (I'm probably wrong here). However, if the text contains a combination of LTR and RTL text, then we'll need an implementation of the Unicode Bidi Algorithm. Those who know more than I do, please fill me in. I'd love to see this implemented so PDFKit is more widely usable.

Answer 2 · 2014-04-05T20:40:12.000Z

A separate issue is vertical text support (e.g. Japanese), which I'd also like to see and which has its own challenges.

Answer 3 · 2014-04-05T21:46:41.000Z

Arabic also has its own challenges because letters get a different shape depending on their position in a word (beginning, middle, end) so this is anything but easy :)

Answer 4 · 2014-04-06T16:39:47.000Z

Interesting, I assume there is some sort of algorithm out there to determine this? Starting to sound like a lot of work.

Answer 5 · 2014-08-05T16:15:29.000Z

I'm just parachuting in, but isn't this something like what you need:
https://github.com/mathiasbynens/node-unicode-data

Why re-implement unicode algorithms?

EDIT: Wait a moment, this seems to be quite far from what's needed, my bad. But isn't there a ready implementation?

Answer 6 · 2014-08-05T16:17:40.000Z

No, that's just unicode character metadata, not any actual algorithms. RTL support will require an implementation of the Unicode Bidi Algorithm. Shaping of Arabic text with contextual substitutions is a separate problem to solve.

Answer 7 · 2014-08-05T16:19:58.000Z

Yeah, this library from Twitter might work but I haven't tried it.

Answer 8 · 2014-08-05T16:21:04.000Z

I'll go ahead and fork pdfkit, and see what i can come up with. Any pointers for where to start and how you'd approach it?

Answer 9 · 2014-08-05T16:24:53.000Z

I'd try Twitter's library and see if it produces the results you expect. Sorry for being so ignorant on this, but does it work to run the text through that library, then send the result to the PDFKit doc.text method?

Answer 10 · 2014-08-05T16:31:39.000Z

I found something that might be even more to the point:
https://github.com/cscott/node-icu-bidi

Answer 11 · 2014-08-05T17:00:01.000Z

Yeah, the problem is that node-icu-bidi is a node C++ module, but PDFKit also works in the browser, so everything must be pure JavaScript. If it works for your needs, feel free to use it, but PDFKit won't take on a non-JS dependency.

Answer 12 · 2014-08-05T17:05:31.000Z

I understand, so an acceptable solution would be to extract the Bidi algorithm from the Twitter library, correct?


Answer 13 · 2015-08-24T22:09:27.000Z

Maybe I'm late to the party; I just wanted to mention that I implemented a similar solution in DOS a long time ago (with old-fashioned 16x16 bitmap fonts), and I think the same approach can be applied here:

1- reorder the input string using the Bidi algorithm
2- reshape by applying single glyph substitution depending on the context (beginning, middle, or end of the word, or standalone glyph)
3- ligatures
4- invert the alignment (possibly using an RTL flag); if this is supported, a more appropriate naming of the alignment options would be leading/trailing instead of right/left

1 and 4 are the 'easy parts'; 2 and 3 are another story. For OpenType fonts, I think there is a GSUB table that can be used for this, but for other font types I think the only option is to implement the specific algorithm for each script (as you said, this is a lot of work).
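
To make step 2 above a bit more concrete, here is a minimal sketch of context-based form selection using the Unicode Arabic Presentation Forms-B block. The FORMS table covers only five letters and the name shapeArabic is made up for illustration. It ignores ligatures (step 3), diacritics, and everything else a real shaper or a font's GSUB table would handle, so treat it as a sketch and not as a usable shaper.

```javascript
// Presentation forms per letter: [isolated, final, initial, medial].
// Right-joining letters (alef, reh) only have isolated and final forms.
// Assumption: this tiny table is illustrative, not complete.
const FORMS = {
  '\u0627': ['\uFE8D', '\uFE8E'],                     // alef
  '\u0628': ['\uFE8F', '\uFE90', '\uFE91', '\uFE92'], // beh
  '\u0631': ['\uFEAD', '\uFEAE'],                     // reh
  '\u0644': ['\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'], // lam
  '\u0645': ['\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4'], // meem
};

// A letter connects to the following letter only if it is dual-joining.
const joinsToNext = (ch) => FORMS[ch] !== undefined && FORMS[ch].length === 4;
// Every letter in the table can connect to a preceding letter.
const joinsToPrev = (ch) => FORMS[ch] !== undefined;

function shapeArabic(text) {
  const chars = [...text];
  return chars
    .map((ch, i) => {
      const forms = FORMS[ch];
      if (!forms) return ch; // unknown characters pass through unchanged
      const prev = i > 0 && joinsToNext(chars[i - 1]);
      const next =
        forms.length === 4 && i + 1 < chars.length && joinsToPrev(chars[i + 1]);
      if (prev && next) return forms[3]; // medial
      if (next) return forms[2];         // initial
      if (prev) return forms[1];         // final
      return forms[0];                   // isolated
    })
    .join('');
}
```

The output is still in logical order; reordering (step 1) is a separate pass.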

Answer 14 · 2015-08-25T13:30:58.000Z

It seems another solution to Arabic shaping is 'text-based shaping', which transforms the characters at the string level rather than at the glyph level (further details are there). There also seems to be an existing implementation of this kind in JavaScript by the ibm-js team. From the sources, it appears that the text engine performs a series of operations at the character level:

1- Bidi reordering
2- Text shaping (AFAIK applies only to Arabic scripts)
3- Symmetrical swapping (replace characters such as [ and ( with their mirrored RTL counterparts ] and ))
4- Number shaping (replace 'Western Arabic' numerals 0, 1, 2 ... with their Eastern Arabic counterparts ٠, ١, ٢ ...)

This could also be a fallback for non-OpenType fonts, which don't have a GSUB table.

Answer 15 · 2016-08-27T14:57:53.000Z

Getting closer. With v0.8.0 the font engine changed to fontkit, which supports an Arabic shaper (e.g. @yelouafi's steps 2 and 3). Still need to implement the bidi algorithm for mixed script text.

Answer 16 · 2017-03-16T17:35:08.000Z

If you prioritize Bidi reordering and symmetrical swapping, that is enough for Hebrew support.
While Hebrew technically has letters that look different at the end of a word, you don't need to handle that, because Unicode defines those final forms as separate characters.
Text and number shaping can be added later for Arabic support.

Answer 17 · 2017-03-17T11:57:03.000Z

I found the following information related to this topic. Python Arabic Reshaper is a library that can be used when native Arabic support is not available. Its readme contains a good explanation of the issue and the solution. This library has been ported to JavaScript.

On the BIDI topic I found this test program written in Javascript.

Answer 18 · 2017-08-12T09:34:26.000Z

There are GSUB (Glyph substitution) tables in font files for Complex languages.
This link explains those tables with example.
https://www.microsoft.com/typography/otfntdev/arabicot/features.aspx

Answer 19 · 2017-10-01T21:31:36.000Z

PDFKIT still has a problem with RTL
any updates? @devongovett

Answer 20 · 2017-11-22T20:07:46.000Z

Hi! @devongovett any update on RTL support? Question 2, is this project dead?

Answer 21 · 2018-03-27T16:56:53.000Z

Please don't be dead :(

Answer 22 · 2018-05-04T19:54:49.000Z

@setpixel I needed this too, but since it doesn't sound like this feature has been added, I want to let you know that I found jsPDF really useful. It supports Arabic now.

Answer 23 · 2019-03-18T19:57:47.000Z

pdfkit has more functionality than jsPDF. jsPDF doesn't have full unicode support but pdfkit does. The project and its committers deserve the praise. For RTL, right-aligned text works very well. However, when we want to use columns, things change. The need is just to start from right-most column through the left most column. @devongovett we don't need anything except this I think because the RTL text has its RTL way, no need to reverse the strings. (same for LTR inside RTL)
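
For what it's worth, the column ordering described above comes down to simple arithmetic. Here is a small sketch; the function name rtlColumnOrigins and its parameters are hypothetical, and it only computes x positions without calling PDFKit.

```javascript
// Compute the left edge of each column so that column 0 is the right-most one.
// Assumption: equal-width columns, symmetric page margins, fixed gap.
function rtlColumnOrigins(pageWidth, margin, columns, gap) {
  const colWidth = (pageWidth - 2 * margin - gap * (columns - 1)) / columns;
  const xs = [];
  for (let i = 0; i < columns; i++) {
    xs.push(pageWidth - margin - colWidth * (i + 1) - gap * i);
  }
  return { colWidth, xs };
}
```

Each column could then be drawn with something like doc.text(chunk, xs[i], y, { width: colWidth, align: 'right' }), assuming you split the text into per-column chunks yourself.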

Answer 24 · 2019-03-19T06:06:10.000Z

RTL is much more than right aligned text. There’s the issue of comma and dot positions, and what happens when LTR stuff like numbers and English text are mixed in a sentence.

Answer 26 · 2019-09-05T09:59:40.000Z

I was able to manually handle Hebrew with code like:

npm install twitter_cldr

import * as TwitterCldrLoader from "twitter_cldr";

const TwitterCldr = TwitterCldrLoader.load("en");

class ... {
  private isHebrew(text: string) {
    var position = text.search(/[\u0590-\u05FF]/);
    return position >= 0;
  }

  private maybeRtlize(text: string) {
    if (this.isHebrew(text)) {
      var bidiText = TwitterCldr.Bidi.from_string(text, { direction: "RTL" });
      bidiText.reorder_visually();
      return bidiText.toString();
    } else {
      return text;
    }
  }
}

Just pass all text that may be in Hebrew through the maybeRtlize function.

It's not perfect and I only tested it for Hebrew, but it seems to work pretty well. If you also need right alignment, use something like isHebrew(myText) ? { align: "right" } : null for alignment.

The problem is that if the text wraps onto multiple lines, the first word of the text will be on the last line, which is wrong. There needs to be more logic added to handle line breaks.
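
One possible way around the wrapping problem is to break the logical text into lines first and only then reorder each line on its own, so the first word stays on the first line. The sketch below is an assumption about how that could look: wrapThenReorder is a made-up name, measure is any function that returns the rendered width of a string, and reorderLine is whatever per-line reordering you already use (for example the maybeRtlize function above).

```javascript
// Greedy word wrap on the logical string, followed by per-line reordering.
// measure(text) -> width, reorderLine(line) -> visually ordered line.
function wrapThenReorder(text, maxWidth, measure, reorderLine) {
  const lines = [];
  let current = '';
  for (const word of text.split(' ')) {
    const candidate = current ? current + ' ' + word : word;
    if (current && measure(candidate) > maxWidth) {
      lines.push(current); // the line is full, start a new one
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  return lines.map(reorderLine);
}
```

With PDFKit, measure could be (s) => doc.widthOfString(s), and each returned line would be written with its own doc.text call. I have not tested this against real documents.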

Answer 27 · 2019-10-16T17:58:51.000Z

Is there any solution yet for supporting Urdu and Arabic in PDFKit?

Answer 28 · 2019-10-31T15:59:07.000Z

Simply reversing the text before it goes to pdfkit seems to work for both Hebrew and Arabic (I'm just eyeballing the text however since I speak neither)

const isHebrew = (text) => {
  return text.search(/[\u0590-\u05FF]/) >= 0;
};

const isArabic = (text) => {
  return text.search(/[\u0600-\u06FF]/) >= 0;
};


const rightToLeftText = (text) => {
  if (isHebrew(text) || isArabic(text)) {
    return text.split(' ').reverse().join(' ');
  } else {
    return text;
  }
};

rightToLeftText('أنا أتحدث اللغة العربية');
rightToLeftText('אני מדברת עברית');

Answer 29 · 2019-10-31T17:52:13.000Z

You have to pay attention that Arabic script has symbols that combine with neighboring symbols based on what they are and where they are.

Answer 30 · 2019-10-31T17:53:42.000Z

return text.split(' ').reverse().join(' ');

What of non-RTL text? Like a mix between Hebrew and numbers. Or English and Arabic. Consider the following example:

יש לי 500 tokenים של globus2000.
By the way, looks like GitHub is treating this wrong xD

Answer 31 · 2019-10-31T18:03:50.000Z

Fair point but it doesn't actually apply in my use case. Perhaps splitting it into RTL and LTR chunks and then only reversing the RTL chunks would work? Worth a shot, especially since none of the other solution in here worked for me.
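
The chunking idea above can be sketched roughly as follows. This is a simplification and not the Unicode bidi algorithm: it assumes the paragraph direction is RTL, classifies whole space-separated words by whether they contain any Hebrew or Arabic character, and would mangle combining marks and mixed words such as the "tokenים" example. The name toVisualOrder is made up.

```javascript
// Assumption: only the basic Hebrew and Arabic blocks count as RTL.
const RTL_CHAR = /[\u0590-\u05FF\u0600-\u06FF]/;

// Convert a logical-order string to visual order for an LTR-only renderer,
// assuming an RTL paragraph.
function toVisualOrder(text) {
  // Group consecutive words of the same direction into runs.
  const runs = [];
  for (const word of text.split(' ')) {
    const rtl = RTL_CHAR.test(word);
    const last = runs[runs.length - 1];
    if (last && last.rtl === rtl) last.words.push(word);
    else runs.push({ rtl, words: [word] });
  }
  // Reverse the run order; reverse characters only inside RTL runs.
  return runs
    .reverse()
    .map((run) =>
      run.rtl
        ? [...run.words.join(' ')].reverse().join('')
        : run.words.join(' ')
    )
    .join(' ');
}
```

Numbers and English words keep their internal order because they sit in LTR runs, which is the part that plain reverse() gets wrong.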

Answer 32 · 2019-11-01T01:57:56.000Z

That’s pretty much what the Unicode bidi algorithm does: http://www.unicode.org/reports/tr9/

Answer 33 · 2019-11-04T18:14:10.000Z

Will the bidi algorithm be embedded in pdfkit?

Answer 34 · 2019-11-04T19:34:06.000Z

Sure, if someone wants to implement it.

Answer 35 · 2019-11-07T00:24:16.000Z

There's a JS implementation of tr9
https://github.com/bbc/unicode-bidirectional
not sure how accurate it is.

Answer 36 · 2019-12-12T00:12:43.000Z

The word-reversal snippet from Answer 28 is exactly what I am looking for. Just one small improvement:
For RTL languages like Persian (which I use), add a space to the end of the string:
text.split(' ').reverse().join(' ') + ' ';
This works like a charm!
Remember that if your string has a special character (e.g. ":") at the end, put it before the added whitespace.

Answer 37 · 2019-12-12T11:06:00.000Z

Just a note for whoever is still stuck on this that reversing the text is not a good idea. It will reverse things like numbers and various other things that should not be reversed. 123456 might result in being reversed to 654321

Use a library meant for this, like TwitterCldr, see #219 (comment)
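
A quick illustration of the point above, using a made-up sample string ("מחיר 123456", written with escapes so the source itself is unambiguous). Both helper names are hypothetical.

```javascript
const reverseChars = (s) => [...s].reverse().join('');
const reverseWords = (s) => s.split(' ').reverse().join(' ');

// Logical order: the Hebrew word for "price", a space, then the number.
const sample = '\u05DE\u05D7\u05D9\u05E8 123456';

// Character reversal flips the digits along with the Hebrew letters.
const byChars = reverseChars(sample); // '654321 \u05E8\u05D9\u05D7\u05DE'

// Word reversal keeps the digits intact but leaves the Hebrew letters
// in logical order, which a renderer without RTL support draws backwards.
const byWords = reverseWords(sample); // '123456 \u05DE\u05D7\u05D9\u05E8'
```

Neither result is correct on its own, which is why a proper bidi implementation treats digit runs and letter runs differently.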

Answer 38 · 2019-12-12T13:29:05.000Z

Note: we are reversing an array of words, not an array of characters!
I tried TwitterCLDR and the problem still persists. In my case, the problem isn't character ordering; it is whitespace. If you are using Linux, as I am, just install a suitable language package; this resolves character ordering, and it is no longer a problem. TwitterCLDR is good for whitespace ordering, but it reorders characters at the same time, which is not what I need. For me, the best manipulation is reverse().

Answer 39 · 2019-12-12T13:40:47.000Z

@weera-tech the actual letters need to be reversed too. Not just the word order is supposed to be reversed in rtl writing.

Answer 40 · 2019-12-12T13:43:54.000Z

You are right, but as I said, first install a suitable language package; in RTL direction you also have to set the alignment to right. TCLDR's character reordering then conflicts with that. Simply put: -1 * -1 = 1 :)

Answer 41 · 2019-12-12T13:56:23.000Z

I'm not sure what sort of mechanism would actually reverse characters for you, but not words, considering pdfkit has no rtl support whatsoever. Perhaps something weird is happening on Linux. I'm using pdfkit in the browser with webpack.

In my experience, and I have a production app using this approach with TwitterCLDR and pdfkit, simply reversing words resulted in support tickets being issued for exactly this problem. Words were in the correct order, but letters were in the wrong order.

Answer 42 · 2019-12-12T14:12:00.000Z

Ooops!!!
You are using it in client-side? I am using server-side. Probably this is our difference.

Answer 43 · 2019-12-12T15:26:04.000Z

The only correct implementation will be the Unicode bidi algorithm. Anything else, especially reverse(), will be incorrect.

Answer 44 · 2019-12-12T16:21:33.000Z

There is a recent WASM build of the HarfBuzz engine, which is a text shaping engine used by Firefox, Chrome, and others.

https://github.com/harfbuzz/harfbuzzjs

It does support Unicode bidi algorithms among other things. I believe it could be integrated with pdfkit to solve RTL once and for all.

There is a demo here: https://harfbuzz.github.io/harfbuzzjs/

Some discussion about it being used to solve RTL issues for Photopea, which is a very popular online image editor: harfbuzz/harfbuzzjs#10

Unfortunately I'm not familiar at all with pdfkit's text rendering, but perhaps someone could look into it.

Answer 45 · 2020-01-16T10:51:30.000Z

Hey,

Any news with RTL support?

Answer 46 · 2020-01-16T13:36:17.000Z

@devongovett from my limited understanding of fontkit it seems that it does indeed support rtl.

I found this site and I was able to see rtl text being rendered properly.
https://fontkit-demo.now.sh/

Also from what I understand, pdfkit is based on fontkit so what is stopping this from working?

Answer 47 · 2020-01-16T14:23:53.000Z

@andreialecu because RTL support is more than glyph rendering

The only proper way to render an RTL language is:

1. determine the flow of the paragraph (RTL or LTR)
2. run the text through the Unicode bidi algorithm
3. render the text; the start position is determined by whether the paragraph is RTL or LTR
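
The first step above (deciding the paragraph flow) is commonly done by looking at the first strong character, which is roughly what the bidi specification does when no direction is given. A minimal sketch, with the caveat that the RTL ranges below are a rough approximation I picked for illustration and paragraphDirection is a made-up name:

```javascript
// Assumption: these ranges approximate the RTL scripts (Hebrew, Arabic,
// and their presentation forms); they are not the full Unicode bidi classes.
const RTL_RANGE = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/;
const LETTER = /\p{L}/u;

// Return 'rtl' or 'ltr' based on the first letter found in the text.
function paragraphDirection(text) {
  for (const ch of text) {
    if (!LETTER.test(ch)) continue; // digits, spaces, punctuation are not strong
    return RTL_RANGE.test(ch) ? 'rtl' : 'ltr';
  }
  return 'ltr'; // no letters at all: fall back to LTR
}
```

The result could drive the third step, for example by choosing align: 'right' when the direction is 'rtl'. The second step still needs a real bidi implementation.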

Answer 48 · 2020-03-31T05:29:19.000Z

I too would love to have an RTL support (Hebrew).

Answer 49 · 2020-09-30T05:31:23.000Z

+1 for rtl support

Answer 50 · 2020-09-30T07:28:02.000Z

Think out of the box
use puppeteer

Answer 51 · 2021-02-02T08:54:36.000Z

I was able to use Persian font like this, I used this link
http://pdfkit.org/docs/text.html#fonts

doc.font("your language font here")
   .text("text");

in my case, I used a Persian font you can use the font you need

Answer 52 · 2021-05-23T06:11:52.000Z

How is this still not supported?

Answer 53 · 2021-06-30T20:33:39.000Z

Wow, 7 years and still no full RTL-support out of box?…

Answer 54 · 2021-08-26T09:50:17.000Z

So I tried pretty much everything but nothing works.
I tried twitter-cldr-js like this:

const bidiText = TwitterCldr.Bidi.from_string('hello שלום world', { direction: "RTL" });
bidiText.reorder_visually();
return bidiText.toString();

but it gets rendered like this: world םולשhello.
Trying icu-bidi results in:

PS C:\Users\...> npm i icu-bidi
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: 'salt@0.5.5',
npm WARN EBADENGINE   required: { node: '>=0.6.x <=0.11.x' },
npm WARN EBADENGINE   current: { node: 'v14.17.0', npm: '7.20.6' }
npm WARN EBADENGINE }
npm ERR! code 1
npm ERR! path ...\icu-bidi
npm ERR! command failed
npm ERR! command ...k-to-build
npm ERR! 'node-pre-gyp' is not recognized as an internal or external command,
npm ERR! operable program or batch file.

npm ERR! A complete log of this run can be found in:
npm ERR!     C:\Users\...ebug.log

The "solution":

const textWithDoubleSpaces = '!world ,שלום'.replace(' ', '  ');
return textWithDoubleSpaces.split(' ').reverse().join('  ');

will handle Hebrew but not a combination of RTL and LTR (its result is world! ,שלום).
unicode-bidirectional gives me an error.

Any working suggestions? 🙏

Answer 55 · 2021-11-07T13:27:20.000Z

How come this superior library isn't supporting RTL languages?!!
That's ridiculous:)
Though the package has implemented dozens of great functionalities, it's utterly incapable of supporting RTL text.
7 years and still no support:| That's a complete shame for the core developers!

Answer 56 · 2021-12-06T14:18:34.000Z

For me, all the Arabic letters are rendered correctly with { rtl: true }; only the numbers come out in reverse order. So I wrote a function; pass the string through it before handing it to PDFKit's text() function.

Before

مروحة (002 - 001 م)

Code

revNumsInString = (s) => {
    var x = 0, keep = "", r = 0;
    s.replace(/(?:[\d])/gi, (i, q) => {keep += (r == q - 1 ? "" : "|") + i; r = q;});
    keep = keep.split("|").map(x => x.split("").reverse().join("")).join("");
    return s.replace(/(?:[\d])/gi, (i) =>keep[x++]);
}

Result

مروحة (200 - 100 م)
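
As far as I can tell, the function above reverses each run of consecutive digits and leaves everything else in place. If that reading is right, the same effect can be had with a single regex replace. This is a sketch with a made-up name (reverseDigitRuns), not a drop-in fix for RTL number handling in general.

```javascript
// Reverse the characters inside every maximal run of ASCII digits.
function reverseDigitRuns(s) {
  return s.replace(/\d+/g, (run) => [...run].reverse().join(''));
}
```

For example, reverseDigitRuns('(002 - 001)') gives '(200 - 100)'.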

Answer 57 · 2021-12-06T18:04:20.000Z

@AmirABody Kinda wondering, why would say this is a superior library, then?

Answer 58 · 2021-12-06T18:15:30.000Z

It requires a higher level layout algorithm than what pdfkit offers, for example https://github.com/foliojs/textkit. React PDF uses it under the hood: https://github.com/diegomura/react-pdf. Not sure if it supports bidi yet but the architecture is there to support it. Personally I think pdfkit is too low level for advanced text layout, and that it belongs in a higher level library like React PDF or pdfmake, but I also don't work on pdfkit much anymore.

Answer 59 · 2023-11-22T12:41:09.000Z

still an issue 9 years later.

Answer 60 · 2024-01-17T12:53:01.000Z

To my understanding there are 2 challenges:

1. bi-directional text rendering (to support RTL, LTR, and mixed text) --> the words must be in the right order
2. layout of the document
   a PDF with an RTL locale, e.g. ar (Arabic), shall be rendered from right to left
   a PDF with an LTR locale, e.g. en (English), shall be rendered from left to right

Regarding point 1, which was discussed above:
I think the solution might be to use the rtla feature from the OpenType specification:
https://learn.microsoft.com/en-us/typography/opentype/spec/featurelist
This has worked with pdfkit for a long time already.

Please test something like:

const fs = require('fs')
const PDFDocument = require('pdfkit')

var doc = new PDFDocument({})
doc.pipe(fs.createWriteStream('rtl-test.pdf')) // output path is arbitrary
const customFont = fs.readFileSync('./NotoSansArabic-Regular.ttf')
// registerFont takes the font name as a string
doc.registerFont('Regular', customFont)
doc.fontSize(15)
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك")
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك" , {features: ['rtla']})
doc.font('Regular').fillColor("black").text("مرحبا كيف حالك" , {features: ['']})
doc.end()

Additionally, you can mix Arabic and non-Arabic text and it should render correctly.

Or am I wrong?