dbashford/textract

How can I set language options?

ibeeger opened this issue · 9 comments

I want to use the language chi_sim.

Where can I set options?

All current options are in the readme.

If foreign languages aren't supported by the extractors natively, I'd have to see if they somehow provide support.

Multi language support is outside my expertise. Are there specific options on the underlying extractors you are looking to manipulate?

I use tesseract, which accepts parameters such as:
tesseract demo.jpg res -l chi_sim;

textract(type, filePath, config, function( error, text ) {})

I want to know which config settings are available for that type.

OK, specifically for tesseract, I can look into allowing those configuration params to pass through.
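For illustration only, here is a minimal sketch of what such a pass-through could look like. The helper name `buildTesseractArgs` and the config shape `{ tesseract: { lang: ... } }` are assumptions made for this example, not necessarily how textract implements it. The only thing taken from the thread is the `-l` flag of the `tesseract` command line.

```javascript
// Hypothetical helper, not part of textract: turn a config object into the
// argument list for the `tesseract` CLI.
function buildTesseractArgs(inputPath, outputBase, config) {
  const args = [inputPath, outputBase];
  // Assumed config shape: { tesseract: { lang: 'chi_sim' } }
  const lang = config && config.tesseract && config.tesseract.lang;
  if (lang) {
    // `-l <lang>` selects the language data tesseract uses for recognition.
    args.push('-l', lang);
  }
  return args;
}

const args = buildTesseractArgs('demo.jpg', 'res', { tesseract: { lang: 'chi_sim' } });
// Prints: tesseract demo.jpg res -l chi_sim
console.log(['tesseract'].concat(args).join(' '));
```

When no language is configured, the helper returns only the input and output arguments, so tesseract falls back to its default language.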

Thank you.

Any chance you could provide an image and then the expected text from that image? Something for me to test with?

I added language support for tesseract.

One thing I had to do to support languages was update the cleaning regex that is responsible for stripping "non-text". I added \u4E00-\u9FFF to the regex to keep Chinese characters. I did that based on this post on Stack Overflow.

I obviously do not know Chinese. Is it worth adding other ranges? Please let me know.
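As a rough illustration of the change described above, the sketch below strips every character outside an allowed set that includes the \u4E00-\u9FFF range. The pattern and the function name `stripNonText` are assumptions made for this example; this is not textract's actual cleaning regex.

```javascript
// Hypothetical cleaning step, NOT textract's real regex: remove every
// character that is not a basic Latin letter, digit, common punctuation,
// whitespace, or a character in the range \u4E00-\u9FFF.
const NON_TEXT = /[^a-zA-Z0-9\u4E00-\u9FFF \t\r\n.,;:!?'"()\-]/g;

function stripNonText(text) {
  return text.replace(NON_TEXT, '');
}

// Chinese characters survive, the control character is dropped.
console.log(stripNonText('你好, world\u0000!')); // prints "你好, world!"

// The full-width full stop (U+3002) lies outside \u4E00-\u9FFF, so it is removed.
console.log(stripNonText('你好。')); // prints "你好"
```

The second call hints at one possible answer to the question about other ranges: CJK punctuation such as U+3002 sits in the U+3000-U+303F block, so a pattern limited to \u4E00-\u9FFF would drop it.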

Released with v0.13.1

If you add all of them, it may take a lot of work. I would first try

I think your response got cut off.