A text extraction node module.
- HTML, HTM
- Markdown
- XML, XSL
- DOC, DOCX
- ODT, OTT (experimental, feedback needed!)
- RTF
- XLS, XLSX, XLSB, XLSM, XLTX
- ODS, OTS
- PPTX, POTX
- ODP, OTP
- ODG, OTG
- PNG, JPG, GIF
- DXF
- `application/javascript`
- All `text/*` mime-types.
In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
npm install textract
Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of. Not having these items installed does not prevent you from using textract; it only prevents you from extracting those specific file types.
- `PDF` extraction requires `pdftotext` be installed.
- `DOC` and `RTF` extraction requires `catdoc` be installed, unless on OSX, in which case textutil (installed by default) is used.
- `PNG`, `JPG` and `GIF` require `tesseract` to be available. Images need to be pretty clear, high DPI and made almost entirely of just text for `tesseract` to be able to accurately extract the text.
- `DXF` extraction requires `drawingtotext` be available.
Configuration can be passed into textract. The following configuration options are available:

- `preserveLineBreaks`: When using the command line this is set to `true` to preserve stdout readability. When using the library via node this is set to `false`. Pass this in as `true` and textract will not strip any line breaks.
- `exec`: Some extractors (dxf) use node's `exec` functionality. This setting allows for providing config to `exec` execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the `exec` `maxBuffer` setting.
- `[ext].exec`: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the `odt` extractor is what you would configure for `odt`, `odg`, etc. Check the extractors to see which you want to specifically configure. At the bottom of each is a list of `types` for which the extractor is responsible.
- `tesseract.lang`: A pass-through to tesseract allowing for setting of language for extraction, e.g. `{ tesseract: { lang: "chi_sim" } }`
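As a minimal sketch of how these options fit together, the snippet below builds a config object and hands it to `textract.fromFileWithPath`. The helper name `extractWithConfig`, the option values, and the choice to pass the `textract` module in as an argument are illustrative assumptions, not part of textract itself.

```javascript
// Illustrative config; the values are assumptions, tune them for your files.
const config = {
  // Keep line breaks in the extracted text (the library default is false).
  preserveLineBreaks: true,
  // Passed through to node's exec for extractors that shell out.
  exec: { maxBuffer: 500000 },
  // Language setting handed to tesseract for image extraction.
  tesseract: { lang: 'chi_sim' }
};

// Hypothetical helper: `textract` is the module returned by require('textract').
function extractWithConfig(textract, filePath, callback) {
  textract.fromFileWithPath(filePath, config, callback);
}
```

Calling `extractWithConfig(require('textract'), filePath, function (error, text) { ... })` would then run the extraction with those settings.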
If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.
$ textract pathToFile
Configuration flags can be passed into textract via the command line.
textract pathToFile --preserveLineBreaks false
Parameters like exec.maxBuffer can be passed as you'd expect.
textract pathToFile --exec.maxBuffer 500000
And multiple flags can be used together.
textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
var textract = require('textract');

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.
error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be set on the error object.
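The error handling described above might look like the following sketch. The helper name `extractOrSkip`, passing the `textract` module in as an argument, and returning an empty string for unsupported types are assumptions made for illustration; only `fromFileWithPath` and the `typeNotFound` flag come from textract.

```javascript
// Hypothetical helper: treats unsupported file types as "nothing to extract"
// and forwards every other failure to the caller.
function extractOrSkip(textract, filePath, callback) {
  textract.fromFileWithPath(filePath, function (error, text) {
    if (error && error.typeNotFound) {
      // textract does not currently extract files of this type.
      return callback(null, '');
    }
    if (error) {
      // Any other failure is passed through unchanged.
      return callback(error);
    }
    callback(null, text);
  });
}
```

Whether an unsupported type should be skipped silently or reported is a design choice; a stricter caller could simply forward every error.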
textract.fromFileWithPath(filePath, function( error, text ) {})
textract.fromFileWithPath(filePath, config, function( error, text ) {})
textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})
textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})
textract.fromBufferWithMime(type, buffer, function( error, text ) {})
textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})
textract.fromBufferWithName(name, buffer, function( error, text ) {})
textract.fromBufferWithName(name, buffer, config, function( error, text ) {})
textract.fromUrl(url, function( error, text ) {})
textract.fromUrl(url, config, function( error, text ) {})

- #53. Cleared up documentation around CLI and line breaks.
- #54. PR removed `disableCatdocWordWrap` as an option, instead always disabling catdoc's word wrapping.
- #55. PR removed clobbering of non-boolean flags on CLI.
- #52. PR fixed CLI post big API changes.
- #51. Fixed issue with large files using unzip returning blank string.
- #49 Updated messages when extractors are not available to be purely informational, since textract will work just fine without some of its extractors.
- #50. Updated way in which catdoc was detected to not rely on file being test extracted.
- Overhaul of interface. To simplify the code, the original `textract` function was broken into `textract.fromFileWithPath` and `textract.fromFileWithMimeAndPath`.
- #41. Added support for pulling files from a URL.
- #40. Added support for extracting text from a node `Buffer`. This saves you from having to write the file to disk first. textract does have to write the file to disk itself, but because it is a textract requirement that files be on disk, textract should be able to take care of that for you. Two new functions, `textract.fromBufferWithName` and `textract.fromBufferWithMime`, have been added. textract needs to know either the file name or the mime type to extract a buffer.
- Added entity decoding, so encoded items like `<`, `>`, `"`, `'`, and `&` will show up appropriately in the text.
- Removed external dependency on `unzip`
- #38. Added markdown support.
- #31. Added initial ODT support. Feedback needed if there is any trouble. Also added OTT support.
- Added support for ODS, OTS.
- Added support for XML, XSL.
- Added support for POTX.
- Added support for XLTX, XLTS.
- Added support for ODG, OTG.
- Added support for ODP, OTP.
- Pull Request #39 added support for not word wrapping with catdoc.
- #30, #34. The command line has been improved, allowing for all the configuration options to be provided.
- Updated character stripping regex to be more lenient.
- Added HTML extraction.
- Added ability for extractors to register for specific extensions (not yet used). This handles cases where extensions (like `.webarchive`) do not have recognized mime types.
- Addressed some lingering regex issues from previous release.
- Added tests for RTF, more tests for DOC
- #29 Introduced new extractor for `.doc` and `.rtf` for OSX only. All non-OSX operating systems will continue to use `catdoc`. Going forward, because of issues getting `catdoc` installed on OSX, on OSX only `textutil` will be used. `textutil` comes default installed with OSX.
- #29 which resulted in the following changes:
  - writing info messages to `stderr` when extractors are taking a while to get going
  - no longer removing …
  - centralized some cleansing regexes, also no longer removing multiple back to back spaces using `\s` as it was removing any back to back newlines. Now scoping back to back replacing to `[\t\v\u00A0]`.
- #27, addressed issues with page ordering in `pptx` extraction.
- #25, added language support for tesseract, see `tesseract.lang` property.
- Updated regex that strips bad characters to not strip (some) Chinese characters. The regex will likely need updating by someone more familiar with Chinese. =)
- #26, using `os.tmpdir()` rather than a temp dir inside textract.
- Upgraded to latest `j` (dependency)
- Removed `macProcessGif` option and tests as tesseract seems to work on Mac just fine now
- #21, #22, Now using j via its binaries rather than using it via node. This makes XLS/X extraction slower, but reduces memory consumption of textract significantly.
- Updated pdf-text-extract to latest, fixes #20.
- Addressed path escaping issues with tesseract, fixes #18
- Using j to handle `xls` and `xlsx`, this removes the requirement on the `xls2csv` binary.
- j also supports `xlsb` and `xlsm`

