This is a Go library to convert various file formats into plaintext and provide related useful functions.
This library is used for https://intelx.io and was successfully tested over 184 million individual files. It is partly written from scratch, partly forked from open source and partly a rewrite of existing code. Many existing libraries lack stability and functionality and this libraries solves that.
We welcome any contributions - please open issues for any feature requests, bugs, and other related issues.
It supports following file formats for plaintext conversion:
- Word: DOC, DOCX, RTF, ODT
- Excel: XLS, XLSX, ODS
- PowerPoint: PPTX
- Ebook: EPUB, MOBI
- Website: HTML
Functions for compressed and container files:
- Decompress files: GZ, BZ, BZ2, XZ
- Extract files from containers: ZIP, RAR, 7Z, TAR
Picture related functions:
- Check if pictures are excessively large
- Compress (and convert) pictures to JPEG: GIF, JPEG, PNG, BMP, TIFF
- Resize and compress pictures
- Extract pictures from PDF files
To download this library:
go get -u github.com/IntelligenceX/fileconversion
And then use it like:
package main
import (
"bytes"
"fmt"
"os"
"github.com/IntelligenceX/fileconversion"
)
const sizeLimit = 2 * 1024 * 1024 // 2 MB
func main() {
// extract text from an XLSX file
file, err := os.Open("Test.xlsx")
if err != nil {
fmt.Printf("Error opening file: %s\n", err)
return
}
defer file.Close()
stat, _ := file.Stat()
buffer := bytes.NewBuffer(make([]byte, 0, sizeLimit))
fileconversion.XLSX2Text(file, stat.Size(), buffer, sizeLimit, -1)
fmt.Println(buffer.String())
}
The package exports the following functions:
XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)
DOCX2Text(file io.ReaderAt, size int64) (string, error)
EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)
HTML2Text(reader io.Reader) (pageText string, err error)
HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)
Mobi2Text(file io.ReadSeeker) (string, error)
ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)
PPTX2Text(file io.ReaderAt, size int64) (string, error)
RTF2Text(inputRtf string) string
XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)
XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)
Picture functions:
IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)
CompressJPEG(Picture []byte, quality int) (compressed []byte)
ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint)
PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)
Compression and container file functions:
DecompressFile(data []byte) (decompressed []byte, valid bool)
ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))
This library uses other go packages. Run the following command to download them:
go get -u github.com/nwaples/rardecode
go get -u github.com/saracen/go7z
go get -u github.com/ulikunitz/xz
go get -u github.com/mattetti/filebuffer
go get -u github.com/richardlehane/mscfb
go get -u github.com/taylorskalyo/goreader/epub
go get -u github.com/PuerkitoBio/goquery
go get -u github.com/ssor/bom
go get -u github.com/levigross/exp-html
go get -u github.com/neofight/mobi/convert
go get -u github.com/neofight/mobi/headers
go get -u github.com/unidoc/unipdf
go get -u github.com/nfnt/resize
go get -u github.com/tealeg/xlsx
go get -u gopkg.in/xmlpath.v2
There are no functional tests. The only test functions are used manually for debugging.
Other packages were tested and either found insufficient, or unstable. Many of the below listed packages were found to be unstable, cause crashes, as well as exhaust memory due to bad programming, bad input sanitizing and bad memory management.
html2text
is forked from https://github.com/jaytaylor/html2textodf
is forked from https://github.com/knieriem/odfole2
is forked and partly rewritten from https://github.com/extrame/ole2xls
is forked from https://github.com/sergeilem/xls which is a fork from https://github.com/extrame/xlsdoc
is forked from https://github.com/EndFirstCorp/doc2txtdocx
is forked from https://github.com/guylaor/gowordmobi
is forked from https://github.com/neofight/mobiodt
is forked from https://github.com/lu4p/catpptx
is forked from https://github.com/mr-tim/rol-o-decksrtf
is forked from https://github.com/J45k4/rtf-go
This is free and unencumbered software released into the public domain.
Note that this package includes, or consists partly of forks or rewrite of existing open source code. Use at your own risk. Intelligence X does not provide any warranty for this library or any parts of it.