UTF-8 Detector

Primary language: Go. License: MIT.

isUTF8

Detect whether a file is well-formed UTF-8 or not.

isUTF8 is written in Go and uses memory-mapped files to run as quickly as possible. It depends on the golang.org/x/sys/unix package, so it will probably run only on Unix-like systems (e.g., macOS, Linux). A simpler, portable, but slower approach could use ordinary file I/O with utf8.Valid or utf8.ValidString from the standard unicode/utf8 package.

On a 2016 MacBook Pro, isUTF8 checked a 1 GB file in around 1 second, about 30% faster than a nearly identical C program compiled with gcc's -O3 flag. Run times will vary depending on the system and on how much of the file is already in the memory cache.

For information about well-formed UTF-8, see The Unicode Standard, Chapter 3 (Conformance), Table 3-7 (Well-Formed UTF-8 Byte Sequences).

Prerequisites

Go programming language.

The golang.org/x/sys/unix package. It is not part of the standard Go installation, so it must be installed separately:

go get golang.org/x/sys/unix

Building

git clone https://github.com/mfuhr/isUTF8.git
cd isUTF8
go test
go build

To install under $GOPATH/bin:

go install

To see test coverage:

go test -coverprofile=coverage.out
go tool cover -func=coverage.out
go tool cover -html=coverage.out

Examples

$ ./isUTF8 testdata/test_utf8.txt
true testdata/test_utf8.txt
$ echo $?
0
$ ./isUTF8 testdata/test_latin1.txt
false testdata/test_latin1.txt
$ echo $?
1

Status

In active development (June 2017). Behavior, especially the output, is subject to change.