Okio is a library that complements java.io
and java.nio
to make it much
easier to access, store, and process your data. It started as a component of
OkHttp, the capable HTTP client included in Android. It's well-exercised
and ready to solve new problems.
Okio is built around two types that pack a lot of capability into a straightforward API:
-
ByteString is an immutable sequence of bytes. For character data,
String
is fundamental.ByteString
is String's long-lost brother, making it easy to treat binary data as a value. This class is ergonomic: it knows how to encode and decode itself as hex, base64, and UTF-8. -
Buffer is a mutable sequence of bytes. Like
ArrayList
, you don't need to size your buffer in advance. You read and write buffers as a queue: write data to the end and read it from the front. There's no obligation to manage positions, limits, or capacities.
Internally, ByteString
and Buffer
do some clever things to save CPU and
memory. If you encode a UTF-8 string as a ByteString
, it caches a reference to
that string so that if you decode it later, there's no work to do.
Buffer
is implemented as a linked list of segments. When you move data from
one buffer to another, it reassigns ownership of the segments rather than
copying the data across. This approach is particularly helpful for multithreaded
programs: a thread that talks to the network can exchange data with a worker
thread without any copying or ceremony.
An elegant part of the java.io
design is how streams can be layered for
transformations like encryption and compression. Okio includes its own stream
types called Source
and Sink
that work like InputStream
and
OutputStream
, but with some key differences:
-
Timeouts. The streams provide access to the timeouts of the underlying I/O mechanism. Unlike the
java.io
socket streams, bothread()
andwrite()
calls honor timeouts. -
Easy to implement.
Source
declares three methods:read()
,close()
, andtimeout()
. There are no hazards likeavailable()
or single-byte reads that cause correctness and performance surprises. -
Easy to use. Although implementations of
Source
andSink
have only three methods to write, callers are given a rich API with theBufferedSource
andBufferedSink
interfaces. These interfaces give you everything you need in one place. -
No artificial distinction between byte streams and char streams. It's all data. Read and write it as bytes, UTF-8 strings, big-endian 32-bit integers, little-endian shorts; whatever you want. No more
InputStreamReader
! -
Easy to test. The
Buffer
class implements bothBufferedSource
andBufferedSink
so your test code is simple and clear.
Sources and sinks interoperate with InputStream
and OutputStream
. You can
view any Source
as an InputStream
, and you can view any InputStream
as a
Source
. Similarly for Sink
and OutputStream
.
We've written some recipes that demonstrate how to solve common problems with Okio. Read through them to learn about how everything works together. Cut-and-paste these examples freely; that's what they're for.
Use Okio.source(File)
to open a source stream to read a file. The returned
Source
interface is very small and has limited uses. Instead we wrap the
source with a buffer. This has two benefits:
-
It makes the API more powerful. Instead of the basic methods offered by
Source
,BufferedSource
has dozens of methods to address most common problems concisely. -
It makes your program run faster. Buffering allows Okio to get more done with fewer I/O operations.
Each Source
that is opened needs to be closed. The code that opens the stream
is responsible for making sure it is closed. Here we use Java's try
blocks to
close our sources automatically.
public void readLines(File file) throws IOException {
try (Source fileSource = Okio.source(file);
BufferedSource bufferedSource = Okio.buffer(fileSource)) {
while (true) {
String line = bufferedSource.readUtf8Line();
if (line == null) break;
if (line.contains("square")) {
System.out.println(line);
}
}
}
}
The readUtf8Line()
API reads all of the data until the next line delimiter –
either \n
, \r\n
, or the end of the file. It returns that data as a string,
omitting the delimiter at the end. When it encounters empty lines the method
will return an empty string. If there isn’t any more data to read it will
return null.
The above program can be written more compactly by inlining the fileSource
variable and by using a fancy for
loop instead of a while
:
public void readLines(File file) throws IOException {
try (BufferedSource source = Okio.buffer(Okio.source(file))) {
for (String line; (line = source.readUtf8Line()) != null; ) {
if (line.contains("square")) {
System.out.println(line);
}
}
}
}
The readUtf8Line()
method is suitable for parsing most files. For certain
use-cases you may also consider readUtf8LineStrict()
. It is similar but it
requires that each line is terminated by \n
or \r\n
. If it encounters the
end of the file before that it will throw an EOFException
. The strict variant
also permits a byte limit to defend against malformed input.
public void readLines(File file) throws IOException {
try (BufferedSource source = Okio.buffer(Okio.source(file))) {
while (!source.exhausted()) {
String line = source.readUtf8LineStrict(1024L);
if (line.contains("square")) {
System.out.println(line);
}
}
}
}
Above we used a Source
and a BufferedSource
to read a file. To write, we use
a Sink
and a BufferedSink
. The advantages of buffering are the same: a more
capable API and better performance.
public void writeEnv(File file) throws IOException {
try (Sink fileSink = Okio.sink(file);
BufferedSink bufferedSink = Okio.buffer(fileSink)) {
for (Map.Entry<String, String> entry : System.getenv().entrySet()) {
bufferedSink.writeUtf8(entry.getKey());
bufferedSink.writeUtf8("=");
bufferedSink.writeUtf8(entry.getValue());
bufferedSink.writeUtf8("\n");
}
}
}
There isn’t an API to write a line of input; instead we manually insert our own
newline character. Most programs should hardcode "\n"
as the newline
character. In rare situations you may use System.lineSeparator()
instead of
"\n"
: it returns "\r\n"
on Windows and "\n"
everywhere else.
We can write the above program more compactly by inlining the fileSink
variable and by taking advantage of method chaining:
public void writeEnv(File file) throws IOException {
try (BufferedSink sink = Okio.buffer(Okio.sink(file))) {
for (Map.Entry<String, String> entry : System.getenv().entrySet()) {
sink.writeUtf8(entry.getKey())
.writeUtf8("=")
.writeUtf8(entry.getValue())
.writeUtf8("\n");
}
}
}
In the above code we make four calls to writeUtf8()
. Making four calls is
more efficient than the code below because the VM doesn’t have to create and
garbage collect a temporary string.
sink.writeUtf8(entry.getKey() + "=" + entry.getValue() + "\n"); // Slower!
In the above APIs you can see that Okio really likes UTF-8. Early computer systems suffered many incompatible character encodings: ISO-8859-1, ShiftJIS, ASCII, EBCDIC, etc. Writing software to support multiple character sets was awful and we didn’t even have emoji! Today we're lucky that the world has standardized on UTF-8 everywhere, with some rare uses of other charsets in legacy systems.
If you need another character set, readString()
and writeString()
are there
for you. These methods require that you specify a character set. Otherwise you
may accidentally create data that is only readable by the local computer. Most
programs should use the UTF-8 methods only.
When encoding strings you need to be mindful of the different ways that strings
are represented and encoded. When a glyph has an accent or another adornment
it may be represented as a single complex code point (é
) or as a simple code
point (e
) followed by its modifiers (´
). When the entire glyph is a single
code point that’s called NFC; when it’s multiple it’s NFD.
Though we use UTF-8 whenever we read or write strings in I/O, when they are in
memory Java Strings use an obsolete character encoding called UTF-16. It is a
bad encoding because it uses a 16-bit char
for most characters, but some don’t
fit. In particular, most emoji use two Java chars. This is problematic because
String.length()
returns a surprising result: the number of UTF-16 chars and
not the natural number of glyphs.
Café 🍩 | Café 🍩 | |
---|---|---|
Form | NFC | NFD |
Code Points | c a f é ␣ 🍩 |
c a f e ´ ␣ 🍩 |
UTF-8 bytes | 43 61 66 c3a9 20 f09f8da9 |
43 61 66 65 cc81 20 f09f8da9 |
String.codePointCount | 6 | 7 |
String.length | 7 | 8 |
Utf8.size | 10 | 11 |
For the most part Okio lets you ignore these problems and focus on your data. But when you need them, there are convenient APIs for dealing with low-level UTF-8 strings.
Use Utf8.size()
to count the number of bytes required to encode a string as
UTF-8 without actually encoding it. This is handy in length-prefixed encodings
like protocol buffers.
Use BufferedSource.readUtf8CodePoint()
to read a single variable-length code
point, and BufferedSink.writeUtf8CodePoint()
to write one.
Okio likes testing. The library itself is heavily tested, and it has features that are often helpful when testing application code. One pattern we’ve found to be quite useful is “golden value” testing. The goal of such tests is to confirm that data encoded with earlier versions of a program can safely be decoded by the current program.
We’ll illustrate this by encoding a value using Java Serialization. Though we
must disclaim that Java Serialization is an awful encoding system and most
programs should prefer other formats like JSON or protobuf! In any case, here’s
a method that takes an object, serializes it, and returns the result as a
ByteString
:
private ByteString serialize(Object o) throws IOException {
Buffer buffer = new Buffer();
try (ObjectOutputStream objectOut = new ObjectOutputStream(buffer.outputStream())) {
objectOut.writeObject(o);
}
return buffer.readByteString();
}
There’s a lot going on here.
-
We create a buffer as a holding space for our serialized data. It’s a convenient replacement for
ByteArrayOutputStream
. -
We ask the buffer for its output stream. Writes to a buffer or its output stream always append data to the end of the buffer.
-
We create an
ObjectOutputStream
(the encoding API for Java serialization) and write our object. The try block takes care of closing the stream for us. Note that closing a buffer has no effect. -
Finally we read a byte string from the buffer. The
readByteString()
method allows us to specify how many bytes to read; here we don’t specify a count in order to read the entire thing. Reads from a buffer always consume data from the front of the buffer.
With our serialize()
method handy we are ready to compute and print a golden
value.
Point point = new Point(8.0, 15.0);
ByteString pointBytes = serialize(point);
System.out.println(pointBytes.base64());
We print the ByteString
as base64 because it’s a compact format
that’s suitable for embedding in a test case. The program prints this:
rO0ABXNyAB5va2lvLnNhbXBsZXMuR29sZGVuVmFsdWUkUG9pbnTdUW8rMji1IwIAAkQAAXhEAAF5eHBAIAAAAAAAAEAuAAAAAAAA
That’s our golden value! We can embed it in our test case using base64 again
to convert it back into a ByteString
:
ByteString goldenBytes = ByteString.decodeBase64("rO0ABXNyAB5va2lvLnNhbXBsZ"
+ "XMuR29sZGVuVmFsdWUkUG9pbnTdUW8rMji1IwIAAkQAAXhEAAF5eHBAIAAAAAAAAEAuA"
+ "AAAAAAA");
The next step is to deserialize the ByteString
back into our value class. This
method reverses the serialize()
method above: we append a byte string to a
buffer then consume it using an ObjectInputStream
:
private Object deserialize(ByteString byteString) throws IOException, ClassNotFoundException {
Buffer buffer = new Buffer();
buffer.write(byteString);
try (ObjectInputStream objectIn = new ObjectInputStream(buffer.inputStream())) {
return objectIn.readObject();
}
}
Now we can test the decoder against the golden value:
ByteString goldenBytes = ByteString.decodeBase64("rO0ABXNyAB5va2lvLnNhbXBsZ"
+ "XMuR29sZGVuVmFsdWUkUG9pbnTdUW8rMji1IwIAAkQAAXhEAAF5eHBAIAAAAAAAAEAuA"
+ "AAAAAAA");
Point decoded = (Point) deserialize(goldenBytes);
assertEquals(new Point(8.0, 15.0), decoded);
With this test we can change the serialization of the Point
class without
breaking compatibility.
Encoding a binary file is not unlike encoding a text file. Okio uses the same
BufferedSink
and BufferedSource
bytes for both. This is handy for binary
formats that include both byte and character data.
Writing binary data is more hazardous than text because if you make a mistake it is often quite difficult to diagnose. Avoid such mistakes by being careful around these traps:
-
The width of each field. This is the number of bytes used. Okio doesn't include a mechanism to emit partial bytes. If you need that, you’ll need to do your own bit shifting and masking before writing.
-
The endianness of each field. All fields that have more than one byte have endianness: whether the bytes are ordered most-significant to least (big endian) or least-significant to most (little endian). Okio uses the
Le
suffix for little-endian methods; methods without a suffix are big-endian. -
Signed vs. Unsigned. Java doesn’t have unsigned primitive types (except for
char
!) so coping with this is often something that happens at the application layer. To make this a little easier Okio acceptsint
types forwriteByte()
andwriteShort()
. You can pass an “unsigned” byte like 255 and Okio will do the right thing.
Value | Method | Width | Endianness | Encoded Bytes |
---|---|---|---|---|
3 | writeByte | 1 | 03 |
|
3 | writeShort | 2 | big | 00 03 |
3 | writeInt | 4 | big | 00 00 00 03 |
3 | writeLong | 8 | big | 00 00 00 00 00 00 00 03 |
3 | writeShortLe | 2 | little | 03 00 |
3 | writeIntLe | 4 | little | 03 00 00 00 |
3 | writeLongLe | 8 | little | 03 00 00 00 00 00 00 00 |
Byte.MAX_VALUE | writeByte | 1 | 7f |
|
Short.MAX_VALUE | writeShort | 2 | big | 7f ff |
Int.MAX_VALUE | writeInt | 4 | big | 7f ff ff ff |
Long.MAX_VALUE | writeLong | 8 | big | 7f ff ff ff ff ff ff ff |
Short.MAX_VALUE | writeShortLe | 2 | little | ff 7f |
Int.MAX_VALUE | writeIntLe | 4 | little | ff ff ff 7f |
Long.MAX_VALUE | writeLongLe | 8 | little | ff ff ff ff ff ff ff 7f |
This code encodes a bitmap following the BMP file format.
void encode(Bitmap bitmap, BufferedSink sink) throws IOException {
int height = bitmap.height();
int width = bitmap.width();
int bytesPerPixel = 3;
int rowByteCountWithoutPadding = (bytesPerPixel * width);
int rowByteCount = ((rowByteCountWithoutPadding + 3) / 4) * 4;
int pixelDataSize = rowByteCount * height;
int bmpHeaderSize = 14;
int dibHeaderSize = 40;
// BMP Header
sink.writeUtf8("BM"); // ID.
sink.writeIntLe(bmpHeaderSize + dibHeaderSize + pixelDataSize); // File size.
sink.writeShortLe(0); // Unused.
sink.writeShortLe(0); // Unused.
sink.writeIntLe(bmpHeaderSize + dibHeaderSize); // Offset of pixel data.
// DIB Header
sink.writeIntLe(dibHeaderSize);
sink.writeIntLe(width);
sink.writeIntLe(height);
sink.writeShortLe(1); // Color plane count.
sink.writeShortLe(bytesPerPixel * Byte.SIZE);
sink.writeIntLe(0); // No compression.
sink.writeIntLe(16); // Size of bitmap data including padding.
sink.writeIntLe(2835); // Horizontal print resolution in pixels/meter. (72 dpi).
sink.writeIntLe(2835); // Vertical print resolution in pixels/meter. (72 dpi).
sink.writeIntLe(0); // Palette color count.
sink.writeIntLe(0); // 0 important colors.
// Pixel data.
for (int y = height - 1; y >= 0; y--) {
for (int x = 0; x < width; x++) {
sink.writeByte(bitmap.blue(x, y));
sink.writeByte(bitmap.green(x, y));
sink.writeByte(bitmap.red(x, y));
}
// Padding for 4-byte alignment.
for (int p = rowByteCountWithoutPadding; p < rowByteCount; p++) {
sink.writeByte(0);
}
}
}
The trickiest part of this program is the format’s required padding. The BMP format expects each row to begin on a 4-byte boundary so it is necessary to add zeros to maintain the alignment.
Encoding other binary formats is usually quite similar. Some tips:
- Write tests with golden values! Confirming that your program emits the expected result can make debugging easier.
- Use
Utf8.size()
to compute the number of bytes of an encoded string. This is essential for length-prefixed formats. - Use
Float.floatToIntBits()
andDouble.doubleToLongBits()
to encode floating point values.
This module contains microbenchmarks that can be used to measure various aspects of performance for Okio buffers. Okio benchmarks are written using JMH (version 1.4.1 at this time) and require Java 7.
To run benchmarks locally, first build and package the project modules:
$ mvn clean package
This should create a benchmarks.jar
file in the target
directory, which is a typical JMH benchmark JAR:
$ java -jar benchmarks/target/benchmarks.jar -l
Benchmarks:
com.squareup.okio.benchmarks.BufferPerformanceBench.cold
com.squareup.okio.benchmarks.BufferPerformanceBench.threads16hot
com.squareup.okio.benchmarks.BufferPerformanceBench.threads1hot
com.squareup.okio.benchmarks.BufferPerformanceBench.threads2hot
com.squareup.okio.benchmarks.BufferPerformanceBench.threads32hot
com.squareup.okio.benchmarks.BufferPerformanceBench.threads4hot
com.squareup.okio.benchmarks.BufferPerformanceBench.threads8hot
More help is available using the -h
option. A typical run on Mac OS X looks like:
$ /usr/libexec/java_home -v 1.7 --exec java -jar benchmarks/target/benchmarks.jar \
"cold" -prof gc,hs_rt,stack -r 60 -t 4 \
-jvmArgsPrepend "-Xms1G -Xmx1G -XX:+HeapDumpOnOutOfMemoryError"
This executes the "cold" buffer usage benchmark, using the default number of measurement and warm-up iterations, forks, and threads; it adjusts the thread count to 4, iteration time to 60 seconds, fixes the heap size at 1GB and profiles the benchmark using JMH's GC, Hotspot runtime and stack sampling profilers.
Download the latest JAR or grab via Maven:
<dependency>
<groupId>com.squareup.okio</groupId>
<artifactId>okio</artifactId>
<version>1.14.0</version>
</dependency>
or Gradle:
compile 'com.squareup.okio:okio:1.14.0'
Snapshots of the development version are available in Sonatype's snapshots
repository.
If you are using ProGuard you might need to add the following option:
-dontwarn okio.**
Copyright 2013 Square, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.