brendan-duncan/archive

Provide sync variant of OutputFileStream.close()

Opened this issue · 6 comments

Currently the OutputFileStream.close() method is async, awaiting on the async RandomAccessFile.close() method.

However, there is a RandomAccessFile.closeSync() variant, which would allow adding an OutputFileStream.closeSync() variant without breaking the current API.

Or maybe just make the current close() method synchronous, especially given that all other operations on the RandomAccessFile are the sync ones.

This would also allow making ArchiveFile.close() non-async (currently the only async method there), or adding an ArchiveFile.closeSync() variant.
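Sketched out, the requested API would look something like this. This is purely illustrative: the `closeSync()` call is the proposed addition (not in the package at this point), and the `output:` named parameter of `ZipEncoder.encode` is assumed from the archive 3.x API.

```dart
import 'package:archive/archive_io.dart';

void encodeToFile(Archive archive) {
  final output = OutputFileStream('out.zip');
  ZipEncoder().encode(archive, output: output);
  output.closeSync(); // proposed: delegates to RandomAccessFile.closeSync()
}
```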

InputFileStream has closeSync, so I'll add that to OutputFileStream. Not sure why it doesn't have it already, just neglect. I agree the whole API is a mess with sync / async. One of these days I'll do a big 4.0 update and clean up the mess. Probably won't happen for a while.

Thank you. If you could add ArchiveFile.closeSync() that would be great too...

On a side note, I'm somewhat puzzled about the most performant way to zip a bunch of files on a local file system (i.e. dart:io). Using InputFileStream with ArchiveFile seems slower than calling File.readAsBytesSync() and passing the bytes to ArchiveFile. Similarly, ZipEncoder seems faster writing to an in-memory OutputStream followed by File.writeAsBytesSync() than writing to an OutputFileStream.
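For illustration, the two encoding paths being compared look roughly like this (archive 3.x API assumed; exact signatures, including the nullable return of `encode` and the `output:` parameter, vary by package version):

```dart
import 'dart:io';
import 'package:archive/archive_io.dart';

void main() async {
  // In-memory path: read the whole file, encode, write the result once.
  final bytes = File('a.bin').readAsBytesSync();
  final archive = Archive()..addFile(ArchiveFile('a.bin', bytes.length, bytes));
  File('mem.zip').writeAsBytesSync(ZipEncoder().encode(archive)!);

  // Streaming path: encode directly into an OutputFileStream.
  final output = OutputFileStream('stream.zip');
  ZipEncoder().encode(archive, output: output);
  await output.close();
}
```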

Additionally, it is unclear when to close an InputFileStream that was passed to an ArchiveFile and added to a ZipEncoder. Maybe ZipEncoder.addFile() could take an optional autoclose flag to release these opened files. Currently it is really easy to hit a "too many open files" error when those files are left unclosed.

File streaming will always be slower than just reading all the bytes into memory. File IO is particularly slow, especially with Dart. The problem is that big files take lots of memory, often too much. So it is a trade-off: do you read the entire file into memory, or do you stream it?

I'll take a look at an autoclose option.

I pushed to git adding closeSync to OutputFileStream and ArchiveFile. And I added autoClose named arg to ZipEncoder encode and addFile methods.
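A hypothetical usage of the freshly pushed additions, going by the comment above (the `autoClose` argument name is stated there; its exact position and the rest of the signature are assumptions on my part):

```dart
import 'package:archive/archive_io.dart';

void encodeWithAutoClose(Archive archive) {
  final output = OutputFileStream('out.zip');
  // autoClose: release file handles held by ArchiveFile entries after encoding.
  ZipEncoder().encode(archive, output: output, autoClose: true);
  output.closeSync(); // newly added sync variant
}
```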


I know the IOSink is slow and there are several stale issues in the dart-lang about it, e.g. dart-lang/sdk#32874 or dart-lang/sdk#17951

However, if you are in the dart:io namespace, you could use File.openSync(mode: FileMode.write) that returns a RandomAccessFile and then use RandomAccessFile methods writeFromSync(), readSync() and readIntoSync() instead of using IOSink/Stream from the File.openRead/openWrite() methods.
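A minimal example of that sync round trip, using only dart:io APIs:

```dart
import 'dart:io';

void main() {
  // Synchronous writes via RandomAccessFile instead of IOSink.
  final writer = File('data.bin').openSync(mode: FileMode.write);
  writer.writeFromSync([10, 20, 30, 40]);
  writer.closeSync();

  // Synchronous reads into a preallocated buffer.
  final reader = File('data.bin').openSync();
  final buffer = List<int>.filled(4, 0);
  final n = reader.readIntoSync(buffer);
  reader.closeSync();
  print('read $n bytes: $buffer'); // read 4 bytes: [10, 20, 30, 40]
}
```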

In my tests, writes with RandomAccessFile are over 10x faster compared to IOSink (that is on OSX MBP with SSD).

Another approach is to buffer the IOSink. I get a significant improvement with many small writes, though it is still about 2-3x slower than RandomAccessFile when using a 4-8 MB buffer.

Here is my naive (not fully tested) BufferedSink POC:

import 'dart:async';
import 'dart:convert';
import 'dart:io';
import 'dart:math';
import 'dart:typed_data';

class BufferedSink implements IOSink {
  BufferedSink(this._sink, [this._bufferSize = 4 * 1024 * 1024])
      : encoding = _sink.encoding,
        _buffer = Uint8List(_bufferSize);

  final IOSink _sink;
  final int _bufferSize;

  final Uint8List _buffer;
  int _offset = 0;

  @override
  Encoding encoding;

  @override
  void add(List<int> data) => _addInternal(data);

  @override
  void addError(Object error, [StackTrace? stackTrace]) =>
      _sink.addError(error, stackTrace);

  @override
  Future addStream(Stream<List<int>> stream) =>
      stream.listen(add, onError: addError, cancelOnError: true).asFuture();

  @override
  Future get done => _sink.done;

  @override
  Future flush() {
    _flushInternal();
    return _sink.flush();
  }

  @override
  Future close() {
    _flushInternal();
    return _sink.close();
  }

  @override
  void write(Object? object) => add(utf8.encode(object.toString()));

  @override
  void writeAll(Iterable objects, [String separator = '']) {
    var isFirst = true;
    for (var object in objects) {
      if (isFirst) {
        isFirst = false;
      } else {
        write(separator);
      }
      write(object);
    }
  }

  @override
  void writeCharCode(int charCode) => write(String.fromCharCode(charCode));

  @override
  void writeln([Object? obj = '']) {
    if (obj != '') {
      write(obj);
    }
    write('\n');
  }

  void _addInternal(List<int> data) {
    if (data.length > _bufferSize) {
      // Oversized chunk: flush whatever is buffered, pass the chunk through.
      if (_offset > 0) {
        _sink.add(_buffer.sublist(0, _offset) + data);
        _offset = 0;
      } else {
        _sink.add(data);
      }
    } else {
      final maxLength = _bufferSize - _offset;
      final length = min(data.length, maxLength);
      final end = _offset + length;
      _buffer.setRange(_offset, end, data);
      _offset = end;

      if (end == _bufferSize) {
        _flushInternal();
      }

      // Copy any remainder that did not fit before the flush.
      final remaining = data.length - length;
      if (remaining > 0) {
        _buffer.setRange(0, remaining, data, length);
        _offset = remaining;
      }
    }
  }

  void _flushInternal() {
    // Emit only the bytes actually written; sublist copies, so the
    // buffer can be reused.
    if (_offset > 0) {
      _sink.add(_buffer.sublist(0, _offset));
      _offset = 0;
    }
  }
}

And the benchmark is like this:

import 'dart:io';
import 'dart:typed_data';

void main() async {
  final file = File("foo.bin");

  final sw = Stopwatch()..start();

  // IOSink out = BufferedSink(file.openWrite(), 8 * 1024 * 1024);
  RandomAccessFile out = file.openSync(mode: FileMode.write);
  for (var i = 0; i < 40000; i++) {
    out.add(Uint8List(65536)); // ~2.6 GB in 64 KB chunks
  }
  await out.close();

  print("total time: ${sw.elapsedMilliseconds} ms");
}

extension _RAF on RandomAccessFile {
  void add(List<int> data) => writeFromSync(data);
}

The RandomAccessFile is already used by InputFileStream and OutputFileStream... So, to clarify, the comment above was mostly about current limitations of the existing InputStreamBase implementations: I have to provide either a data buffer or a file (e.g. to be added to an archive).

But it would be nice to have one that takes an IOSink plus an in-memory buffer, so the consumer could write to it without loading the entire content into memory or creating a file.