tee-ar-ex/trx-python

Benchmark against faster Python trk loader


At some point, we should compare with @emanuele's https://github.com/emanuele/load_trk, a more optimized Python trk reader.

Hi @arokem, I've just updated my repository with two additional implementations that visibly speed up loading. For now, I just named them load_trk_new.py and load_trk_numba.py. The simple test test_load_trk_numpy_save.py compares them and also reports the timing of loading the same amount of data with numpy.load(), which I consider the lower bound. The output of this test, executed on my laptop (SSD), appears in the README. Let me know if I can be of help for your future comparisons!
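For anyone wanting to rerun such a comparison, a minimal timing harness along these lines should do (the function names are illustrative, not the actual test script):

```python
# Hedged sketch of the comparison described above: time each loader on
# the same tractogram, with numpy.load() on an equivalent array as the
# lower bound. Loader imports and file paths are hypothetical.
import time
import numpy as np

def run_once(fn, *args):
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0

def bench(label, fn, *args, repeats=3):
    best = min(run_once(fn, *args) for _ in range(repeats))
    print(f"{label}: {best:.2f} s")

# bench("load_trk_new", load_trk_new, "big_tractogram.trk")
# bench("load_trk_numba", load_trk_numba, "big_tractogram.trk")
# bench("numpy.load (lower bound)", np.load, "big_tractogram.npy")
```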

Hi @emanuele, I noticed that the numba version loads all the data into memory before processing, unlike the other methods, which read only the requested streamlines from the file (via a sequence of seeks and reads).
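For context, a minimal sketch of that seek-and-read pattern (an illustrative function, assuming the per-streamline byte offsets and point counts are already known):

```python
# Read only the requested streamlines from an open binary file, one
# seek + read per streamline; everything else on disk is never touched.
import numpy as np

def read_streamlines(f, byte_offsets, n_points, indices):
    out = []
    for i in indices:
        f.seek(byte_offsets[i])
        data = f.read(n_points[i] * 3 * 4)  # 3 float32 coords x 4 bytes
        out.append(np.frombuffer(data, dtype=np.float32).reshape(-1, 3))
    return out
```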

@MarcCote, true indeed. The numba version is by no means complete - at the moment it uses too much memory. But in practice it is the one we use most frequently in our lab, because it loads large tractograms in a few seconds and memory is not our main constraint - a trade-off of memory for time. I'll try to improve on that side in the future. Thanks for the feedback!

I made some improvements to parsing the streamline lengths. It kind of exploits how uint32 and float32 represent numbers in binary. Here's the relevant part. I'm curious to see what you guys think - I'm probably missing some corner cases, right?
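For readers without the snippet at hand, here is a hedged reconstruction of the kind of trick being described (not the original code; it assumes no per-point scalars or per-track properties):

```python
# Sketch of the idea: uint32 and float32 are both 4-byte items, so the
# whole TRK body (after its 1000-byte header) can be viewed as a single
# uint32 array. Each length field then decodes directly as an integer,
# and the 3*n float32 coordinates that follow are simply hopped over,
# never converted. Assumes n_scalars = n_properties = 0.
import numpy as np

def streamline_lengths(trk_path):
    buf = np.fromfile(trk_path, dtype=np.uint32, offset=1000)
    lengths = []
    i = 0
    while i < buf.size:
        n = int(buf[i])      # length field, read directly as uint32
        lengths.append(n)
        i += 1 + 3 * n       # skip the 3*n float32 coordinates
    return np.array(lengths, dtype=np.uint32)
```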

I have created an experimental JavaScript implementation that includes a benchmark against other formats. These results may be very different from Python. In particular, JavaScript natively supports neither the uint64 offsets required by TRX nor the float16 used by the default converter. I realize this issue is specifically about Python, but each of you may have insights regarding the JavaScript implementation and opportunities for optimization. One could also use the same datasets to evaluate the native Python and C++ implementations. The trx32 entry shows the performance if both offsets and positions use 32-bit values (though this is not allowed by the current standard).

[Benchmark results chart (Apple M1)]

Pretty cool demo. It is intriguing that trk and tck perform so much better than trx! Do you have an intuition as to why?

Also, I have a naive question: did you compare the loaded streamlines from the different files to make sure the data was properly loaded? I don't see that in trx_bench.mjs.

TRK and TCK have no compression. Both are actually pretty slow, because they interleave the track length and vertex position information; they just look fast relative to the compressed formats. Segmenting the offsets and the positions makes them a lot faster. VTK requires a lot of disk space because tff_convert_tractogram uses float64 for the vertex positions.
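To illustrate the layout point in Python terms (hypothetical file names, loosely following the TRX naming scheme): segmented arrays load with two bulk reads plus zero-copy splitting, instead of per-streamline hopping through an interleaved stream:

```python
# With offsets and positions stored as separate arrays, loading is two
# contiguous reads; np.split then yields per-streamline views without
# copying the position data.
import numpy as np

offsets = np.fromfile("offsets.uint64", dtype=np.uint64)
positions = np.fromfile("positions.3.float32", dtype=np.float32).reshape(-1, 3)
# offsets holds each streamline's starting vertex (first entry is 0).
streamlines = np.split(positions, offsets[1:].astype(np.intp))
```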

For this experiment, were the TRX files compressed (deflate)? How do they compare if not compressed (stored)?

And does JavaScript support memmap (reading a whole chunk into memory at a time), or does it have to actually read the bytes into memory (sequentially)?

@MarcCote: Did you compare the loaded streamlines from the different files to make sure the data was properly loaded?

The JavaScript code is used by NiiVue, which allows users to drag and drop TRK, TCK, VTK, FIB and TRX streamlines using any web-enabled device (phone, tablet, computer). This provides a zero-footprint test for these formats. The minimal demo also allows you to visualize the DPS (properties) and DPV (scalars) of TRK and TRX files, if they are available.

@frheault

  1. I have updated the benchmark to compare trk and trk.gz with 3 levels of gzip compression (1 fastest, 6 default, 11 slowest), as well as TRX with 3 levels of zip compression (1 fastest, 4 default, 6 slowest); a Python sketch for reproducing the zip levels follows this list.
    [Benchmark chart (M1)] In brief, trx level 1 is slightly slower than trk.gz, while trx levels 4 and 6 are particularly slow.
  2. The compressed benchmarks are about 50% faster than before - I switched to fflate instead of Pako/JSZip. This influences both trx and trk.gz, so the overall story does not change.
  3. For C programs, zstd offers dramatically faster decompression than gzip. Like gzip, it is a single-file method, so it is not compatible with trx's multi-file zip container. However, I did try trk.zstd using the JavaScript fzstd library - I include the comparison below. zstd does allow remarkable compression, but the JavaScript decompressor is not as fast as fflate.
    [Benchmark chart: zstd (M1)]
    The finding that the default (3) zstd level decompresses faster than both the fast (1) and slow (19) levels is surprising but reproducible.
  4. I think my JavaScript TRX reading implementation is complete. I was thinking of generating a pull request to this repository. Can you see any missing features? Again, this is available as a drag and drop web page viewer, which can help others support this specification.
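As referenced in item 1, here is a minimal Python sketch for producing the same archive at different deflate levels (the member names are illustrative):

```python
# Repackage the same TRX members at three deflate levels with Python's
# zipfile, mirroring the zip-level comparison above (Python >= 3.7 for
# the compresslevel parameter).
import zipfile

members = ["header.json", "offsets.uint64", "positions.3.float16"]
for level in (1, 4, 6):
    with zipfile.ZipFile(f"tract.deflate{level}.trx", "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for m in members:
            zf.write(m)
```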

@neurolabusc Thank you for this, I will dig into it this weekend!

I was wondering if it would be possible for trx to have "0" compression (using STORED instead of DEFLATE)? In Python, thanks to the zip header, we don't have to unzip at all and all files are accessible directly (just as if it were a folder).

This would be closer to trk/tck without compression - the 'base' file format, rather than a choice of compression we put on top of it.
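For what it's worth, a minimal sketch of what STORED enables (not trx-python's actual code; the member name and dtype are illustrative): because an uncompressed member's bytes sit verbatim in the archive, they can be memory-mapped in place by computing the data offset from the zip's local file header:

```python
# Memory-map an uncompressed (STORED) zip member without extracting it.
# The local file header has a 30-byte fixed part followed by the file
# name and extra field; the member's raw bytes start right after.
import struct
import zipfile

import numpy as np

def memmap_member(zip_path, member, dtype):
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.getinfo(member)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("member is DEFLATEd; it cannot be mapped in place")
    with open(zip_path, "rb") as f:
        f.seek(info.header_offset)
        header = f.read(30)  # fixed part of the local file header
        name_len, extra_len = struct.unpack("<HH", header[26:30])
    data_offset = info.header_offset + 30 + name_len + extra_len
    n_items = info.file_size // np.dtype(dtype).itemsize
    return np.memmap(zip_path, dtype=dtype, mode="r",
                     offset=data_offset, shape=(n_items,))

# e.g. positions = memmap_member("tract.trx", "positions.3.float16", np.float16)
```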

@frheault This is a good point.

My original test was with the raw output from your tff_convert_tractogram script, which does use STORED, not DEFLATE. However, I had only tested uncompressed TRX with pako, and did not retry it after I discovered the faster fflate. I retested with fflate, and it does a great job. The remaining big performance penalty for TRX is its use of float16, which is not native to JavaScript (or x86-64). The TRX32 row shows the performance when uint32 is used for offsets and float32 for positions. With these modifications, the size is the same as uncompressed TRK (as both are uncompressed with the same precision), but TRX is much faster (since the scalar, property, position and line-length information are not interleaved, and each can be read with a single block read):

| format | msec x10 | mb |
| --- | --- | --- |
| TRK | 4060 | 137.4 |
| TRX | 14712 | 69.9 |
| TRX32 | 499 | 137.4 |

I would strongly advocate for the specification to be changed to optionally allow offsets to be uint32 (this is the maximum allowed by most graphics languages, it is sufficient for most processed datasets, and it results in smaller files). Likewise, it would be great if your converter gave the user the option of float32 precision for positions.

As an aside, your latest modifications to tff_convert_tractogram.py lead to a segmentation fault when I try to convert any tracts, for example from here:

```
python ./tff_convert_tractogram.py dpsv.trx dp.trk
zsh: segmentation fault  python ./tff_convert_tractogram.py dpsv.trx dp.trk
```

This is a regression, as an earlier commit (14f7cac) works fine.

That regression is a good reminder that we need tests for everything. At least smoke testing.

I did a little research: while x86-64 does not natively support float16 arithmetic, it includes the SSE/AVX instructions for converting half (float16) to single precision (float32), `_mm_cvtph_ps`/`_mm256_cvtph_ps`, and the reverse, `_mm_cvtps_ph`/`_mm256_cvtps_ph`. Likewise, ARM has the analogous NEON instructions `VCVT_F32_F16` and `VCVT_F16_F32`. This would allow low-level languages like C and sophisticated libraries like NumPy to rapidly decode/encode float16s. Unfortunately, these are not available in JavaScript.
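To make the contrast concrete, here is an illustrative Python sketch of the per-value software decode that a language without native float16 support has to perform, next to NumPy's one-line vectorized conversion:

```python
# Hand-decode a 16-bit half-precision bit pattern: 1 sign bit, 5
# exponent bits (bias 15), 10 fraction bits. This is roughly the
# per-value work JavaScript must do in software.
import numpy as np

def half_to_float(h: int) -> float:
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:                      # zero or subnormal
        val = frac * 2.0 ** -24
    elif exp == 0x1F:                 # infinity or NaN
        val = float("inf") if frac == 0 else float("nan")
    else:                             # normal number
        val = (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)
    return -val if sign else val

# NumPy instead converts in one vectorized pass, which compiled builds
# can back with the hardware instructions mentioned above.
halves = np.array([1.0, 0.5, 65504.0], dtype=np.float16)
singles = halves.astype(np.float32)
```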