tee-ar-ex/trx-python

Invalid streamlines for software

Opened this issue · 7 comments

This is outside of the scope of the specification, but it is related to what is written inside of the 'container'.

Even if it is true that someone could write what he/she wants in the file and get away with it for internal use, it could be nice to discuss what we DO NOT want to see when we share files.

  1. For example, at the SCIL (and to some extend in Dipy), streamlines with 0 points are problematic (I think it cannot be written or loaded by Nibabel. I believe no software should play with this idea. In the TRX that would be achieved by having two sequential identical offset values. This should absolutely be against the specification, offsets should have unique, always increasing values.

  2. At the SCIL, streamlines with 1 point are not considered streamlines. It can be written in TRK/TCK and is somewhat useful for some visualization (Workaround some tools to show only endpoints) and mostly debug. I think that for "normal" data generation of tractogram streamlines with only 1 point should be removed by default. Leaving a --debug_single_point is acceptable.

  3. Streamlines with 2 perfectly overlapping point, a step-size below 1x10^-6, are problematic for some software. Furthermore, since TRX supports float16 this would become perfect zero and that could be a problem. I think these should be removed by default. Leaving a --debug_overlap_point is acceptable.

  4. Finally, it would be nice that streamlines do not end up out of the volume used to generate them. That's true in RASMM as well as VOX. Why would a streamline be outside of the data? Of course, it can be written and even loaded in most software and be usable data. BUT it would be nice if we had an agreement not to write streamline like that in "normal" data generation of tractogram. Leaving a --debug_outside_bbox is acceptable.

Related to this issue, I saw more than once single point streamlines or entire streamlines at -INF or +INF, or streamlines having their last/first points at -INF or +INF, which was causing rendering issues with automatic camera position (MITK, TrackVis), etc. So removing them not only makes sense it could help a bit.

Internally we have a scil_remove_invalid_streamlines.py that removes any streamline of these 3 cases, so I am sure we can deal with this ourselves. But in an effort to bring more unity to the field and facilitate the exchange of data and limits corner case for everyone in every tools/library it would be nice that the default case is more stable.
This is similar to reserved keyword/tag for metadata and header, nothing is forcing anyone except the desire to make it easy to exchange data.

PS : I mentioned the TRX for 'fun fact', but this is actually relevant for any file format past, present or future.

@jchoude @mdesco in your experience with weird datasets, phantoms, challenges and clinical data have you ever seen something that could be relevant here?

It would not be mandatory, I don't think anyone wants the file format to magically crash by itself if these happen. But having an agreement on what no one likes could be useful when its time to implement their reader/writer and other I/O functions.

What you mention above covers (I think) most weird cases I'm seen in the latest years. If you need, I could find data with such examples.

I appreciate these might cause trouble with various software implementations. Personally though, I think as long as the format allows for zero or length 1 streamlines without being overtly malformed, implementations will need to be robust to that. Murphy's Law dictates they're going to happen...

Same with coordinates that end up outside the FoV of the original dataset: there's nothing actually wrong with that, and it's hard to even detect without access to the original dataset and explicitly checking for this. What if the vertex is right on the boundary, but ends up outside due to a simple rounding error? But more generally, users are bound to do this eventually, and they may even have legitimate reasons for doing so that we haven't anticipated.

Yes, we could mandate that compliant data should not do any of the above, but I expect someone will eventually do just that - maybe they've not read the spec. Or more likely, they may not have bothered to perform the sanity checks this would require when generating the data. Regardless of what the specification might state, I don't think there's any way around making implementations robust to these cases.

Just my 2¢, obviously...

I totally agree that the file format should not be limited to avoid explicitly these cases (which are bound to happen).

But I think that the main "generator" of tractogram could make an effort to avoid these cases. Since we do have access to the data (since we are generating tractogram) and that such checks are easy to do as each streamline get generated we could avoid being the ones introducing such 'corner case'.

So I understand now that this is outside of the scope of the specification, but my point can then be switch to implementation. Do you think avoiding these cases (in mrtrix for example) by adding a few checks at save time of tckgen would have a downside? (skipping the checks could be optional)

But I think that the main "generator" of tractogram could make an effort to avoid these cases.

Do you think avoiding these cases (in mrtrix for example) by adding a few checks at save time of tckgen would have a downside?

I think it's important here to establish that algorithms that perform tractography to generate streamlines from DWI data are not the only sources of streamlines data. There are multiple other processes for which streamline vertices are an output.

The one that caught us out was non-linear spatial transformation of streamlines vertices. In the ideal case, this is merely a shift in position of each vertex individually, and data-per-streamline / data-per-point (-vertex) would not only not be affected but also have correspondence to both pre-transformation and post-transformation vertex data. However, vertices of a streamline may lie outside of the volume of valid non-linear transformation information. In the case of the .tck file format, non-finite values (NaN, Inf) are reserved as delimiters, and hence retaining such vertices in-place but with "invalid" values was not possible. So removal of those vertices from the streamlines data was necessary, which immediately breaks correspondence between data-per-point (/-vertex) data and streamlines vertices. If all vertices of a particular streamline lie outside this region, then one can either write a zero-vertex streamline (now our solution: MRtrix3/mrtrix3#2159) or break correspondence between data-per-streamline and vertex data. Now in the case of something like TRX, you could detect that such is occurring and re-write all contained DPP / DPS data as necessary, hence "preventing the breaking of correspondence between associated data"; but it would nevertheless still break the implicit correspondence between the original streamlines data source and that result.

In my case all I do is report on the presence of zero-vertex and single-vertex / "zero-length" streamlines if the user runs a command that calculates streamlines lengths (tckstats). I would similarly endeavour to issue warnings for any calculations where such data actively preclude such calculations, though I certainly can't guarantee I've caught all cases. But similarly to @jdtournier I'd prefer that the awkwardness of such cases, or existing assumptions of specific softwares, not lead to explicit forbidding of such data within the specification.

If TRX does not use delimiters and hence NaN-valued vertices could be utilised in such a scenario, that would potentially be more faithful to the originating data; but if softwares currently err with single-vertex streamlines or streamlines with two vertices that are too close to one another, they would almost certainly have issues with NaN-valued vertex locations; so that would not mitigate the need for greater care in software implementations.


For fun's sake, another interesting use case is vertex masking, i.e. using a binary image to modulate which streamline vertices are retained and which should be removed from the dataset. Masks are often used during tractography to control when to stop propagation, but there's no theoretical reason why this process can't be applied post hoc to include / exclude streamline vertices. But there's a trap: a single streamline input can be bisected as a result of the masking process. With our code currently, in this case a single streamline in the input file can result in multiple streamlines in the output file, breaking DPS correspondence. The alternative (if permissible in a format) would be to NaN-fill the excluded vertices. Maybe more faithful to the process would be to produce a DPP mask based on that mask image, but e.g. if one wants to use such a mask to control which streamline vertices to display, there then needs to be a software mechanism in place to modulate whether or not each vertex is displayed based on DPP data.

Do you think avoiding these cases (in mrtrix for example) by adding a few checks at save time of tckgen would have a downside?

Not really a big deal, no. These corner cases can be useful in some contexts (e.g. just storing the seed point locations when no streamline was generated), but the current framework provides alternative and better ways of doing this anyway. Not storing streamlines outside the original dataset is what happens anyway, but we do often still need to handle cases where the streamlines are being mapped or transformed onto different datasets whose FoV don't match. So I don't think that needs any special handling at write-time, but it definitely requires handling where required at read time.

Just to be clear: I have no objection to the spec providing a list of undesirable cases like this. I just don't think it will therefore remove the need for validation when loading these data.

I agree that streamlines outside of volume is quite common for registration. My point was specifically only for data generation, at save time during the tractography process, same with overlapping points or zero/single step streamlines. These "cases" would only be undesirable (but still allowed by the file format) straight out of tckgen or local_tracking in Dipy for example. Advanced processing such as connectomics or nonlinear transformation are more complex, so in my mind I was not requesting anything concerning these operations.

I will not include these as undesirable in the specs since I was proposing to try to avoid these cases only at tractography generation, the very first step only. And only in normal cases, default config tracking. And since various operations will likely still get into these corner cases (rightfully or not) it is true that code will have to handle it one way or another.

PS: When I said -INF +INF, it was also -65536/65536 (highest allowed values, but I saw that more often with trk than tck)