thunder-project/thunder

Thunder 1.0.0 Error Loading Multipage Tiffs with "start=" and "stop=" arguments

kr-hansen opened this issue · 9 comments

When loading a multipage tif, I get the error shown below if I use the start= or stop= arguments to thunder.images.fromtif(). I can load the whole multipage tif just fine; the error only appears when I add the start= and stop= arguments. I was able to load the same file correctly with thunder 0.6, but it is not working as expected in thunder 1.0.0.

I've tried going through the code to see if I could put together a pull request myself, but couldn't quite figure out why the error only appears when the start= and stop= arguments are present.

I don't get this error if I use a directory of single-image tiffs; the start= and stop= arguments work as expected in that context.
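For concreteness, a minimal sketch of the contrast (fn is a stand-in for the path to my multipage tif):

import thunder as td

fn = 'path/to/multipage.tif'  # stand-in path

# loads every page of the file without issue
imgs = td.images.fromtif(fn, nplanes=1)

# raises the ValueError below
imgs = td.images.fromtif(fn, start=1, stop=10, nplanes=1)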


ValueError Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 imgs = td.images.fromtif(fn, start=1, stop=10, nplanes=1)

C:\Users\Kyle\Anaconda\lib\site-packages\thunder\images\readers.pyc in fromtif(path, ext, start, stop, recursive, nplanes, npartitions, labels, engine, credentials)
371 return frompath(path, accessor=getarray, ext=ext, start=start, stop=stop,
372 recursive=recursive, npartitions=npartitions, recount=recount,
--> 373 labels=labels, engine=engine, credentials=credentials)
374
375 def frompng(path, ext='png', start=None, stop=None, recursive=False, npartitions=None, labels=None, engine=None, credentials=None):

C:\Users\Kyle\Anaconda\lib\site-packages\thunder\images\readers.pyc in frompath(path, accessor, ext, start, stop, recursive, npartitions, dims, dtype, labels, recount, engine, credentials)
208 flattened = list(itertools.chain(*data))
209 values = [kv[1] for kv in flattened]
--> 210 return fromarray(values, labels=labels)
211
212

C:\Users\Kyle\Anaconda\lib\site-packages\thunder\images\readers.pyc in fromarray(values, labels, npartitions, engine)
79
80 if values.ndim < 2:
---> 81 raise ValueError('Array for images must have at least 2 dimensions, got %g' % values.ndim)
82
83 if values.ndim == 2:

ValueError: Array for images must have at least 2 dimensions, got 1

@kkcthans weird, thanks for reporting!

Just so I understand: you're trying to load a single multi-page tif which has, say, 100 images, and you're calling it with start=1 and stop=10 to load only 9 of those images? And it loads the whole thing fine if you don't use those arguments? And it worked in 0.6?

As far as I recall, start and stop have always referred to files, not images, both in 0.6 and in 1.0, so in your example I'm surprised that it ever worked as you describe! But maybe I'm mistaken.
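To illustrate what I mean, with hypothetical paths:

# directory of single-page tifs: start/stop select files 1 through 9
imgs = td.images.fromtif('dir_of_single_page_tifs', start=1, stop=10)

# single multi-page tif: start/stop still index files, not the pages inside the file
imgs = td.images.fromtif('single_multipage.tif', start=1, stop=10)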

In the meantime, I'll start playing around with a test multi-page tif.

@boazmohar am I right about this? You do a lot with multi-page tifs.

Yes, that is how I used it, as file indexes. And it works for me in both 0.6 and 1.0.
The only difference is that 0.6 and early 1.0 applied a .transpose(1, 2, 0) when loading a 3d image from a tiff, on the assumption that z was (wrongly) the first dimension. 1.0 doesn't assume anything, which I think is better. See: 1654683
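For anyone who relied on the old axis order, a minimal sketch of recovering it under 1.0 (the path and local loading here are assumptions):

import thunder as td

arr = td.images.fromtif('single_multipage.tif').toarray()  # 1.0: shape (pages, y, x)
arr_old = arr.transpose(1, 2, 0)  # 0.6-style shape (y, x, pages)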

Good to know. I must have been thinking of some of the non-multipage tiff datasets when I used start= and stop= previously on 0.6. When I upgraded to 1.0, I no longer had 0.6 around to double-check. My mistake!

@boazmohar Have you noticed whether this currently works with thunder 1.0 running pyspark on a cluster? Using 1.0, I can use start= and stop= as file indexes in local mode as described above, but it seems to break when I pass a Spark context to it.

I have a folder of about 30 multipage tiff image stacks (localdir). I can run both:
imgs = td.images.fromtif(localdir)
imgs = td.images.fromtif(localdir, nplanes=1, engine=sc)

But when I run:
imgs = td.images.fromtif(localdir, engine=sc)

I get the following error:
File "/share/pkg/spark/1.6.0/install/python/pyspark/worker.py", line 111, in main
process()
File "/share/pkg/spark/1.6.0/install/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/share/pkg/spark/1.6.0/install/python/pyspark/serializers.py", line 267, in dump_stream
bytes = self.serializer.dumps(vs)
File "/share/pkg/spark/1.6.0/install/python/pyspark/serializers.py", line 415, in dumps
return pickle.dumps(obj, protocol)
SystemError: error return without exception set

@kkcthans I am using it with pyspark on a cluster and it is working for me, but I always pass the nplanes variable. Given that it is a multipage tiff, what behavior do you want when not passing nplanes?

@boazmohar I guess I was thinking that it would behave how it behaves in local mode.

When you load a multipage tiff in local mode without the nplanes argument, you get an array of shape (# files, # frames, imdim1, imdim2). The number of files can be controlled and indexed by the start= and stop= inputs to fromtif(), so you can still index into a specific file independently after loading the array.

When nplanes is used, however, you get an array of shape (# files * # frames, imdim1, imdim2). You can still get at each file by knowing your number of frames and slicing appropriately (see the sketch below), so I'm not sure the first layout without nplanes is really necessary; I just expected it to behave the same in local and Spark modes.
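Concretely, a sketch of getting at one file's frames under each layout (nframes and the index are illustrative):

# without nplanes: shape is (# files, # frames, y, x)
imgs = td.images.fromtif(localdir).toarray()
file0 = imgs[0]  # frames of the first file, shape (# frames, y, x)

# with nplanes=1: shape is (# files * # frames, y, x)
imgs = td.images.fromtif(localdir, nplanes=1).toarray()
nframes = 1200  # illustrative frames per file
file0 = imgs[0 * nframes:(0 + 1) * nframes]  # same block, recovered by slicing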

It would be good to have consistency across local and Spark modes. When you load a multipage tiff in local mode without nplanes and then call tospark(), you get the same error as if you had tried to load it in Spark mode in the first place.

@freeman-lab I would suggest either removing local mode's ability to load images in this manner, or adding functionality to Spark mode so that it matches local mode's behavior with and without nplanes.

@kkcthans I have now tried it without passing nplanes and it is working for me, as long as the number of planes per file is constant.
I am on the latest commits of Thunder and Bolt, but I don't think this part of the code has changed lately.

data = td.images.fromtif(session.path + 'Test', start=0, stop=2, engine=sc)
data
out: 
Images
mode: spark
dtype: int16
shape: (2, 6048, 36, 72)

@kkcthans What is the size of your files? I guess that if it is working in local mode they are not huge. Pickle errors are very cryptic. If you are willing to share the files in some way, I could try loading them to see if the problem persists.

@boazmohar They are very large files, each on the scale of (1200, 1024, 1024). I could share a zip folder with some of the files, or you could download them directly from our lab's website (http://www.bu.edu/hanlab/resources/). The Software Demo Image Files link on that page should have similarly sized files to what we are using.

It is likely a size thing. This isn't urgent for me, since using nplanes will fit my needs, but I thought it was worth bringing up.
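A rough back-of-envelope calculation supports the size theory: assuming int16 pixels (as in the example above), each (1200, 1024, 1024) stack is about 2.5 GB, past the ~2 GB object size where Python 2's cPickle is known to fail with exactly this kind of opaque SystemError.

import numpy as np

nbytes = 1200 * 1024 * 1024 * np.dtype('int16').itemsize
print(nbytes / 1e9)  # ~2.5 GB per record, above cPickle's ~2 GB limit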