saalfeldlab/n5-aws-s3

Question about isContainerBucketRoot()

tischi opened this issue · 16 comments

@igorpisarev @axtimwalde @constantinpape

I am trying to use the new version of this library and am running into some issues.

Here:
https://github.com/saalfeldlab/n5-aws-s3/blob/master/src/main/java/org/janelia/saalfeldlab/n5/s3/N5AmazonS3Reader.java#L209

This check was added:

&& (isContainerBucketRoot() || exists("/"))

Could you please explain what that does and what the purpose is?

This evaluates to false for us in this case:
[screenshot]
where key is what you are calling containerPath.

The other issue is that the call exists("/") in the line above internally calls s3.listObjectsV2(listObjectsRequest), which, for us, takes forever to complete (many seconds, and sometimes fails with a timeout); not sure why. Do you know what the listObjectsV2 call is doing?

Thanks for reporting this. To me this looks like something that needs fixing. How do you create the N5AmazonS3Reader?

Is key meant to be the containerPath, or do you double-use it as a dataset? "/" datasets are legal but considered a bit of an edge case that we haven't tested much. I could imagine that, if this is how you use N5, you are listing into a dataset, which will take a while because it would have to filter through all the block subdirectories. A quick fix for this would be to test datasetExists first, as sketched below.

@igorpisarev has moved on to a new job and I am not sure if he is still following this. Can you detail your application a bit more?
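As a sketch, the quick fix could look like this (assuming an N5Reader n5 and a String path; just an illustration, not code from the library):

import java.io.IOException;
import org.janelia.saalfeldlab.n5.DatasetAttributes;
import org.janelia.saalfeldlab.n5.N5Reader;

// quick-fix sketch: test datasetExists() before listing, so that we never
// list() into a dataset and iterate over all of its block keys
if (n5.datasetExists(path)) {
	// path points at a dataset: read its attributes directly
	final DatasetAttributes attributes = n5.getDatasetAttributes(path);
} else {
	// path points at a group: listing its children is comparatively cheap
	final String[] children = n5.list(path);
}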

I create it like this:

new N5AmazonS3Reader( s3, bucketName, containerPath )

with, e.g.

bucketName = "platyBrowser"
containerPath = "rawdata/sbem-6dpf-1-whole-raw.n5"

The data structure beneath that then looks like this, e.g.:

rawdata/sbem-6dpf-1-whole-raw.n5/setup0/attributes.json

you are listing into a dataset which will take a while because it would have to filter through all the block subdirectories.

Does that mean that this

final ListObjectsV2Request listObjectsRequest = new ListObjectsV2Request()
				.withBucketName(bucketName)
				.withPrefix(prefix)
				.withMaxKeys(1);
final ListObjectsV2Result objectsListing = s3.listObjectsV2(listObjectsRequest);

interprets the slashes ("/") in the keys as directories and then descends into them? I always thought object stores do not have a folder structure. If that's the case, maybe we should reconsider the whole code in exists( pathName )?
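If I understand the S3 API correctly, object stores are indeed flat: withPrefix(prefix) simply matches every key that starts with prefix, including all the block keys. To emulate a one-level directory listing, one would additionally need a delimiter (a sketch, assuming the AWS Java SDK v1 types used above; s3, bucketName and prefix as in the snippet):

import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

// with a delimiter, keys below the first "/" after the prefix are rolled up
// into common prefixes instead of being enumerated one by one
final ListObjectsV2Request listObjectsRequest = new ListObjectsV2Request()
		.withBucketName(bucketName)
		.withPrefix(prefix)   // e.g. "rawdata/sbem-6dpf-1-whole-raw.n5/"
		.withDelimiter("/")
		.withMaxKeys(1);
final ListObjectsV2Result objectsListing = s3.listObjectsV2(listObjectsRequest);
final boolean exists = !objectsListing.getObjectSummaries().isEmpty()
		|| !objectsListing.getCommonPrefixes().isEmpty();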

What is your definition of a "dataset" vs. "containerPath==key"?
I have the feeling we are using them as synonyms, maybe you had something else in mind?
I think that, the way we are currently using the API, this check does not make much sense: https://github.com/saalfeldlab/n5-aws-s3/blob/master/src/main/java/org/janelia/saalfeldlab/n5/s3/N5AmazonS3Reader.java#L385
because it would always evaluate to false.

Sorry, I need to call it a day for now, can work more on it tomorrow!

Hi @tischi, @axtimwalde,
looking at the code now, I remember that it would check the container version only if it's a non-empty directory tree. isContainerBucketRoot() was meant to handle the special case where a bucket is itself an N5 container (i.e. attributes.json is hosted at the root of the bucket).
I'm also not sure anymore if this check is really needed; you can try to simplify or remove it and validate with the unit tests, which should cover everything pretty well.

There was another consideration for the case when the constructor of N5AmazonS3Reader is called as part of creating an N5AmazonS3Writer: because the reader is the superclass, its constructor runs before the constructor of the writer, so the container may not exist yet. The unit tests cover this as well.

That said, I'm not sure why it evaluates to false in your case -- perhaps there is indeed a bug somewhere. Hope the above helps to figure it out! As @axtimwalde mentioned, I've recently left Janelia and will not be involved in modifying this code.

only if it's a non-empty directory tree

I think that's fine in theory, but in practice the call s3.listObjectsV2(listObjectsRequest) does not seem performant.
It could be that we need another way to figure out whether a path exists, e.g. something like the sketch below.
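One cheaper alternative could be to probe a single known object instead of listing (a sketch; doesObjectExist issues a single HEAD request, but it assumes an attributes.json exists at the path, which plain groups may not have):

// one HEAD request instead of a LIST operation; the key layout is made up
final String key = containerPath + "/" + pathName + "/attributes.json";
final boolean exists = s3.doesObjectExist(bucketName, key);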

@constantinpape Where do we currently specify the version (I guess that's a small file called version or something)? For each dataset or only at the bucket root? I guess we should have it for each dataset, right?

Where do we currently specify the version (I guess that's a small file called version or something)? For each dataset or only at the bucket root? I guess we should have it for each dataset, right?

Are you referring to the n5 version?
Each dataset container (=each file with '.n5' extension) has it stored at the root level in attributes.json. See point 3 of the n5 specification.

Edit: @tischi and I tend to use the term "dataset" differently than the n5 spec does. According to the n5 spec, the correct term here is container.
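For example, the version can be read back through the N5 API (a sketch; the "n5" attribute key is what the spec defines for the version):

import org.janelia.saalfeldlab.n5.N5Reader;
import org.janelia.saalfeldlab.n5.s3.N5AmazonS3Reader;

// the "n5" key in the container root's attributes.json holds the version string
final N5Reader n5 = new N5AmazonS3Reader(s3, bucketName, "rawdata/sbem-6dpf-1-whole-raw.n5");
final String version = n5.getAttribute("/", "n5", String.class);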

@axtimwalde
For us, it works to simply remove the additional checks. I created a PR for this: #9
Would that work for you or do you think they are needed?

Are you referring to the n5 version?
Each dataset (=each file with '.n5' extension) has it stored in the root level in the attributes.json. See the n5 specification point 3.

Each N5 container has one n5 attribute in its root group. An N5 container is NOT a dataset; it can act as a dataset if it contains only one dataset whose name is "" or "/". Datasets are the chunked n-dimensional data; they are like groups in that they have attributes, but they cannot contain other groups. So the above statement is confusing and partially wrong.
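To illustrate with a made-up container: only the root group carries the n5 version attribute, and a dataset is marked by its dimensions, blockSize, dataType and compression attributes:

myContainer.n5/attributes.json                   <- root group, contains {"n5": "<version>"}
myContainer.n5/group1/                           <- a group: may contain groups and datasets
myContainer.n5/group1/dataset1/attributes.json   <- a dataset: dimensions, blockSize, dataType, compression
myContainer.n5/group1/dataset1/0/0/0             <- block files; no further groups below a dataset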

Datasets are the chunked n-dimensional data; they are like groups in that they have attributes, but they cannot contain other groups. So the above statement is confusing and partially wrong.

Yes, you are right. I am mixing up the terms here because, in our use case, we only ever have a single dataset per n5 container, so we tend to use the terms as synonyms.

I will update my comment to make that clear.

Great! Unrelated to this issue, which I consider fixed by @tischi's pull request: do you think that this is good practice? We made N5 a container format so that datasets can be sorted hierarchically inside the container. This is also why we started AWS-S3 and Google-Cloud support with bucket==container: we assumed that you can pack everything you want into the hierarchy of a container. Of course this does not include any non-N5 files, if you have any...

I agree, the single dataset per container is not good practice in general.

For the project at hand, we started out with the bdv.hdf5 format, which only supports up to 100 setup ids per file; that was not sufficient for us, so @tischi wrote a gui on top of bdv to deal with multiple files. We have since switched to the bdv.n5 format, which allows an unlimited number of setup ids, but we haven't changed this layout because that would require some changes in the gui, and there is no obvious mapping of datasets to setups.

Also, what I wrote above about having a single dataset per container is not quite correct: we have a single bdv setup per container, which actually corresponds to a hierarchy like this:

setup0/timepoint0/s0
setup0/timepoint0/s1
setup0/timepoint0/s2
...

where s0..sN are the datasets for multiple resolutions.

@axtimwalde
Yes, all of this is historical, as we came from bdv.h5, where having everything in one giant h5 file has, imho, a couple of disadvantages, including the difficulty of writing into it in parallel.

Now, using n5, what do you think should be best practice?

one container (OC)

Using only one bdv.n5 container would look like this, right?

myProject.n5
  /setup0
  /setup1
  ...

multiple containers (MC)

What we do looks like this:

myProject
  /em-raw-highres.n5
  /geneXYZ.n5
  /geneABC.n5

At first glance, MC seems to have the advantage that one can immediately get a sense of the data from the container names, while for OC one would have to go into the attributes.json to get more information on the content, right? Working with biological end users on a daily basis, we have found that it is useful to have human-readable information in file and folder names. Of course, in the specific n5-aws-s3 setting, where end users would most likely not access the data directly, this is less of an argument.

Also, the way BigDataViewer currently works, it would open all the setups at once, which could be slow and overwhelming, and would explode its brightness-and-contrast UI, which used to be limited to 10 setups (this should be better since version 9.*, but I have not tested it yet).

But those are not very strong arguments. I would be curious what you think are the main advantages of having everything in one container?

If your intention is to make each container a BDV project, then MC makes the most sense, and there is nothing wrong with that. I was assuming that you have some custom structure that you open with a dedicated frontend that uses bigdataviewer-vistools to display a bunch of sources. That would also free you from the setup naming convention and allow arbitrary human-readable names. We are currently working on making the n5-viewer plugin more flexible so you can do exactly that: open arbitrary single- or multi-scale datasets from arbitrary locations in an N5 container. Too late for your launch, I guess :).

We are currently working on making the n5-viewer plugin more flexible so you can do exactly that, open arbitrary single- or multi-scale datasets from arbitrary locations in an N5 container.

This sounds good and is indeed what I would need to make the OC version work for me.
Frankly, I have never tried the n5-viewer. Could you briefly describe what the differences are and why it was necessary to branch from the normal bigdataviewer?

N5-viewer was created when we had just written the first N5 draft and BDV proper was unaware of it. Its purpose was to visualize large multi-scale exports of cloud-stitched multi-channel LLSM data. It also has a crop tool, which is handy for users in Fiji who want to work on a Fiji-sized crop of the large volume. It is BDV with different assumptions about the source format and different sources, plus the crop tool. We are currently generalizing this into an all-purpose N5 viewer that includes HDF5, Zarr, and the cloud backends supported through the N5 API.