saalfeldlab/n5-aws-s3

Read from path within a bucket

tischi opened this issue · 14 comments

@igorpisarev @axtimwalde

Within the IT infrastructure that we are using we are not allowed to create one bucket per n5 dataset.
We can only create one bucket, e.g. data, and inside it have objects image01.n5, image02.n5, and so on. To accommodate this, we would like to modify N5AmazonS3Reader by adding a new constructor that takes one additional argument, imageName, like so:

public N5AmazonS3Reader(final AmazonS3 s3, final String bucketName, final String imageName, final GsonBuilder gsonBuilder) 

and then prepend the imageName to the pathName, e.g. here:

	@Override
	public boolean datasetExists(final String pathName) throws IOException {

		// prepend the container path inside the bucket, if one was given
		// (pathName is final, so resolve into a new local variable)
		final String fullPath = (imageName != null) ? imageName + "/" + pathName : pathName;
		return getDatasetAttributes(fullPath) != null;
	}
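To make the idea concrete, here is a minimal, self-contained sketch of the proposed path-prefixing logic, independent of the AWS SDK. The class and field names (PrefixedPathSketch, imageName) are illustrative assumptions, not the actual N5 API:

```java
// Hypothetical sketch of the proposed imageName prefixing, not the real reader.
class PrefixedPathSketch {

	private final String imageName; // e.g. "image01.n5"; null means bucket root

	PrefixedPathSketch(final String imageName) {
		this.imageName = imageName;
	}

	// Resolve a dataset path relative to the container inside the bucket.
	String fullPath(final String pathName) {
		if (imageName == null)
			return pathName;
		return imageName + "/" + pathName;
	}
}
```

Every reader method that currently takes a pathName would route through such a helper, so the rest of the implementation stays unchanged.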

Would that work?
Does that make sense?
Would you accept a PR along those lines?

cc @constantinpape

Hi @tischi, indeed, there is no real limitation requiring an N5 container to be a whole bucket. It should be possible to allow an arbitrary path in a bucket to be an N5 container, although it introduces certain edge cases, such as:

Currently, when an N5 container is deleted, the bucket is deleted as well. If we allow storing several N5 containers per bucket, should we still delete the bucket when an N5 container at the root level of the bucket is requested to be removed? Probably yes, since the bucket will be empty anyway, so there is no point in keeping it.
Also I'm not sure if we should allow nested N5 containers, although it seems that it would not be a problem and should not affect anything.

Anyway, I like the proposal, but I would like to look into it a bit more to come up with the right design and make sure the edge cases are covered. I would also like to make it a bit more general (e.g. instead of passing an additional imageName parameter, it might be cleaner to pass a single URL to an N5 container that includes both the bucket and the inner path to the container).

@igorpisarev That sounds great! Could you look into it? I think I do not know enough about the subject matter to make informed decisions here.

Yes, I'll get to it some time soon. I will also implement this for https://github.com/saalfeldlab/n5-google-cloud, which currently has the same requirement that an N5 container be a bucket.

@igorpisarev some thoughts on the questions you brought up:

Currently, when an N5 container is deleted, the bucket is deleted as well. If we allow storing several N5 containers per bucket, should we still delete the bucket when an N5 container at the root level of the bucket is requested to be removed? Probably yes, since the bucket will be empty anyway, so there is no point in keeping it.

It depends. As far as I can see, there's no way of preventing someone from writing other data to a bucket that has an N5 container at its root. So whether deleting the whole bucket is a good idea depends on whether you want to discourage putting other data into the bucket.

Also I'm not sure if we should allow nested N5 containers, although it seems that it would not be a problem and should not affect anything.

I guess it should not cause many problems to allow it (or at least not to strictly forbid it). If it were forbidden, one would always need to walk up to the bucket root and check, for every level in between, that it is not the root of an N5 container.

pass a single URL to an N5 container that includes the bucket and the inner path to the container

Makes sense. Is there an AWS API call for this already, or would you need to parse the URL manually?

Thanks for your input @constantinpape. I agree, perhaps there should be a separate method for deleting a bucket, or maybe it should even be deferred to the user.
Regarding URL parsing, S3 SDK already provides this functionality: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3URI.java
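For anyone curious what that parsing amounts to, here is a minimal sketch of the s3://bucket/key case using only java.net.URI. This is an illustration, not the SDK class: the real AmazonS3URI additionally handles virtual-hosted-style and path-style HTTP(S) endpoints:

```java
import java.net.URI;

// Minimal illustration of splitting an s3:// URL into bucket and key.
// Only covers the s3://bucket/key form; AmazonS3URI handles more variants.
class S3UriSketch {

	// The bucket name is the authority component of the URI.
	static String bucket(final String s3Url) {
		return URI.create(s3Url).getHost();
	}

	// The key is the path with the leading '/' stripped.
	static String key(final String s3Url) {
		final String path = URI.create(s3Url).getPath();
		return path.startsWith("/") ? path.substring(1) : path;
	}
}
```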

OMERO and IDR are likely to be in a similar situation to @tischi's, in that a single bucket will be available for multiple filesets. Whatever the mechanism to enable it, when in this "multi-use bucket" mode, treating the bucket as roughly equivalent to an entire filesystem would be safest when it comes to data safety.

I like the bucket being removed if it is the root of the N5 container. That is equivalent to making / the root of an N5 container on a filesystem, which would attempt to do the same. Storing non-N5 data in an N5 container is certainly possible, depending on the backend, but it is not something we should support and thereby encourage. If you do this, you should know what you're doing and deal with the consequences.

Not sure, but maybe this could be a natural API:

public N5AmazonS3Reader(final AmazonS3 s3, final String bucketName, final String key, final GsonBuilder gsonBuilder) 

As this would reflect

public S3Object getObject(String bucketName, String key)

The difference is that in our case the key would not point to an actual object but would be the root path of the N5 image data objects. For end users, however, it may be OK to think of it as a key because, in the end, what the library does is read from that key, in some sense.

I agree that we should change the N5 root from bucket to bucket + key with the special case that key is empty or /.
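The bucket + key convention above can be sketched as a small helper: an empty or "/" key means the container sits at the bucket root, otherwise dataset paths are resolved under the key. Names here are illustrative, not the library's API:

```java
// Sketch of resolving a dataset path against an N5 container root given as
// bucket + key, where an empty key or "/" denotes the bucket root.
class ContainerRootSketch {

	static String resolve(final String key, final String pathName) {
		// strip leading/trailing slashes so "" and "/" both mean the root
		final String normalized = (key == null) ? "" : key.replaceAll("^/+|/+$", "");
		if (normalized.isEmpty())
			return pathName; // container at bucket root: old behavior
		return normalized + "/" + pathName;
	}
}
```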

@igorpisarev
Do you have a timeline for doing the changes?
We would like to submit a publication that depends on this by the end of next week.
No problem if you don't manage it by then; I would just ship my own branch of this repo with my Fiji update site. Just let us know.

@tischi Yes, I'll get to it either tomorrow or early next week.

@tischi @constantinpape I've added this functionality and released a new version 3.0.0.

Now you can open a specified path in the bucket using the following constructors:

N5AmazonS3Reader(final AmazonS3 s3, final String bucketName, final String containerPath)

or

N5AmazonS3Reader(final AmazonS3 s3, final AmazonS3URI containerURI)

There are also overloads with GsonBuilder if you need to read/store any custom attributes, and the same constructors are available for N5AmazonS3Writer as well.

Everything is tested and should work, but let us know if you run into any problems!

Thanks @igorpisarev! We will try to integrate this into our workflow next week.