saalfeldlab/n5-aws-s3

Writing n5 dataset creates link to n5 directory inside n5 directory (minio)

karlduderstadt opened this issue · 2 comments

First of all, thanks for this awesome N5 writer for AWS. We are using it to write data to a MinIO server that speaks the same protocol. For some strange reason, when we write datasets we always get an extra file inside the N5 directory that links back to the same directory. I added a screenshot from the MinIO web interface:
[Screenshot of the MinIO web interface, 2023-10-19, showing an extra dataset.n5 object inside the dataset.n5 directory]

Here I wrote dataset.n5 to the bucket cmg, and when I open it you can see that it also contains a file called dataset.n5. When I click that file, it seems to link back to the same directory. Do you know why this is created? Is it possible to prevent the creation of this file? It is not needed, and when I try to delete it, the whole dataset.n5 is deleted.

Before I dig deeper, I wanted to ask whether you know why this might happen and whether it is a MinIO-specific issue as compared to AWS.

I tested with the latest version in SciJava pom 37 and had the same issue as with the versions included with Fiji.

I am using the following script to write the data. It converts all Micro-Manager position (TIFF) sequences and adds them to the N5.

#@ String filepath
#@ String bucketName
#@ String basePath
#@ DatasetIOService ioService

import org.janelia.saalfeldlab.n5.*
import org.janelia.saalfeldlab.n5.imglib2.N5Utils
import java.util.concurrent.Executors
import org.janelia.saalfeldlab.n5.s3.*

import com.amazonaws.auth.AWSCredentials
import com.amazonaws.auth.AWSStaticCredentialsProvider
import com.amazonaws.auth.AnonymousAWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.AmazonS3URI
import com.amazonaws.client.builder.AwsClientBuilder

def inputPath = new File(filepath)
def positions = []

String endpoint = "http://server:9000"

if (inputPath.isDirectory()) {
        def files = inputPath.listFiles()
        for (file in files)
                if (file.isDirectory())
                        positions.add(new File(file.getAbsolutePath() + "/metadata.txt"))

        if (positions.size() == 0)
                positions.add(new File(inputPath.getAbsolutePath() + "/metadata.txt"))
} else {
        positions.add(inputPath)
}

AmazonS3 s3
AWSCredentials credentials = null
try {
    credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
} catch(final Exception e) {
    System.out.println( "Could not load AWS credentials, falling back to anonymous." )
}
final AWSStaticCredentialsProvider credentialsProvider =
        new AWSStaticCredentialsProvider(credentials == null ? new AnonymousAWSCredentials() : credentials)

//US_EAST_2 is used as a dummy region.
s3 = AmazonS3ClientBuilder.standard()
        .withPathStyleAccessEnabled(true)
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, Regions.US_EAST_2.getName()))
        .withCredentials(credentialsProvider)
        .build()

for (file in positions) {
        println "Processing " + file.getAbsolutePath()

        if (!file.exists()) {
                println "Unrecognized format, could not locate metadata.txt file."
                continue
        }

        def dataset = ioService.open(file.getAbsolutePath())
        println "Dimensions " + dataset.dimensionsAsLongArray()

        def blockSize
        if (dataset.numDimensions() == 5) {
                blockSize = new int[] {128, 128, 1, 1, 64}
        } else if (dataset.numDimensions() == 4) {
                blockSize = new int[] {128, 128, 1, 64}
        } else if (dataset.numDimensions() == 3) {
                blockSize = new int[] {128, 128, 64}
        } else {
                println "Unexpected number of dimensions " + dataset.numDimensions() + ". Aborting."
                continue
        }

        N5Utils.save(
                dataset.getImgPlus(),
                new N5AmazonS3Writer(s3, bucketName, basePath),
                file.getParentFile().getName(),
                blockSize,
                new GzipCompression(),
                Executors.newFixedThreadPool(30))
}

println "Done converting dataset to N5"
System.exit(0)

Any feedback, hints, or tips would be most welcome. Even just letting me know that you have never seen this issue would help.

I investigated a bit more, and I see that this internal file/link is created when directories are created in this method:

public void createDirectories(final String normalPath) throws IOException {

    String path = "";
    for (final String component : components(removeLeadingSlash(normalPath))) {
        path = addTrailingSlash(compose(path, component));
        if (path.equals("/")) {
            continue;
        }
        final ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(0);
        s3.putObject(
                bucketName,
                path,
                new ByteArrayInputStream(new byte[0]),
                metadata);
    }
}

So when you create directories with no content on MinIO, it creates this internal link back to the directory that was just created. I assume this doesn't happen on AWS, so it hasn't been an issue in the past.
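For anyone who wants to reproduce this outside of N5, here is a minimal sketch (reusing the s3 client from the script above; the bucket and key names are just placeholders) that performs the same zero-byte put with a trailing-slash key:

import com.amazonaws.services.s3.model.ObjectMetadata

// Same zero-byte put that createDirectories performs. On MinIO, the
// resulting "dataset.n5/" object shows up as the link back to itself.
def metadata = new ObjectMetadata()
metadata.setContentLength(0)
s3.putObject("cmg", "dataset.n5/", new ByteArrayInputStream(new byte[0]), metadata)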

But why do you need to create directories in the first place? As I understand it, putting an object implicitly creates its full path; in fact, folders don't really exist on AWS, they are just longer keys. Is there a way to prevent the directory creation step?
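To illustrate what I mean, a quick sketch (again with made-up bucket and key names): an object can be put directly at a nested key without any intermediate directory objects, and a delimiter listing still reports the prefix as if it were a folder.

import com.amazonaws.services.s3.model.ListObjectsV2Request

// "Folders" are just key prefixes: this put needs no dataset.n5/ object.
s3.putObject("cmg", "dataset.n5/attributes.json", "{}")

// A delimiter listing still shows the prefix as a folder.
def listing = s3.listObjectsV2(new ListObjectsV2Request()
        .withBucketName("cmg")
        .withDelimiter("/"))
println listing.getCommonPrefixes()   // [dataset.n5/]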

Ah, OK, there appears to be an existing MinIO issue about exactly this question. I guess this is a result of differences between AWS and MinIO: minio/minio#2423

So I will close the issue. But I am happy for any thoughts, since I believe others using self-hosted MinIO servers might also run into this problem.

My current workaround is to delete the directory objects after writing the N5 dataset. Writing the N5 creates a directory object for each subdirectory, so this has to be done for all of them. I have now added this at the bottom of my script using the s3.deleteObject function. For completeness, in case it helps someone else, I have included my revised script below.

I find these objects very confusing on MinIO storage, so I prefer to remove them. Also, deleting them in the web interface seems to delete all the objects in the directory and its subdirectories, which was somewhat unexpected. Removing them right after writing avoids all this confusion.

#@ String filepath
#@ String bucketName
#@ String basePath
#@ DatasetIOService ioService

import org.janelia.saalfeldlab.n5.*
import org.janelia.saalfeldlab.n5.imglib2.N5Utils
import java.util.concurrent.Executors
import org.janelia.saalfeldlab.n5.s3.*

import com.amazonaws.auth.AWSCredentials
import com.amazonaws.auth.AWSStaticCredentialsProvider
import com.amazonaws.auth.AnonymousAWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.AmazonS3URI
import com.amazonaws.client.builder.AwsClientBuilder

def inputPath = new File(filepath)
def positions = []

String endpoint = "http://server:9000"

if (inputPath.isDirectory()) {
	def files = inputPath.listFiles()
	for (file in files)
		if (file.isDirectory())
			positions.add(new File(file.getAbsolutePath() + "/metadata.txt"))

	if (positions.size() == 0)
		positions.add(new File(inputPath.getAbsolutePath() + "/metadata.txt"))
} else {
	positions.add(inputPath)
}

AmazonS3 s3
AWSCredentials credentials = null
try {
    credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
} catch(final Exception e) {
    System.out.println( "Could not load AWS credentials, falling back to anonymous." )
}
final AWSStaticCredentialsProvider credentialsProvider =
        new AWSStaticCredentialsProvider(credentials == null ? new AnonymousAWSCredentials() : credentials)

//US_EAST_2 is used as a dummy region.
s3 = AmazonS3ClientBuilder.standard()
        .withPathStyleAccessEnabled(true)
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, Regions.US_EAST_2.getName()))
        .withCredentials(credentialsProvider)
        .build()

for (file in positions) {
	println "Processing " + file.getAbsolutePath()

	if (!file.exists()) {
		println "Unrecognized format, could not locate metadata.txt file."
		continue
	}

	def dataset = ioService.open(file.getAbsolutePath())
	println "Dimensions " + dataset.dimensionsAsLongArray()

	def blockSize
	if (dataset.numDimensions() == 5) {
		blockSize = new int[] {128, 128, 1, 1, 64}
	} else if (dataset.numDimensions() == 4) {
		blockSize = new int[] {128, 128, 1, 64}
	} else if (dataset.numDimensions() == 3) {
		blockSize = new int[] {128, 128, 64}
	} else {
		println "Unexpected number of dimensions " + dataset.numDimensions() + ". Aborting."
		continue
	}

	N5Utils.save(
	        dataset.getImgPlus(),
	        new N5AmazonS3Writer(s3, bucketName, basePath),
	        file.getParentFile().getName(),
	        blockSize,
	        new GzipCompression(),
	        Executors.newFixedThreadPool(30))

	// Remove the zero-byte directory objects created for each path
	// component. Keys in the bucket have no leading slash.
	def fullPath = basePath + "/" + file.getParentFile().getName()
	def partsPath = ""
	for (part in fullPath.split("/")) {
		partsPath += part + "/"
		s3.deleteObject(bucketName, partsPath)
	}
}

println "Done converting dataset to N5"
System.exit(0)
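
As an aside, a more generic cleanup variant (a sketch, not what the script above does; it assumes the same s3 client, bucketName, and basePath) would be to list everything under the container prefix and delete every zero-byte key ending in a slash, which also catches the markers of nested groups:

import com.amazonaws.services.s3.model.ListObjectsV2Request

// Delete every zero-byte "directory marker" under the container prefix,
// following continuation tokens in case the listing is paginated.
def request = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withPrefix(basePath + "/")
while (true) {
	def result = s3.listObjectsV2(request)
	result.getObjectSummaries()
	      .findAll { it.getKey().endsWith("/") && it.getSize() == 0 }
	      .each { s3.deleteObject(bucketName, it.getKey()) }
	if (!result.isTruncated()) break
	request.setContinuationToken(result.getNextContinuationToken())
}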