awslabs/mountpoint-s3

Inconsistent file behavior with mountpoint-s3

Closed this issue · 15 comments

/kind bug

NOTE: If this is a filesystem related bug, please take a look at the Mountpoint repo to submit a bug report

What happened?

We have started using https://github.com/awslabs/mountpoint-s3-csi-driver in EKS cluster to mount s3 bucket as volume. Facing some issues while handling files.

  1. If one of the thread deletes a files from mountpoint, the 2nd thread still sees that file. I had to add a delay of 1 second for 2nd thread, after that it also started seeing the file as deleted file (i.e. file did was not shown to this thread). Pls note that both these threads are from same application and accessing same mountpoint. Is there a better way of achieving this? (i am running with --allow-delete flag) As i have similar functionality at different place in my application.
  2. There are two processes in my application, Process one writes file in bucket using mount point , process two reads files from bucket using SDK. Right after (after 30 ms or so) process 1 completes writing to file (i.e after calling fileWriter.Close()), if process two tries to read files, SOMETIMES its says that file is not available. SOMETIMES - based on my analysis so far, if there is a longer delay at process 2, then files is available where as if process 2 starts right after (30-40 ms) process 1 the files is not available.

Mount Options:
` accessModes:
- ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
mountOptions:
- uid=65436
- gid=50
- allow-other
- allow-delete

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    - Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15",
    Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.15",
  • Driver version:
  • 1.8.1

Thanks for opening an issue, @akhilesh-delphix. I note that you are not using any kind of caching so should not be impacted by any relaxed consistency guarantees from that.

If one of the thread deletes a files from mountpoint, the 2nd thread still sees that file. I had to add a delay of 1 second for 2nd thread, after that it also started seeing the file as deleted file (i.e. file did was not shown to this thread). Pls note that both these threads are from same application and accessing same mountpoint. Is there a better way of achieving this? (i am running with --allow-delete flag) As i have similar functionality at different place in my application.

If the two applications are performing these concurrently, one thread may see the file while a delete (aka. unlink) is in-progress. If the two actions are actually running serially, then this behavior would be surprising (since a filesystem delete/unlink should not return until this has been persisted both to the local view and in Amazon S3). Really, we'd need timestamped logs from the application and from Mountpoint to understand what's happening and when. I'd also be interested to understand what you mean by the second thread "seeing the file" - is this from listing the directory or from trying to open or stat the file?

To avoid seeing this behavior, there needs to be a synchronization point between the two threads so that the second thread will always see the new state in S3.

There are two processes in my application, Process one writes file in bucket using mount point , process two reads files from bucket using SDK. Right after (after 30 ms or so) process 1 completes writing to file (i.e after calling fileWriter.Close()), if process two tries to read files, SOMETIMES its says that file is not available. SOMETIMES - based on my analysis so far, if there is a longer delay at process 2, then files is available where as if process 2 starts right after (30-40 ms) process 1 the files is not available.

There are some differing behaviors based on how the file was opened, documented in our Semantics Documentation.

If the application has written some data and it has not duplicated the file descriptor, closing the file should be synchronous and block the system call. I suspect you are in this situation and so should not be experiencing the non-blocking behavior. I'm assuming you're using Java's FileWriter class, which I presume will also block its close method on the system call however this would be something worth verifying. You may also be able to use a tool like strace to review what system calls are being made when.


Apologies for the rather complex answers, please do ask if any clarifications are required.

hi @dannycjones
Thanks a lot for this detailed response.

I think lets address/discuss one issue at a time.

Issue one (which is solved by adding 1 second of gap between the two threads, pls note that 2nd thread only executes ones the first one completes, still i had to had 1 second of delay):

Here is the scenario
inside bucket i have following file structure
main_dir/1/src/file1.txt
main_dir/1/src/file2.txt

Thread 1 runs to delete these files directory '1' recursively. Once its complete successfully.
<IF i add delete here, it works>
Thread 2 uses a Java Filewiter write data into file 'main_dir/1/src/file1.txt' (pls note that the file name is same which was deleted by thread 1), but it fails and gives java.io.FileNotFoundException exception while trying to open file.

Here is the stack (apologies for sharing java stack here)

at java.base/java.io.FileOutputStream.open0(Native Method)
	at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:123)
	at java.base/java.io.FileWriter.<init>(FileWriter.java:66)

Pls note our code base works fine when we use NFS to store/manage files.

Will it be possible to have a connect over slack or teams or any other channel on this topic (in case you think the discussion is too specific)?

hi @dannycjones
I tried to capture Mount point/ Driver/My application logs when this error happens (Pls note that this issue is intermittent)

Mount point logs

Oct 03 15:33:53 hyperscale-k8s mount-s3[31915]: [WARN] rmdir{req=6914 parent=261 name="src"}: mountpoint_s3::fuse: rmdir failed: inode error: file "src" does not exist in parent inode 261 (full key "default_staging_area/16/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[32151]: [WARN] rmdir{req=9668 parent=566 name="masking-file-formats"}: mountpoint_s3::fuse: rmdir failed: inode error: file "masking-file-formats" does not exist in parent inode 566 (full key "default_staging_area/19/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[32151]: [WARN] rmdir{req=9710 parent=577 name="3cf64fcfec1a09b7e801f3ed45887a65"}: mountpoint_s3::fuse: rmdir failed: inode error: file "3cf64fcfec1a09b7e801f3ed45887a65" does not exist in parent inode 577 (full key "default_staging_area/19/tgt/8767199970ae27ffbe4de378f73378fd/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[32151]: [WARN] rmdir{req=9712 parent=576 name="8767199970ae27ffbe4de378f73378fd"}: mountpoint_s3::fuse: rmdir failed: inode error: file "8767199970ae27ffbe4de378f73378fd" does not exist in parent inode 576 (full key "default_staging_area/19/tgt/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[32151]: [WARN] rmdir{req=9714 parent=566 name="tgt"}: mountpoint_s3::fuse: rmdir failed: inode error: file "tgt" does not exist in parent inode 566 (full key "default_staging_area/19/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[31915]: [WARN] rmdir{req=7130 parent=368 name="3cf64fcfec1a09b7e801f3ed45887a65"}: mountpoint_s3::fuse: rmdir failed: inode error: file "3cf64fcfec1a09b7e801f3ed45887a65" does not exist in parent inode 368 (full key "default_staging_area/19/src/8767199970ae27ffbe4de378f73378fd/")
Oct 03 15:42:14 hyperscale-k8s mount-s3[31915]: [WARN] rmdir{req=7132 parent=367 name="8767199970ae27ffbe4de378f73378fd"}: mountpoint_s3::fuse: rmdir failed: inode error: file "8767199970ae27ffbe4de378f73378fd" does not exist in parent inode 367 (full key "default_staging_area/19/src/")
Oct 03 15:42:15 hyperscale-k8s mount-s3[31915]: [WARN] rmdir{req=7134 parent=366 name="src"}: mountpoint_s3::fuse: rmdir failed: inode error: file "src" does not exist in parent inode 366 (full key "default_staging_area/19/")
Oct 03 15:57:42 hyperscale-k8s mount-s3[32151]: [WARN] rmdir{req=10422 parent=595 name="masking-file-formats"}: mountpoint_s3::fuse: rmdir failed: inode error: file "masking-file-formats" does not exist in parent inode 595 (full key "default_staging_area/20/")

Driver Log

I1003 15:41:53.419641       1 node.go:207] NodeGetCapabilities: called with args 
I1003 15:42:26.836592       1 node.go:207] NodeGetCapabilities: called with args 
I1003 15:42:54.614552       1 node.go:207] NodeGetCapabilities: called with args 
I1003 15:43:16.608525       1 node.go:207] NodeGetCapabilities: called with args 

My Application Log (indicating that its not able to open file right after deleting file with same name)

03-10-2024 15:42:16.953 [UnloadThread-19-TABLE_NAME-KMCVM_2_0] ERROR c.d.h.s.UnloadService.exportDataAndWriteToFiles - Failed in reading row for hsExecutionId: 19, sourceKey: {schema_name=DLPXDBORA, table_name=KMCVM_2} 
java.io.FileNotFoundException: /etc/hyperscale/default_staging_area/19/src/8767199970ae27ffbe4de378f73378fd/3cf64fcfec1a09b7e801f3ed45887a65/3cf64fcfec1a09b7e801f3ed45887a65_0_0.csv (No such file or directory)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:123)
        at java.base/java.io.FileWriter.<init>(FileWriter.java:66)

Hey @akhilesh-delphix, the failure with java.io.FileNotFoundException suggests that the file has been deleted successfully by thread 1. It seems like your thread 1 removed all the files in a directory, so Mountpoint ended up with empty directories and considered them to be deleted too (see more details in directory operations semantic).

Later on thread 2, your application was trying write to a new file in a non-existing directory so it failed with java.io.FileNotFoundException.

It still doesn't explain why putting a delay between the two threads make things work though. Probably because there are some other threads that create the deleted directories during that time. We will need debug logs to confirm our assumption and for further investigation.

To enable debug logs, you can pass CLI arguments from CSI Driver to Mountpoint in the mountOptions of the Persistent Volume spec. For example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # Ignored, required
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  claimRef:
    namespace: default
    name: s3-pvc
  mountOptions:
    - debug
    - debug-crt
  csi:
    driver: s3.csi.aws.com # Required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: s3-csi-driver

I actually don't think this is related to the CSI Driver specifically, so just running your application against Mountpoint directly should work too.

hi @monthonk
you are right about thread 1, it deletes the file.

However thread 2, first creates file with same name/path and then tries to open it. Its done by Java FileOutputStream API (i had attached stack for it earlier). Our application code simply passes the path/name.

Here is code for at java.base/java.io.FileOutputStream.(FileOutputStream.java:123)

 public FileOutputStream(String name) throws FileNotFoundException {
        this(name != null ? new File(name) : null, false);
    }

at java.base/java.io.FileOutputStream.(FileOutputStream.java:235)

Following code successfully validates file path and validity of file and then in 2nd last line (open(name, append);), it tries to open file and throws exception (java.io.FileNotFoundException)

public FileOutputStream(File file, boolean append)
        throws FileNotFoundException
    {
        String name = (file != null ? file.getPath() : null);
        @SuppressWarnings("removal")
        SecurityManager security = System.getSecurityManager();
        if (security != null) {
            security.checkWrite(name);
        }
        if (name == null) {
            throw new NullPointerException();
        }
        if (file.isInvalid()) {
            throw new FileNotFoundException("Invalid file path");
        }
        this.fd = new FileDescriptor();
        fd.attach(this);
        this.path = name;

        open(name, append);
        FileCleanable.register(fd);   // open sets the fd, register the cleanup
    }

The check on file.isInvalid() is very limited according to the javadoc. It does not guarantee that the path is valid. Would you be able to grab Mountpoint debug logs? We will need to know what file system calls Mountpoint receives from the kernel to identify the issue.

hi @monthonk
Mount log files

mount.logs.zip

I have stripped the log file to include logs for duration when 1st thread executed delete operation and 2nd thread executes the create and open file operation.

Here i am posting the exception received by 2nd thread (this exception message includes the file name which its trying to create, along with time, it should help in mapping it to log).

09-10-2024 06:53:14.091 [UnloadThread-6-TABLE_NAME-TEST_CLEANUP_0_1] ERROR c.d.hyperscale.service.UnloadService.exportDataAndWriteToFiles - Failed in reading row for hsExecutionId: 6, sourceKey: {schema_name=DLPXDBORA, table_name=TEST_CLEANUP_0} 
java.io.FileNotFoundException: /etc/hyperscale/default_staging_area/6/src/8767199970ae27ffbe4de378f73378fd/0d4e15bcfc913063cb07a7ef6a100532/0d4e15bcfc913063cb07a7ef6a100532_1_0.csv (No such file or directory)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:293)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:235)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:123)
        at java.base/java.io.FileWriter.<init>(FileWriter.java:66)

Hey, thanks for providing the logs. I have reviewed them and here is what I got from the logs:

  1. The application listed the files (readdir) in directory 0d4e15bcfc913063cb07a7ef6a100532.
  2. The application deleted all file in directory 0d4e15bcfc913063cb07a7ef6a100532 making the prefix 0d4e15bcfc913063cb07a7ef6a100532/ disappear from S3 bucket.
  3. The application tried to delete directory 0d4e15bcfc913063cb07a7ef6a100532 with rmdir but failed to do so because Mountpoint did lookup against S3 and the prefix was not found.
  4. The tricky part is that the kernel also has a cache from the first readdir (1 second by default) and your application might rely on this cache to determine the existence of the directory at some point.
  5. Before your application open a new file, it sent lookup calls to check the directory existence and eventually found that it doesn't exist. So, the FileNotFoundException was thrown because the application was trying to write to a non-existing directory.

I still don't understand why adding a delay would could make it work. Are there any logic in thread 2 that create the directory if not exist? Also would be helpful to get debug logs from this case too.

Hi @monthonk ,
Like i mentioned in my previous comment following Java API code creates new file (new File(name)), our application simply calls this API with a file-name/path to it

 public FileOutputStream(String name) throws FileNotFoundException {
        this(name != null ? new File(name) : null, false);
    }

Pls note that we have not defined any cache while defining volume mount options.

Mount Options:
 accessModes:
- ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
mountOptions:
- uid=65436
- gid=50
- allow-other
- allow-delete

Is it possible to connect over slack or any other communication channel, so that we can discuss and identify root cause?

hi @monthonk ,

I was thinking about point 4 that you had mentioned in your response. Could that result into following case:

  1. when application invokes new File(name), based on point 4 the file is available in cache, hence it does not create a new file (ideally it should have as the file has been deleted by thread 1),
  2. by the time, it invokes API to open file (open(name, append); ), the file is removed from the cache, and hence it throws the FileNotFoundException.

this just my hypothesis.

Hey @akhilesh-delphix, I think the missing information is that we don't know how Java API interacts with the file system. Like Danny said, a tool like strace could help in this case.

Other communication channel we have is through AWS support. Please feel free to open a support case if you prefer a private channel.

From point 4, I meant the parent directory of the file could appear to exist where it's already gone. I think this behavior is surprising for users and we may want to improve that. I opened #1055 for tracking.

Unfortunately, I don't have a good workaround for your use case right now. From what I understand, thread 1 is supposed to delete all the files and directories recursively. Would it be possible to have thread 2 wait until you can confirm that the directory has been deleted before creating a new file?

Thread 2 starts to run after Thread 1 complete deletion on files and directories recursively, and deletion of files is a blocking call. Hence thread 2 only starts after files and directories are deleted.

If adding wait is the workaround, i will add a wait of 1 second (which worked for me earlier) . It, however, is little challenging as i need to add this at multiple places in code.

Meanwhile, we have created a support ticket. I can share that support ticket number here if you want.

Lets discuss issue 2 and workaround for issue 1 as part of support ticket.

Thanks @akhilesh-delphix, please share the ticket number so we can follow up there.

I'm closing this issue now. We can create new ones for tracking specific issues if any were found.

hi @monthonk : support ticket number : 172865712600531