GoogleCloudDataproc/hadoop-connectors

GoogleCloudStorageFileSystem#delete recursive does not page

mswintermeyer opened this issue

GoogleCloudStorageFileSystem#delete assumes that the list of files it is deleting fits in memory. When deleting a very large directory recursively, rather than deleting one page at a time, it loads every entry into a single List:

```java
? listFileInfoForPrefix(fileInfo.getPath(), DELETE_RENAME_LIST_OPTIONS)
```

It seems there is a listFileInfoForPrefixPage method that could be used instead, calling it iteratively and deleting each page of results until all files are deleted.
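A minimal sketch of what page-at-a-time deletion could look like. Note this uses illustrative stand-in types (`Lister`, `ListPage`) rather than the connector's actual listFileInfoForPrefixPage signature, which this issue does not spell out; the point is only that memory usage is bounded by the page size instead of the total object count.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of page-at-a-time recursive delete; names are illustrative, not the connector API. */
public class PagedDeleteSketch {

  /** One page of listing results plus the token for the next page (null when exhausted). */
  record ListPage(List<String> items, String nextPageToken) {}

  /** Stand-in for a paged listing call such as listFileInfoForPrefixPage. */
  interface Lister {
    ListPage listPage(String prefix, String pageToken);
  }

  /**
   * Deletes everything under the prefix one page at a time, so at most one page
   * of entries is held in memory. Returns the number of entries deleted.
   */
  static int deleteRecursively(Lister lister, String prefix, List<String> deleted) {
    String token = null;
    int count = 0;
    do {
      ListPage page = lister.listPage(prefix, token);
      for (String item : page.items()) {
        deleted.add(item); // in the real connector this would issue a delete RPC
        count++;
      }
      token = page.nextPageToken();
    } while (token != null);
    return count;
  }

  public static void main(String[] args) {
    // Fake lister serving two pages of results.
    List<List<String>> pages = List.of(List.of("dir/a", "dir/b"), List.of("dir/c"));
    Lister lister = (prefix, token) -> {
      int idx = token == null ? 0 : Integer.parseInt(token);
      String next = idx + 1 < pages.size() ? Integer.toString(idx + 1) : null;
      return new ListPage(pages.get(idx), next);
    };
    List<String> deleted = new ArrayList<>();
    int n = deleteRecursively(lister, "dir/", deleted);
    System.out.println(n + " " + deleted);
  }
}
```

The loop's memory footprint is the page size regardless of how many objects live under the prefix, which is the behavior this issue is asking for.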

By contrast, S3's deletion code uses an iterator to delete directories recursively: https://github.com/apache/hadoop/blob/4bd873b816dbd889f410428d6e618586d4ff1780/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/DeleteOperation.java#L244-L246.