List and compare files in different storage systems
jefftucker opened this issue · 0 comments
This feature would let a user enter two locations (e.g. two S3 buckets, or an S3 bucket and a Swift folder), and Motuz would output a list of all files in each location along with their sizes. It could optionally show the set intersection, union, and/or difference, so that a user can find duplicate files (matched on name and file size) and files that are present in one location but NOT in the other. This would help users manage their data more effectively and use their storage more efficiently, e.g. by removing duplicate data or copying over only the missing files.
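To make the set operations concrete, here is a minimal sketch in Python. It assumes each location has already been listed into a dict mapping relative path to size in bytes. The function name `compare_listings` and the result keys are made up for illustration and are not part of Motuz.

```python
def compare_listings(a, b):
    """Compare two listings, each a dict of relative path -> size in bytes."""
    paths_a, paths_b = set(a), set(b)
    common = paths_a & paths_b  # set intersection: paths present in both
    return {
        # set differences: present in one location but not the other
        "only_in_a": sorted(paths_a - paths_b),
        "only_in_b": sorted(paths_b - paths_a),
        # likely duplicates: same relative path and same size
        "same_size": sorted(p for p in common if a[p] == b[p]),
        # same relative path, but the sizes disagree
        "different_size": sorted(p for p in common if a[p] != b[p]),
    }


# Hypothetical example data
bucket = {"data/a.csv": 120, "data/b.csv": 300, "readme.txt": 15}
folder = {"data/a.csv": 120, "data/b.csv": 305, "notes.txt": 42}
result = compare_listings(bucket, folder)
```

Here `result["only_in_a"]` would drive a "copy over only missing files" action, and `result["same_size"]` lists the candidates for duplicate removal.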
Sample implementation:
If I were to compare an S3 bucket to a POSIX file system manually, I would do the following steps:
- run "aws s3 ls --recursive --summarize s3://bucket > bucket.txt"
- run "ls -alR /path/to/folder > folder.txt"
- canonicalize the paths in both bucket.txt and folder.txt so that each line shows the path relative to the root folder/bucket, the file name, and the size in bytes
- sort both files by path and file name
- run "diff bucket.txt folder.txt" to see which files differ between the two locations.
This feature is basically these 5 steps, but between any two arbitrary folders/buckets/etc. in whatever storage systems Motuz supports. If this needs to be submitted as a job whose results the user checks at a later time, that would most likely be fine.
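The five manual steps above can be sketched in Python roughly as follows. This is only an illustration, not how Motuz would necessarily implement it. It assumes the usual `aws s3 ls --recursive` line layout (date, time, size, key), and the helper names `parse_s3_ls`, `list_local`, `to_lines`, and `diff_listings` are hypothetical.

```python
import difflib
import os
import re

# Assumed line layout of `aws s3 ls --recursive`: "<date> <time> <size> <key>"
S3_LS_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+(\d+) (.+)$")


def parse_s3_ls(text):
    """Return {key: size} from saved `aws s3 ls --recursive` output.

    Lines that do not match (blank lines, the --summarize totals) are skipped.
    """
    listing = {}
    for line in text.splitlines():
        match = S3_LS_LINE.match(line)
        if match:
            listing[match.group(2)] = int(match.group(1))
    return listing


def list_local(root):
    """Return {path relative to root: size in bytes} for a local folder."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            listing[rel] = os.path.getsize(full)
    return listing


def to_lines(listing):
    """Canonical form: one 'path<TAB>size' line per file, sorted by path."""
    return [f"{path}\t{size}" for path, size in sorted(listing.items())]


def diff_listings(a, b, name_a="bucket.txt", name_b="folder.txt"):
    """Unified diff of the two canonical listings (empty list if identical)."""
    return list(difflib.unified_diff(to_lines(a), to_lines(b),
                                     name_a, name_b, lineterm=""))
```

Usage would look like `diff_listings(parse_s3_ls(open("bucket.txt").read()), list_local("/path/to/folder"))`. For other storage systems, only the listing step changes; the canonical form and the diff stay the same.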
Nice to have:
- compare the file hash if the storage system makes that readily available in the metadata
- force the creation of the hash for each file in each location and include this in the results. This could then highlight files with the same name and a different hash or the same hash but a different name/path.
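A rough sketch of the hash-based comparison, assuming SHA-256 and local file access. The names `hash_file` and `compare_hashes` are hypothetical, and the choice of SHA-256 is an assumption, not a requirement of the request. Hashes reported by a storage system are not always directly comparable across systems (for example, an S3 ETag for a multipart upload is not a plain MD5 of the content), so forcing a fresh hash may be the only reliable option in some cases.

```python
import hashlib
from collections import defaultdict


def hash_file(path, chunk_size=1024 * 1024):
    """SHA-256 hex digest of a local file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_hashes(a, b):
    """Compare two dicts of relative path -> hex digest.

    Returns paths whose content differs between the locations, and paths
    in `a` whose content also appears in `b` under a different path.
    """
    changed = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])

    # Index location b by hash so content can be looked up regardless of name
    paths_by_hash_b = defaultdict(list)
    for path, digest in b.items():
        paths_by_hash_b[digest].append(path)

    renamed = {}
    for path, digest in a.items():
        others = sorted(p for p in paths_by_hash_b.get(digest, []) if p != path)
        if others:
            renamed[path] = others

    return {
        "same_name_different_hash": changed,
        "same_hash_different_name": renamed,
    }
```

For example, comparing `{"x.txt": "h1", "y.txt": "h2"}` with `{"x.txt": "h9", "z.txt": "h2"}` flags `x.txt` as changed and reports that the content of `y.txt` also exists as `z.txt`.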