Unblob keeps files even though it shouldn't
xeor opened this issue · 2 comments
Describe the bug
When unpacking, I want all the successfully extracted files to be deleted, as described and implemented in #261.
Either there is a bug, or the documentation is unclear as it clearly states that chunks are not kept by default.
To Reproduce
Steps to reproduce the behavior:
0. Create a simple archive containing some files.
- Launch unblob with command
unblob filename
- List the files and see that both "filename" and "filename_extract" are there.
Example
# 1, 2, 3 and 4 are tarballs as well
unblob@d0842b7da70a:/tmp/unblob$ tar -tvf tarball
-rw-r--r-- unblob/unblob 10240 2023-09-29 08:49 1
-rw-r--r-- unblob/unblob 10240 2023-09-29 08:49 2
-rw-r--r-- unblob/unblob 10240 2023-09-29 08:49 3
-rw-r--r-- unblob/unblob 10240 2023-09-29 08:49 4
unblob@d0842b7da70a:/tmp/unblob$ unblob tarball
...
unblob@d0842b7da70a:/tmp/unblob$ ls -l tarball_extract/
total 64
-rw-r--r-- 1 unblob unblob 10240 Sep 29 08:49 1
drwxrwxr-x 2 unblob unblob 4096 Sep 29 09:15 1_extract
-rw-r--r-- 1 unblob unblob 10240 Sep 29 08:49 2
drwxrwxr-x 2 unblob unblob 4096 Sep 29 09:15 2_extract
-rw-r--r-- 1 unblob unblob 10240 Sep 29 08:49 3
drwxrwxr-x 2 unblob unblob 4096 Sep 29 09:15 3_extract
-rw-r--r-- 1 unblob unblob 10240 Sep 29 08:49 4
drwxrwxr-x 2 unblob unblob 4096 Sep 29 09:15 4_extract
....
Expected behavior
Unless I specify --keep-extracted-chunks, I expect the files 1, 2, 3 and 4 to be unlinked inside the tarball_extract folder.
Environment information (please complete the following information):
- OS: Linux
- container: Running latest container image found as of today 2023-09-29
Additional context
At line 589 in 0048d52, carved_path is None, because chunk.is_whole_file at the top of the function is True.
As stated, the doc is confusing. I'm not sure if this is a bug, or if it works as expected and a "chunk" means something other than the whole file when it comes to the "--keep-extracted-chunks" flag.
This is the expected behavior: in your example, 1, 2, 3 and 4 are normal files in the tar, and they are processed further. We never delete original files, but by default we do delete unpacked chunks.
Chunks are parts of files. When the detected content is not the whole file, we identify its start and end offsets and carve the chunk out so it can be extracted. Once a chunk has been extracted, we delete it (this can be controlled with the --keep-extracted-chunks flag).
So there is a difference between a file that was extracted from something and a chunk. The confusing part may be that both chunks and extracted content are stored in an _extract directory. If we detect that the chunk equals the whole file, though, we skip the chunk carving step.
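To make the distinction concrete, here is a minimal Python sketch of the behavior described above. It is not unblob's actual code: the `Chunk` class and the `carve_if_needed` and `cleanup_after_extraction` helpers are hypothetical names used for illustration only. The carved file naming (start-end.ext) mirrors the 0-40960.tar style seen in unblob's output.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a detected chunk: a byte range inside a file."""

    start_offset: int
    end_offset: int
    extension: str

    def is_whole_file(self, file_size: int) -> bool:
        # A chunk that spans the entire file is just the file itself.
        return self.start_offset == 0 and self.end_offset == file_size


def carve_if_needed(chunk: Chunk, source: Path, extract_dir: Path) -> Optional[Path]:
    """Carve the chunk into extract_dir, or return None if it is the whole file."""
    if chunk.is_whole_file(source.stat().st_size):
        # Nothing is carved, so there is no carved copy to delete later.
        return None
    extract_dir.mkdir(parents=True, exist_ok=True)
    carved = extract_dir / f"{chunk.start_offset}-{chunk.end_offset}.{chunk.extension}"
    with source.open("rb") as src:
        src.seek(chunk.start_offset)
        carved.write_bytes(src.read(chunk.end_offset - chunk.start_offset))
    return carved


def cleanup_after_extraction(carved_path: Optional[Path], keep_extracted_chunks: bool) -> None:
    """Delete a carved chunk after extraction unless asked to keep it."""
    if carved_path is not None and not keep_extracted_chunks:
        carved_path.unlink()
    # When carved_path is None (whole-file case), the original file is never touched.
```

In this sketch the whole-file case never produces a carved path, which is why files such as 1, 2, 3 and 4 in the example stay in place whether or not the flag is given.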
Hope this answers your question.
We can maybe add an extra flag to delete all extracted files. Why do you want these to be deleted by the way?
Thanks for the clarification. I understand now. A quick example in case someone else ends up here:
unblob@1df5df510d60:/tmp/test-chunk$ cat tarball tarball > double-tarball
unblob@1df5df510d60:/tmp/test-chunk$ unblob --keep-extracted-chunks double-tarball -e with_keep_flag
unblob@1df5df510d60:/tmp/test-chunk$ unblob double-tarball -e no_keep_flag
unblob@1df5df510d60:/tmp/test-chunk$ ls -l */double-tarball_extract*
no_keep_flag/double-tarball_extract:
total 8
drwxrwxr-x 5 unblob unblob 4096 Sep 29 12:04 0-40960.tar_extract
drwxrwxr-x 5 unblob unblob 4096 Sep 29 12:04 40960-81920.tar_extract
with_keep_flag/double-tarball_extract:
total 88
-rw-r--r-- 1 unblob unblob 40960 Sep 29 12:04 0-40960.tar
drwxrwxr-x 5 unblob unblob 4096 Sep 29 12:04 0-40960.tar_extract
-rw-r--r-- 1 unblob unblob 40960 Sep 29 12:04 40960-81920.tar
drwxrwxr-x 5 unblob unblob 4096 Sep 29 12:04 40960-81920.tar_extract
I have a pipeline that unpacks archives and scans all the unpacked files. It has no need to scan files that were successfully unpacked, so to save time on the huge archives I'm dealing with, I wanted to keep only what was necessary.
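For anyone with a similar pipeline, one possible workaround is to filter on the scanning side instead of deleting files. The sketch below is not an unblob feature, and `files_to_scan` is a hypothetical helper. It assumes that a file with a sibling "<name>_extract" directory was unpacked and can be skipped. It does not check whether that extraction actually succeeded, so treat it as a starting point only.

```python
import os
from pathlib import Path
from typing import Iterator

# Suffix of the extraction directories seen in the listings above.
EXTRACT_SUFFIX = "_extract"


def files_to_scan(root: Path) -> Iterator[Path]:
    """Yield files under root that have no sibling '<name>_extract' directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Names of files in this directory that were unpacked next to themselves.
        unpacked = {
            d[: -len(EXTRACT_SUFFIX)] for d in dirnames if d.endswith(EXTRACT_SUFFIX)
        }
        for name in sorted(filenames):
            if name not in unpacked:
                yield Path(dirpath) / name
```

Calling `files_to_scan(Path("tarball_extract"))` on the first example would skip 1, 2, 3 and 4 themselves and yield only the files found inside their _extract directories.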