onekey-sec/unblob

FileExistsError during extraction with Netgear firmware

AndrewFasano opened this issue · 4 comments

Describe the bug
During extraction of at least 29 NETGEAR firmware images, unblob may try to create the same output file twice, which raises an exception. As a result, some files that should be extracted are not.

To Reproduce
Steps to reproduce the behavior:

  1. Download a sample firmware image that triggers the bug: wget https://www.downloads.netgear.com/files/GDC/M4100/M4100-V10.0.2.20.zip
  2. Launch unblob with the command unblob -v M4100-V10.0.2.20.zip
  3. See error:
2024-02-10 23:39.13 [error    ] Unknown error happened while extracting chunk pid=2295991
Traceback (most recent call last):
  File "/unblob/unblob/processing.py", line 607, in _extract_chunk
    if result := chunk.extract(inpath, extract_dir):
  File "/unblob/unblob/models.py", line 115, in extract
    return self.handler.extract(inpath, outdir)
  File "/unblob/unblob/models.py", line 452, in extract
    return self.EXTRACTOR.extract(inpath, outdir)
  File "/unblob/unblob/handlers/archive/cpio.py", line 384, in extract
    parser.dump_entries(fs)
  File "/unblob/unblob/handlers/archive/cpio.py", line 215, in dump_entries
    fs.carve(entry.path, self.file, entry.start_offset, entry.size, mode=entry.mode & 0o777)
  File "/unblob/unblob/file_utils.py", line 511, in carve
    carve(safe_path, file, start_offset, size, mode=mode)
  File "/unblob/unblob/file_utils.py", line 294, in carve
    with carve_path.open("xb") as f:
  File "/usr/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileExistsError: [Errno 17] File exists: '/tmp/tmp1151iav4/M4100_V10.0.2.20.zip_extract/m4100v10.0.2.20.stk_extract/1201148-2097967.lzma_extract/lzma.uncompressed_extract/lib/libthread_db-1.0.so'
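
For context on the failure mode: carve opens the output with Python's "xb" mode, i.e. exclusive create, so the open() call itself fails if the path already exists. A minimal standalone demonstration:

from pathlib import Path

p = Path("demo.bin")
with p.open("xb") as f:  # "x" = exclusive create: succeeds only if demo.bin is new
    f.write(b"first")

with p.open("xb") as f:  # exclusive create of the same path again
    f.write(b"second")   # never reached: open() raises FileExistsError first

So any archive that yields the same output path twice makes the second carve blow up.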

Expected behavior
This error should not be raised; instead, the additional files should be extracted. I made a simple change in file_utils.py's carve method (see below) to return early if the target file already exists, and with this change an extra 75 files are created in [extract_dir]/m4100v10.0.2.20.stk_extract/1201148-2097967.lzma_extract/lzma.uncompressed_extract. I doubt this is the right fix, but it shows that this bug prevents some files from being extracted.

Environment information:

  • OS: Ubuntu 22.04
  • Docker
Linux b4935d734f27 6.2.2 #3 SMP PREEMPT_DYNAMIC Wed Mar  8 12:03:22 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

The following executables found installed, which are needed by unblob:
    7z                          ✓
    debugfs                     ✓
    jefferson                   ✓
    lz4                         ✓
    lziprecover                 ✓
    lzop                        ✓
    sasquatch                   ✓
    sasquatch-v4be              ✓
    simg2img                    ✓
    ubireader_extract_files     ✓
    ubireader_extract_images    ✓
    unar                        ✓
    zstd                        ✓

Additional context
I found this bug while doing some large-scale evaluations of filesystems produced by binwalk and unblob using fw2tar.

My (likely-incorrect) patch that results in additional files being created:

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 21e887b..3db4b98 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -291,6 +291,9 @@ def carve(carve_path: Path, file: File, start_offset: int, size: int):
     """Extract part of a file."""
     carve_path.parent.mkdir(parents=True, exist_ok=True)

+        print(f"Warning not replacing {carve_path}")
+        return
     with carve_path.open("xb") as f:
         for data in iterate_file(file, start_offset, size):
             f.write(data)

After fixing this, I hit another error in the same vein in file_utils.py, which I patched with:

diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 21e887b..3db4b98 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py

@@ -579,7 +582,8 @@ class FileSystem:
         if safe_link:
             dst = safe_link.dst.absolute_path
             self._ensure_parent_dir(dst)
-            dst.symlink_to(src)
+            if not dst.exists():
+                dst.symlink_to(src)

     def create_hardlink(self, src: Path, dst: Path):
         """Create a new hardlink dst to the existing file src."""

Thank you for the very detailed report. I had a quick look and it's probably a bug in the CPIO extractor. I'll keep you posted.

The bug is triggered by a CPIO archive with the same entry stored twice:

$ 7z l sample.cpio | grep thread_db-1.0
2012-01-11 23:17:40 .....        32220        32220  /lib/libthread_db-1.0.so
2012-01-11 23:17:40 .....        32220        32220  /lib/libthread_db-1.0.so
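
If you want to double-check the duplicates without 7z, here is a quick standalone sketch that scans the entry names directly from the headers; it assumes the archive is in the newc ("070701") format and skips the corner cases a real parser has to handle:

from collections import Counter
from pathlib import Path

HEADER_LEN = 110  # 6-byte "070701" magic + 13 eight-char ASCII-hex fields

def align4(n: int) -> int:
    return (n + 3) & ~3  # newc pads both header+name and file data to 4 bytes

def newc_entry_names(path: Path):
    data = path.read_bytes()
    offset = 0
    while offset + HEADER_LEN <= len(data):
        if data[offset:offset + 6] != b"070701":
            break  # not a newc header; bail out
        header = data[offset:offset + HEADER_LEN]
        filesize = int(header[54:62], 16)   # c_filesize
        namesize = int(header[94:102], 16)  # c_namesize, includes trailing NUL
        name_start = offset + HEADER_LEN
        name = data[name_start:name_start + namesize - 1].decode("ascii", "replace")
        if name == "TRAILER!!!":
            break
        yield name
        offset = align4(name_start + namesize) + align4(filesize)

counts = Counter(newc_entry_names(Path("sample.cpio")))
for name, count in sorted(counts.items()):
    if count > 1:
        print(count, name)

This prints the same duplicated paths that the cpio -t | uniq -c pipeline below shows.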

@AndrewFasano I opened the discussion on duplicate entries with a draft fix at #756; your feedback is very welcome :)

$ cpio -t < 'M4100-V10.0.2.20.zip_extract/m4100v10.0.2.20.stk_extract/1201148-2097967.lzma_extract/lzma.uncompressed' | sort | uniq -c | sort -n
...
      1 /var/run
      1 /var/run/utmp
      2 /lib/libthread_db-1.0.so
      2 /sbin/cfe_env
      2 /usr/bin/sort
      2 /usr/bin/tail
      2 /usr/bin/test
      2 /usr/bin/tftp
      2 /usr/bin/top
      2 /usr/bin/traceroute
      2 /usr/bin/uptime
      2 /usr/bin/wc
      2 /usr/bin/which
      2 /usr/bin/xargs
      2 /usr/bin/yes
      2 /usr/sbin/chroot

It looks like only a couple of binaries were patched during the build process (and maybe some previously non-existent files were added), so I think the proper solution is to overwrite duplicate entries when extracting CPIO archives.

I would also limit the solution to the CPIO extractor and not make a general behavior change in FileSystem.
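
For illustration, a minimal sketch of that overwrite behavior, mirroring the carve helper shown in the diff above (carve_overwriting is a hypothetical name; if it lived next to carve in file_utils.py, File and iterate_file would already be in scope, and a real fix would call it only from the CPIO extractor's dump_entries):

def carve_overwriting(carve_path: Path, file: File, start_offset: int, size: int):
    """Extract part of a file, letting a later duplicate entry replace an earlier one."""
    # Simplified: the real carve call sites also pass a file mode.
    carve_path.parent.mkdir(parents=True, exist_ok=True)
    # Remove any previous occurrence so the exclusive create below cannot fail;
    # missing_ok=True covers the common case where the path does not exist yet.
    carve_path.unlink(missing_ok=True)
    with carve_path.open("xb") as f:
        for data in iterate_file(file, start_offset, size):
            f.write(data)

That gives last-entry-wins semantics, similar to GNU cpio with -u, without touching FileSystem for the other extractors.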