golang/go

proposal: compress/gzip: add distinct error for trailing garbage data

rgmz opened this issue · 3 comments

rgmz commented

Proposal Details

Context

When using the compress/gzip package to decompress gzip files, receiving a gzip: invalid header error can indicate two distinct possibilities.

  1. The file is not a valid gzip file.

    Example test case (click to expand)
    import (
        "compress/gzip"
        "io"
        "os"
        "testing"
    )
    
    func TestNotAValidGzipFile(t *testing.T) {
        f, err := os.Open("/tmp/python-3.12.3-amd64.exe") // Arbitrary example, any non-gzip file will suffice.
        if err != nil {
    	    t.Fatal(err)
        }
        defer f.Close()
    
        rc, err := gzip.NewReader(f)
        if err != nil {
    	    t.Fatal(err)
        }
        defer rc.Close()
    
        _, err = io.ReadAll(rc)
        if err != nil {
    	    t.Fatal(err)
        }
    }
    
    // === RUN   TestNotAValidGzipFile
    //  gzip_test.go:19: gzip: invalid header
  2. The file is a valid gzip file, but contains invalid trailing data.

    Example test case (click to expand)
    func TestValidGzipFileWithTrailingData(t *testing.T) {
      // Reproducer file. There are many examples of this.
      // https://github.com/udacity/self-driving-car-sim/blob/4b1f739ebda9ed4920fe895ee3677bd4ccb79218/Assets/Standard%20Assets/Environment/SpeedTree/Conifer/Conifer_Desktop.spm
      f, err := os.Open("/tmp/Conifer_Desktop.spm")
      if err != nil {
        t.Fatal(err)
      }
      defer f.Close()
      
      rc, err := gzip.NewReader(f)
      if err != nil {
        t.Fatal(err)
      }
      defer rc.Close()
      
      _, err = io.ReadAll(rc)
      if err != nil {
        t.Fatal(err)
      }
    }
    
    // === RUN   TestValidGzipFileWithTrailingData
    //  gzip_test.go:19: gzip: invalid header

Scenario 2 can be especially confusing because Go's implementation of compress/gzip rejects invalid trailing data, while many popular applications and languages do not. Hence, the ambiguity of this error can lead people to believe that Go is rejecting a valid gzip file.

$ file -i Conifer_Desktop.spm
Conifer_Desktop.spm: application/gzip; charset=binary
$ gzip -S .spm -d /tmp/Conifer_Desktop.spm
gzip: stdin: decompression OK, trailing garbage ignored. 

Side-note: #47809 (comment) explains why this choice was made and how to handle invalid trailing data using gzip.Reader.Multistream.

Proposal Details

I propose adding a distinct error for when subsequent inputs in a data stream don't have valid headers (i.e., trailing garbage). This would provide clarity for users who many not realize that they need to disable gzip.Reader.Multistream. Additionally, it could be used to programmatically call Multistream(false) for specific files with trailing garbage.

I will leave the specific error message and behave for future discussion, should there be interest in implementing this proposal.


Caveat: I am not familiar with the implementation of gzip.Reader. This change may not be possible or desirable due to technical limitations.

rgmz commented

Similar Issues

https://github.com/golang/go/issues/47809
https://github.com/golang/go/issues/61797

Drats, I spent quite a bit of time searching existing issues. Not sure how I missed the second one.

rsc commented

Duplicate of #61797.