bodgit/sevenzip

Opening a ~444 GB 7z file hangs for 8 hours; don't know what happened

hktalent opened this issue · 13 comments

Opening a 444 GB 7z file from a LAN network share hangs for 8 hours, and I don't know what happened.

Does the archive contain lots of files or is it a small number of very big files? Either way it shouldn't take 8 hours to just get the archive contents so there's definitely something up. The archive contents are at the end of the archive file, so the code just reads some data from the start of the file that tells it where to seek to at the end of the file, does the seek, and then reads in the contents, which may be compressed and/or encrypted.

Ordinarily I'd ask if you can make the archive available, but I suspect you don't want to upload 444 GB, and I don't want to download it either!

Therefore you'll have to do a bit of debugging yourself, unfortunately. I would start by simply scattering some fmt.Println statements in this function https://github.com/bodgit/sevenzip/blob/master/reader.go#L261 and see how far it gets before it hangs. Bonus points for printing out the various offsets as it goes just in case I've done an invalid cast to/from int64 somewhere.
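
Alongside those prints, a quick caller-side timing check would show whether the time really is going into the initial parse. A minimal sketch, assuming the library's zip-like OpenReader API and a local copy of the archive:

	package main

	import (
		"fmt"
		"log"
		"time"

		"github.com/bodgit/sevenzip"
	)

	func main() {
		start := time.Now()
		r, err := sevenzip.OpenReader("archive.7z") // path assumed
		if err != nil {
			log.Fatal(err)
		}
		defer r.Close()
		// All of the header parsing happens inside OpenReader, so this
		// measures exactly the step that appears to hang.
		fmt.Printf("parsed %d entries in %s\n", len(r.File), time.Since(start))
	}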

I just noticed you said the file is on a LAN share. Depending on what the protocol is (SMB/CIFS or NFS?) when seeking to the end of the archive to read in the file contents, the seek syscall may be implemented in a way that means it has to read the whole file from the network. So if you can, I'd copy the file locally and then see if you can still reproduce the problem.

@bodgit thanks
There are more than 100 million small files; let me debug and see.

Ok, wow. That's going to mean some very big Golang slices.

h.filesInfo.file 15087956

Can this be optimized to send each entry out on a chan?

There are a lot of loops here that could be optimized, e.g.:

	for _, f := range xxx

Debugging this is too complicated; the many loops consume a lot of time. I'd recommend optimizing it so that each file's information is read and sent to a chan for processing before continuing, because otherwise the loop is too slow. Only you can optimize and debug this yourself.

You can programmatically build a 7z file containing 20 million 0-byte files for debugging.
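
In case it helps, here's a sketch of one way to generate such a test archive, assuming the 7z CLI is on the PATH; the file count is scaled down here, so raise n towards 20 million as disk space and patience allow:

	package main

	import (
		"fmt"
		"log"
		"os"
		"os/exec"
		"path/filepath"
	)

	func main() {
		const n = 100_000 // scale towards 20,000,000 for a realistic test
		dir, err := os.MkdirTemp("", "manyfiles")
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < n; i++ {
			f, err := os.Create(filepath.Join(dir, fmt.Sprintf("f%08d", i)))
			if err != nil {
				log.Fatal(err)
			}
			f.Close()
		}
		// -mx=0 stores rather than compresses, which is fine for 0-byte files.
		cmd := exec.Command("7z", "a", "-mx=0", "test.7z", dir)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			log.Fatal(err)
		}
	}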

If you have an archive containing 20 million files, you are still going to end up with slices containing that many entries; that's unavoidable. The metadata is packed in a specific order that has to be read in sequence, so this makes handing things off to channels tricky.

Consider this code which I think you've highlighted:

	for _, f := range u.folder { // 15,000,000+
		total := uint64(0)
		for _, c := range f.coder {
			total += c.out
		}

		f.size = make([]uint64, total)
		for i := range f.size {
			if f.size[i], err = readUint64(r); err != nil { // has to be read in sequence
				return nil, err
			}
		}
	}

Those readUint64() calls have to happen in the same order as the folders were packed. How would a channel help here?

Bear in mind that if the archive metadata is not compressed or encrypted, any offset within r has a 1:1 mapping with the underlying archive file on disk, but if there's any compression or encryption involved that's no longer true. In order for a separate goroutine to read data, it would likely have to create its own copy of the stream and seek to the desired position by reading and discarding data, so you could end up with far more file I/O than just processing everything sequentially.
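
To make that concrete, the most a channel could buy you is something like the sketch below. This is not the library's code: a fixed-width read stands in for 7z's variable-length number encoding, and decodeSizes is a hypothetical name. The io.ReadFull calls still have to happen strictly in sequence; only whatever the consumer does with each value can be offloaded:

	package main

	import (
		"bytes"
		"encoding/binary"
		"fmt"
		"io"
		"log"
	)

	// decodeSizes reads n values sequentially and sends them on a channel.
	// The reads themselves cannot be parallelised: each one depends on the
	// stream position left by the previous one, just like readUint64 above.
	func decodeSizes(r io.Reader, n int) (<-chan uint64, <-chan error) {
		out := make(chan uint64, 1024)
		errc := make(chan error, 1)
		go func() {
			defer close(out)
			buf := make([]byte, 8)
			for i := 0; i < n; i++ {
				if _, err := io.ReadFull(r, buf); err != nil { // still in sequence
					errc <- err
					return
				}
				out <- binary.LittleEndian.Uint64(buf)
			}
		}()
		return out, errc
	}

	func main() {
		data := make([]byte, 8*4) // four zero values, purely for demonstration
		out, errc := decodeSizes(bytes.NewReader(data), 4)
		for v := range out {
			fmt.Println(v)
		}
		select {
		case err := <-errc:
			log.Fatal(err)
		default:
		}
	}

And since the per-value work in the real loop is just a slice store, the channel send/receive overhead would almost certainly make this slower than the existing code.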

Well, it seems I can only give up, decompress it externally, and then process the files. I originally intended to do it all in memory to save disk space.

I will try and create a test archive containing lots of files and see where the bottleneck is. I don't expect the code to be fast, but I'd expect it to complete faster than 8 hours. Like I said, eliminating the LAN and also ensuring your machine isn't starved of memory would be my first step.

How long does 7z l -slt archive.7z take to run on the same archive?

[screenshot of 7z l -slt output]

I'm using the command to decompress and then process the files, so I don't know how long just listing takes. You can programmatically build such a sample file yourself.

I was curious how long it takes to just list the contents of the archive with 7z, given that's roughly equivalent to the main() function you posted, i.e. just parse the archive and list the files without extracting anything.

Interestingly, you can see the same 15087956 value, and the headers in the archive (i.e. the file metadata) are approximately 500 MB, although I'm not sure if that's with or without compression. That means a Golang program is probably going to consume at least that much memory, and likely more.

I don't know; I currently use 7z x -aou xxx.7z to decompress, but I found that walking the extracted directory is still very slow:

	if err := filepath.WalkDir("xxx/", Visit); err != nil {
		//if err := filepath.WalkDir("/volume1/home/admin/MyWork/sgk/BreachCompilation/data", Visit); err != nil {
		fmt.Printf("filepath.WalkDir() returned %v\n", err)
	}
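
For what it's worth, here is a self-contained version of that walk; Visit isn't shown in the thread, so this assumed minimal fs.WalkDirFunc just counts entries, and any slowness beyond this baseline comes from the real callback or the filesystem itself:

	package main

	import (
		"fmt"
		"io/fs"
		"path/filepath"
	)

	var count int

	// Visit is a stand-in for the real callback: it only counts entries.
	func Visit(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		count++
		return nil
	}

	func main() {
		if err := filepath.WalkDir("xxx/", Visit); err != nil {
			fmt.Printf("filepath.WalkDir() returned %v\n", err)
		}
		fmt.Println("entries walked:", count)
	}

With ~15 million extracted files even a no-op walk takes a while, since every entry costs at least one directory read.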