Anuken/Arc

Really huge CPU and memory leaks in ZipFi

SSTentacleSS opened this issue · 18 comments

I recently discovered that my plugins are loading very slowly and decided to turn to the server byte code, while examining it, I noticed that searching for plugin.json using ZipFi takes about 4 seconds and apparently it is called more than once, so I wrote small benchmark with my implementation, implementation using Java libraries and using your ZipFi (using Scala)

Results:

[INFO] Executing [java, -jar, -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000, /tmp/servers/763134302e34.jar, host]
Listening for transport dt_socket at address: 8000
19.921235
{
  "name": "OmnidustryHeadPlugin",
  "author": "OD-Developers",
  "repo": "https://github.com/OD-Developers/OmnidustryHeadPlugin",
  "main": "com.omnidustry.head.Main",
  "description": "Главный плагин серверов омни с основными функциями",
  "version": "1.1.3-a",
  "auth": "OD-Developers"
}
939.067475
null
18.962345
{
  "name": "OmnidustryHeadPlugin",
  "author": "OD-Developers",
  "repo": "https://github.com/OD-Developers/OmnidustryHeadPlugin",
  "main": "com.omnidustry.head.Main",
  "description": "Главный плагин серверов омни с основными функциями",
  "version": "1.1.3-a",
  "auth": "OD-Developers"
}
[01-07-2023 14:10:37] [I] 1 mods loaded.
[01-07-2023 14:10:37] [I] Server loaded. Type 'help' for help.

test.scala

def test = {

  // Java default realization
  
  val c = System.nanoTime()
  val jarFile = JarFile(File("/tmp/servers/763134302e34-debug/config/mods/OmnidustryHeadPlugin.jar"))
  val pluginJson = jarFile
    .entries()
    .asIterator()
    .asScala
    .find(_.getName == "plugin.json")
  val contentByJarFile = String(jarFile.getInputStream(pluginJson.get).readAllBytes(), StandardCharsets.UTF_8)

  System.out.println((System.nanoTime() - c) / 1000000.0)
  System.out.println(contentByJarFile)

  // Current realization

  val cc = System.nanoTime()
  val zipFi = ZipFi(Fi(File("/tmp/servers/763134302e34-debug/config/mods/OmnidustryHeadPlugin.jar")))
  
  // ... Reading plugin.json not problem in such case

  System.out.println((System.nanoTime() - cc) / 1000000.0)
  System.out.println("null")

  // My Realization

  val ccc = System.nanoTime()
  val pluginJson2 = com.omnidustry.head.custom.ZipFi(
    ZipFile(File("/tmp/servers/763134302e34-debug/config/mods/OmnidustryHeadPlugin.jar"))
  )
    .child("plugin.json")
    .readString()
  
  System.out.println((System.nanoTime() - ccc) / 1000000.0)
  System.out.println(pluginJson2)
}

ZipFi.scala

package com.omnidustry.head.custom

import arc.files.Fi
import arc.Files.FileType
import java.util.Collections
import java.util.zip.ZipFile
import java.util.Optional
import java.util.zip.ZipEntry
import scala.util.Try
import java.io.File
import java.io.InputStream

class ZipFi(
  zipFile: ZipFile,
  entry: Optional[ZipEntry] = Optional.empty(),
  path: String = ""
)(
  implicit pathEntries: arc.struct.Seq[String] =
    arc.struct.Seq.`with`(
      Collections.list(zipFile.entries())
    ).asInstanceOf[arc.struct.Seq[ZipEntry]]
      .map(_.getName().replace("\\", "/"))
      .filter(_.startsWith(path + "/"))
      .map(_.replace(path + "/", ""))
) extends Fi(path, FileType.absolute) {
  val names: arc.struct.Seq[String] = pathEntries
    .filter(zipPath => {
      val delimiters: Long = countDelimiters(zipPath.replace(path, ""))
      
      if (zipPath == "")
        false
      else if (delimiters == 1 && zipPath.endsWith("/"))
        true
      else if (delimiters == 0)
        true
      else false
    })

  override def child(name: String): Fi = {
    val path = name.split("/")
    val slashes = countDelimiters(name)
    val thisPathWithDelimiter = if (this.path.isBlank()) "" else this.path + "/"
      
    if (slashes == 0)
      val entry = zipFile.getEntry(thisPathWithDelimiter + name)

      if (entry == null)
        new Fi(File(file, thisPathWithDelimiter + name)) {
          override def exists(): Boolean = false
        }
      else ZipFi(
        zipFile = zipFile,
        entry = Optional.of(entry),
        path = thisPathWithDelimiter + name
      )
    else if (slashes == 1 && name.endsWith("/"))
      ZipFi(
        zipFile = zipFile,
        path = thisPathWithDelimiter + name
      )
    else
      ZipFi(
        zipFile = zipFile,
        path = thisPathWithDelimiter + path(0)
      ).child(
        path.drop(1)
          .mkString("/")
      )
  }
   
  def getPath() = this.path

  override def delete(): Boolean = Try(zipFile.close)
    .fold(_ => false, _ => true)

  override def exists(): Boolean = true

  override def name(): String = file.getName()

  override def parent(): Fi = ???

  override def list(): Array[Fi] = names
    .map(this.child(_))
    .toArray()
    
  override def isDirectory(): Boolean = entry.isEmpty() || entry.get().isDirectory()

  override def read(): InputStream = {
    if (entry.isEmpty())
      throw new RuntimeException("Not permitted")
    
    try {
      zipFile.getInputStream(entry.get())
    } catch error => throw new RuntimeException(error)
  }

  override def length(): Long = if (isDirectory()) 0 else entry.get().getSize()

  override def toString(): String = getPath()

  inline private def countDelimiters(path: String): Long =
    path.chars()
      .filter(_ == '/')
      .count()
}

I'm not sure what the exact bottleneck here is, but it could be related to zip entries being scanned and populated eagerly instead of on-demand.

Yeah, most of the problems with this are related, the structure of the Zip archive is built, which also in addition causes a very expensive getEntries operation at each stage of creating a ZipFi for the entire Zip archive

At the same time, isn't it the case that you are calling entries() yourself every file creation anyway, in the constructor? Mine should only call it once, upon root creation. Other files do not use that constructor.

This could just be something related to The O(N^2) parent/child lookup at the end, which has nothing to do with finding entries.

Can you ask for help? Since I don't know your code rules in general, so I wrote it the way I would, and the code editor doesn't want to open Arc, so I created ZipFi in another project with libraries from Arc

In fact, in my example, the implicit value of entries means exactly that this value will be calculated once, and then it will be implicitly passed between all new ZipFi constructors without the need to recreate it

most likely, it is simply impossible to quickly go through about 10 MB of source data with 27 thousand subfolders

In fact, in my example, the implicit value of entries means exactly that this value will be calculated once, and then it will be implicitly passed between all new ZipFi constructors without the need to recreate it

But in either case, you need to iterate through every single zip entry when you create a new file... no? That sounds like it would be slower in the long run. Even if you cache the result of entries() (where? I don't see how Scala works in that regard), you are creating a new ZipFi every time child() is called, which seems to imply a scan through all these entries.

Yes, I understood this when I wrote the example, but it didn’t really matter, the example was still not aimed at being fast, the new ZipFi already takes into account the filtering of entries

Tests that everything works:

def test = {

  val cc = System.nanoTime()
  val zipFi = ZipFi(Fi(File("/tmp/servers/763134302e34-debug/config/mods/OmnidustryHeadPlugin.jar")))
  val childread = zipFi.child("plugin.json");

  Log.info("-- childread --", null);
  Log.info(childread.path(), null);
  Log.info(childread.exists());
  Log.info(childread.readString(), null);

  val ohno = zipFi.child("pluginasdasd adsa.json");

  Log.info("-- ohno --", null);
  Log.info(ohno.path(), null);
  Log.info(ohno.exists());

  val notExists = zipFi.child("con/sun/jna/not/exists.txt")

  Log.info("-- notExists --", null);
  Log.info(notExists.path(), null);
  Log.info(notExists.exists());

  val notExistsFolder = zipFi.child("con/sun/jna/not/exists")

  Log.info("-- notExistsFolder --", null);
  Log.info(notExistsFolder.path(), null);
  Log.info(notExistsFolder.exists());
}
[INFO] Executing [java, -jar, -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000, /tmp/servers/763134302e34.jar, host]
Listening for transport dt_socket at address: 8000
[01-08-2023 19:01:58] [I] -- childread --
[01-08-2023 19:01:58] [I] plugin.json
[01-08-2023 19:01:58] [I] true
[01-08-2023 19:01:58] [I] {
  "name": "OmnidustryHeadPlugin",
  "author": "OD-Developers",
  "repo": "https://github.com/OD-Developers/OmnidustryHeadPlugin",
  "main": "com.omnidustry.head.Main",
  "description": "Главный плагин серверов омни с основными функциями",
  "version": "1.1.3-a",
  "auth": "OD-Developers"
}
[01-08-2023 19:01:58] [I] -- ohno --
[01-08-2023 19:01:58] [I] /pluginasdasd adsa.json
[01-08-2023 19:01:58] [I] false
[01-08-2023 19:01:58] [I] -- notExists --
[01-08-2023 19:01:58] [I] /con
[01-08-2023 19:01:58] [I] false
[01-08-2023 19:01:58] [I] -- notExistsFolder --
[01-08-2023 19:01:58] [I] /con
[01-08-2023 19:01:58] [I] false
[01-08-2023 19:02:00] [I] 1 mods loaded.
[01-08-2023 19:02:00] [I] Server loaded. Type 'help' for help.
[01-08-2023 19:02:00] [I] Found 1 command-line arguments to parse.
[01-08-2023 19:02:00] [I] Randomized next map to be Ancient Caldera.
[01-08-2023 19:02:00] [I] Loading map...
[01-08-2023 19:02:00] [I] Map loaded.
[01-08-2023 19:02:00] [I] Opened a server on port 6567.

And full code of ZipFi.java:

package com.omnidustry.head;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import arc.Files.FileType;
import arc.files.Fi;
import arc.struct.Seq;

public class ZipFi extends Fi {
  protected Seq<String> pathEntries;
  protected Seq<String> names;
  protected ZipFile zipFile;
  protected String path;
  protected ZipFi parent;
  protected ZipEntry zipEntry;

  /**
   * Constructor to use if path is not a target file
   * 
   * @param zipFile Zip file
   * @param parent Parent Zip
   * @param additionalPath Path to be added to an parent path
   * @param pathEntries Available folder/files entries
   */
  protected ZipFi(ZipFile zipFile, ZipFi parent, String additionalPath, Seq<String> pathEntries) {
    super(
      (parent == null ? "" : parent.path() + "/") + additionalPath,
      FileType.absolute
    );

    this.pathEntries = pathEntries;
    this.zipFile = zipFile;
    this.path = super.path();
    this.parent = parent;
    this.names = pathEntries
      .copy()
      .filter(zipPath -> {
        long delimiters = countDelimiters(zipPath.replace(path, ""));

        return zipPath != "" &&
          ((delimiters == 1 && zipPath.endsWith("/")) ||
          delimiters == 0);
      });

    if (this.path.startsWith("/"))
      this.path = this.path.substring(1);
  }

  protected ZipFi(ZipFile zipFile, ZipFi parent, ZipEntry zipEntry, String additionalPath) {
    super(
      (parent == null ? "" : parent.path() + "/") + additionalPath,
      FileType.absolute
    );
    
    this.zipFile = zipFile;
    this.zipEntry = zipEntry;
    this.path = super.path();
    this.parent = parent;

    if (this.path.startsWith("/"))
      this.path = this.path.substring(1);
  }

  /**
   * Constructor to use if this is the first occurrence of a Zip file
   * 
   * @param zipFile Zip file
   */
  protected ZipFi(ZipFile zipFile) {
    this(
      zipFile,
      null,
      "",
      Seq.with(Collections.list(
        zipFile.entries()
      )).map(entry -> entry.getName())
    );
  }

  /**
   * Constructor of Zip filesystem iterator
   * 
   * @param zipPath Path to Zip archive
   * @throws IOException in case the file was opened with an exception
   */
  public ZipFi(Fi zipPath) throws IOException {
    this(new ZipFile(zipPath.path()));
  }

  /**
   * Count the number of delimiters in a path
   * 
   * @param path Path
   * @return Number of delimiters
   */
  protected long countDelimiters(String path) {
    return path.chars()
      .filter(ch -> ch == '/')
      .count();
  }

  @Override
  public Fi child(String childPath) {
    String[] path = childPath.split("/");
    var delimiters = countDelimiters(childPath);

    if (delimiters == 0 || !names
      .map(name -> name.replaceAll("/", ""))
      .contains(path[0])
    ) {
      var entryPath = (this.path.equals("") ? "" : this.path + "/") + path[0];
      var entry = zipFile.getEntry(entryPath);

      if (entry == null)
        return new Fi(new File(file, path[0])) {

          @Override
          public boolean exists() {
            return false;
          }

        };
      return new ZipFi(zipFile, this, entry, path[0]);
    } else if (delimiters == 1 && childPath.endsWith("/"))
      return new ZipFi(
        zipFile,
        this,
        path[0],
        pathEntries
          .copy()
          .filter(zipPath -> zipPath.startsWith(path[0] + "/"))
          .map(zipPath -> zipPath.replace(path[0] + "/", ""))
      );
    else return new ZipFi(
      zipFile,
      this,
      path[0],
      pathEntries
        .copy()
        .filter(zipPath -> zipPath.startsWith(path[0] + "/"))
        .map(zipPath -> zipPath.replace(path[0] + "/", ""))
    ).child(
      Seq.with(Arrays.copyOfRange(
        path,
        1,
        path.length
      )).toString("/")
    );
  }

  @Override
  public String path() {
    return path;
  }

  @Override
  public InputStream read() {
    if (zipEntry == null || zipEntry.isDirectory())
      throw new RuntimeException("Not permitted");
    
    try {
      return zipFile.getInputStream(zipEntry);
    } catch (IOException exception) {
      throw new RuntimeException(exception);
    }
  }

  @Override
  public boolean delete() {
    try {
      zipFile.close();
      return true;
    } catch (Exception exception) {
      return false;
    }
  }

  @Override
  public boolean exists() {
    return true;
  }

  @Override
  public Fi parent() {
    return this.parent;
  }

  @Override
  public Fi[] list() {
    return names
      .map(childName -> this.child(childName))
      .toArray();
  }

  @Override
  public boolean isDirectory() {
    return zipEntry == null || zipEntry.isDirectory();
  }

  @Override
  public long length() {
    if (this.isDirectory())
      return 0;
    else
      return zipEntry.getSize();
  }

  @Override
  public String toString() {
    return path();
  }

}

(oops, sorry, I wrote con instead of com)

Listening for transport dt_socket at address: 8000
[01-08-2023 19:05:12] [I] -- childread --
[01-08-2023 19:05:12] [I] plugin.json
[01-08-2023 19:05:12] [I] true
[01-08-2023 19:05:12] [I] {
  "name": "OmnidustryHeadPlugin",
  "author": "OD-Developers",
  "repo": "https://github.com/OD-Developers/OmnidustryHeadPlugin",
  "main": "com.omnidustry.head.Main",
  "description": "Главный плагин серверов омни с основными функциями",
  "version": "1.1.3-a",
  "auth": "OD-Developers"
}
[01-08-2023 19:05:12] [I] -- ohno --
[01-08-2023 19:05:12] [I] /pluginasdasd adsa.json
[01-08-2023 19:05:12] [I] false
[01-08-2023 19:05:12] [I] -- notExists --
[01-08-2023 19:05:12] [I] com/sun/jna/not
[01-08-2023 19:05:12] [I] false
[01-08-2023 19:05:12] [I] -- notExistsFolder --
[01-08-2023 19:05:12] [I] com/sun/jna/not
[01-08-2023 19:05:12] [I] false
[01-08-2023 19:05:14] [I] 1 mods loaded.
[01-08-2023 19:05:14] [I] Server loaded. Type 'help' for help.
[01-08-2023 19:05:14] [I] Found 1 command-line arguments to parse.
[01-08-2023 19:05:14] [I] Randomized next map to be Shattered.
[01-08-2023 19:05:14] [I] Loading map...
[01-08-2023 19:05:14] [I] Map loaded.
[01-08-2023 19:05:14] [I] Opened a server on port 6567.

well, if you bring the scala code and java code to the equivalent, it would look like this:

class Test(implicit val someImplicitValue: String = "Value") {
  def someFunction = new Test()
}
class Test {
  String someImplicitValue;

  public Test(String someImplicitValue) {
    this.someImplicitValue = someImplicitValue;
  }

  public Test someMethod() {
    return new Test(someImplicitValue);
  }
}

Running a benchmark on my code, the issue is indeed the terribly-implemented O(N^2) parent/child resolution, nothing to do with entries(). It takes 132 seconds (!) to resolve parents and children for a 300-MB jar file on my machine, and there's no reason for it to be that slow.

image

Hmm, apparently this is true, the file tree is too huge, parents are being searched for too long, this is the main thing that could cause such consumption of processor time

132 seconds (stroking the cat) ._.

Reduced from 132 to 3 seconds simply by placing directories in a separate array... although it could be even faster.

there should have been something important here, but I realized that maybe you are testing my version

spin-2

Reduced from 132 to 3 seconds simply by placing directories in a separate array... although it could be even faster.

I think neither you nor I will be able to influence this, since the Zip archive is read entirely by Java (I'm not sure if it can be read in folders as a file system or if it can only be read in one piece, but if you can read in folders, then a good option it would be to implement something in c++ to speed up the reading process)

Since it reads a zip archive on a 20 MB NVMe drive in 20 ms, I doubt that it will read 300 less than 300 ms in the best scenarios

With some further "lazy" resolution of parents/children, I got it down to ~300ms to load the 384-MB jar file, call child() 3 times and get the length of a file within. This should be good enough.