kristian/minify-xml

file size limited to 2GB

Closed this issue · 9 comments

I have a file which is around 3.2 GB and minify-xml failed to process it.
The exception message:

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (3441779227) is greater than 2 GB

Stacktrace:

node:fs:416
      throw new ERR_FS_FILE_TOO_LARGE(size);
      ^

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (3441779227) is greater than 2 GB
    at new NodeError (node:internal/errors:363:5)
    at tryCreateBuffer (node:fs:416:13)
    at Object.readFileSync (node:fs:461:14)
    at Object.<anonymous> (C:\Users\mofaisal\AppData\Roaming\npm\node_modules\minify-xml\cli.js:106:23)
    at Module._compile (node:internal/modules/cjs/loader:1095:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1124:10)
    at Module.load (node:internal/modules/cjs/loader:975:32)
    at Function.Module._load (node:internal/modules/cjs/loader:816:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:79:12)
    at node:internal/main/run_main_module:17:47 {
  code: 'ERR_FS_FILE_TOO_LARGE'
}

Is it possible to enhance this tool to process large files?

Unfortunately not. minify-xml uses some fairly elaborate regular expressions to perform its replacements. Especially to check, for instance, whether a match is inside a CDATA block or not, having the whole data in a string is currently required. The maximum size of buffers, and thus also of strings, is 2 GB at the moment, see this stackoverflow answer.
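You can check the limits of your own Node build like this; the exact numbers depend on the Node version and the architecture:

```js
// Query the hard limits of the current Node build: the largest possible Buffer
// and the largest possible string. minify-xml needs the whole document in one
// string, so anything beyond these limits cannot be processed in one go.
const { constants } = require("buffer");

console.log("max Buffer size:   " + constants.MAX_LENGTH + " bytes");
console.log("max string length: " + constants.MAX_STRING_LENGTH + " characters");
```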

One could try to make a reduced feature set work with packages like replacestream or stream-replace, which would also work on larger files. As said above, the whole feature set would definitely not work, and it would also be quite a big change to the library. Something for the roadmap, but definitely nothing that works "out of the box" unfortunately. Sorry.
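To illustrate what such a reduced, replacestream-style approach boils down to (this is only a rough sketch with a hypothetical replaceStream() helper, not how minify-xml works internally): a transform stream keeps a small carry-over window, so it never needs the whole document in memory, but it can only apply simple, idempotent replacements that fit into that window:

```js
const { Transform } = require("stream");
const fs = require("fs");

// Apply a global regex replacement to a stream. A small "window" of characters
// is carried over between chunks, so matches crossing a chunk boundary are
// still found, as long as they are shorter than the window. Note that the
// carried-over text is scanned twice, so the replacement must be idempotent.
function replaceStream(regex, replacement, windowSize = 1024) {
    let tail = "";
    return new Transform({
        transform(chunk, _encoding, callback) {
            const data = tail + chunk.toString();
            const replaced = data.replace(regex, replacement);
            tail = replaced.slice(-windowSize);          // keep the window for the next chunk
            const out = replaced.slice(0, -windowSize);  // emit everything before the window
            if (out) this.push(out);
            callback();
        },
        flush(callback) {
            if (tail) this.push(tail);                   // emit the remaining window at the end
            callback();
        },
    });
}

// Example: collapse whitespace between tags. This is exactly the kind of rule
// that is NOT spec-compliant (it would also touch CDATA sections, for instance).
fs.createReadStream("big.xml", { encoding: "utf8" })
    .pipe(replaceStream(/>\s+</g, "><"))
    .pipe(fs.createWriteStream("big.min.xml"));
```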

I will give it a quick try this weekend... this will determine whether it makes sense to add streaming functionality to this library or not. I will let you know in this issue.

Thank you @kristian for your response, and good luck with trying out that streaming functionality.

I think this should be feasible with a reduced feature set (e.g. namespace minification requires prior knowledge about the file, namely which namespaces are in use; such features will not be available in a streaming-enabled API). I will close this issue as soon as I have added support for streams.

Streaming support added with c550302 and v3.0.0 release on NPM. Hope this helps!

Hi @kristian,
Thanks for the swift update on the implementation. However, I found that it is taking too long to process the large input file: it has already taken more than 20 minutes to process a 3.2 GB file and the process is still running.

Hello @faisal6621, I did add a small CLI progress indicator to version 3.2.0, feel free to test it.

The file is streamed right through to the output file, so you should see the output file size growing while the input is being consumed.

Minifying 3.2 GB of XML is no easy feat though. Literally hundreds of thousands of regular expression replacements have to be performed in order to minify that, so how well the minification is handled largely depends on your system's performance.

A 3.5 GB test file I created took about 26 minutes to minify on my machine. Hope this helps.

Hi @kristian,
Yes, I see that the output file size keeps growing during the process. However, even with version 3.2.0 I do not see any progress indicator, either in git-bash or in the Windows command prompt.
Also, it took a similar amount of time (more than 20 minutes) to process the 3.2 GB file.

@faisal6621 in order to see a progress indicator you should use the --output / -o and the --stream / -s options, instead of piping the output to a file via the CLI, if possible.

Does it actually finish the 3.2 GB file in 20 minutes, or does it end in an error / no result? Generally, if you tell me that 3.2 GB takes 20 minutes and that this is too slow, there is nothing I can do to optimize this.

Minifying XML is generally not an easy thing to do. If you want to stay compliant with the XML spec, a lot of parsing is required; it is not as simple as scanning for closing > characters in the document, because > is not a protected character in XML. Thus, to find the matching pairs, you always have to parse the whole tag, like <name ... any number of attributes ...>; that is the only way to parse XML spec-compliantly, and that is what a lot of regex magic guarantees in minify-xml. To stay spec-compliant, there is no way to speed this up for this library. Either you do it like I did in minify-xml, or you parse and interpret the whole XML into a DOM structure (which I am quite sure would also fail for a 3.2 GB file), or you do it in a spec-incompliant way, which, well, this library specifically wasn't designed for.

Streaming has the additional limitation of having to rely on a certain "window size" (you can change that setting with the streamMaxMatchLength option), so everything can only be optimized within that window. If you expect to optimize a huge tag, say <xml>... two gigabytes of spaces ...</xml>, into <xml/>, this is nothing that streaming support can do, because those two gigabytes of spaces will NOT fit into a NodeJS string. NodeJS for x64 currently caps out at 2 GB, so this is the maximum window size you can choose for streaming, noting however that this will make the whole optimization even less efficient.
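To make the window limitation concrete, here is the toy replaceStream() helper from my earlier comment (again, not minify-xml's actual implementation) failing to collapse a tag whose content is larger than the window:

```js
// Assumes the toy replaceStream() helper sketched earlier in this thread is in
// scope. A tag whose content is larger than the carry-over window can never be
// collapsed, because the opening tag has already left the window by the time
// the closing tag arrives.
const { Readable } = require("stream");

const bigTag = "<xml>" + " ".repeat(5000) + "</xml>"; // content larger than the window

function* inChunks(str, size) { // feed the data in small pieces, like a file stream would
    for (let i = 0; i < str.length; i += size) yield str.slice(i, i + size);
}

let result = "";
Readable.from(inChunks(bigTag, 512))
    .pipe(replaceStream(/<xml>\s*<\/xml>/g, "<xml/>", 1024)) // window of 1024 characters
    .on("data", chunk => (result += chunk))
    .on("end", () => console.log(result.includes("<xml/>"))); // false: the window never saw the whole tag
```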

So please narrow down the issues you see with the streaming support of this library, keeping in mind that there are certain technical limitations that a library which respects the XML spec cannot overcome, given the limitations of NodeJS.