tedsmith/quickhash

Quickhash crashes when saving results as HTML for enormous volumes of files

Closed this issue · 3 comments

A user reported problems saving to HTML via the FileS tab when hashing many hundreds of thousands of files.

This kind of issue had been reported prior to v3.3.0, and I thought I had fixed it by introducing the use of filestreams whenever the row count was larger than 20K. So I was confused as to why the user was still reporting crashes.
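For context, the v3.3.0 dispatch looked conceptually like the sketch below. This is not the actual QuickHash source and the helper names are hypothetical; the point is simply that the branch is only as good as the row count it is given.

```pascal
// Minimal sketch (hypothetical names, not the actual QuickHash source).
const
  ROW_THRESHOLD = 20000;
var
  NoOfRows: Integer;
begin
  NoOfRows := CountGridRows;            // this is where the bug turned out to hide
  if NoOfRows > ROW_THRESHOLD then
    SaveAsHTMLUsingFileStream(Filename) // stream rows to disk: flat RAM usage
  else
    SaveAsHTMLInRAM(Filename);          // build the whole document in memory
end;
```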

So I generated a set of 400K files using Ted's Tremendous Data Generator, hashed them, and went to save as HTML. QH crashed, just as the user had found. I inspected the code and confirmed that if the row count is greater than 20K, it is supposed to use filestreams.

So I created another, separate set of 22K files to see what would happen, and to check that the trigger for using filestreams was being reached. I expected that it would be. QH hashed them, and I went to save as HTML. It completed, but only because 22K rows is manageable in RAM: the NoOfRows variable held the value "20" instead of "22000". I then realised what had happened. For v3.3.0, I changed the function CountGridRows to use .RecordCount. What I had missed was this, from the documentation for TDataSet.RecordCount:

"RecordCount is the number of records in the dataset. This number is not necessarily equal to the number of records returned by a query. For optimization purposes, a TDataset descendent may choose not to fetch all records from the database when the dataset is opened. If this is the case, then the RecordCount will only reflect the number of records that have actually been fetched at the current time, and therefor the value will change as more records are fetched from the database. Only when Last has been called (and the dataset has been forced to fetch all records returned by the database), will the value of RecordCount be equal to the number of records returned by the query." (https://lazarus-ccr.sourceforge.io/docs/fcl/db/tdataset.recordcount.html)

So RecordCount was only reporting what had been FETCHED, not what existed! Instead of returning 400K, it was returning something like 25, or however many rows were on screen. As such, the check to determine whether to use a filestream or a RAM-based approach would always fail, and the RAM approach was always used. Filestreams were not being used at all for HTML output of large data sets, despite my having added them previously.

I do remember testing it, but I was probably using the "Go to Start" and "Go to End" GUI elements at the time, which call First and Last, forcing a full fetch and so triggering the use of streams FOR ME, but not for users in the wild who were not doing that. So if a user had 400K rows and didn't use the Start and End window buttons, RecordCount only held a figure reflecting what was on screen. The save-as-HTML routine therefore took the RAM approach, which adds HTML tags to every cell of every row, so RAM was quickly exhausted and QH crashed. So there we are.
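One robust fix for the count itself is not to ask the grid's partially fetched dataset at all, but to ask the database directly with a dedicated query. A minimal sketch follows, with a hypothetical table name; it mirrors the dedicated-TSQLQuery approach described below, though it is not necessarily the exact v3.3.1 code.

```pascal
// uses sqldb
{ Ask the database itself for the total via a dedicated TSQLQuery,
  rather than trusting RecordCount on the grid's partially fetched
  dataset. TBL_FILES is a hypothetical table name. }
function CountGridRows(Conn: TSQLConnection; Trans: TSQLTransaction): Integer;
var
  CountQuery: TSQLQuery;
begin
  CountQuery := TSQLQuery.Create(nil);
  try
    CountQuery.DataBase    := Conn;
    CountQuery.Transaction := Trans;
    CountQuery.SQL.Text    := 'SELECT COUNT(*) FROM TBL_FILES';
    CountQuery.Open;
    Result := CountQuery.Fields[0].AsInteger; // the true total, e.g. 400000
    CountQuery.Close;
  finally
    CountQuery.Free;
  end;
end;
```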

Will be resolved in v3.3.1

The latest draft commit of the v3.3.1 code has helped with the row count, but the same RAM-consumption error occurs later in the save loop, where First and Last are still called. So this still needs resolving.

This saga has continued: the row count was just the start of the issue. Having solved that using a dedicated TSQLQuery, I realised the problem is hit again later in the save function due to further use of EnableControls, DisableControls, .First and .Last. So, after some community advice, I have now made major changes to the functions SaveFILESTabToHTML, SaveCOPYWindowToHTML and SaveC2FWindowToHTML so that they, too, use their own TSQLQueries instead of the DBGrid's queries (a sketch of the approach is below).

All three can now handle many thousands of rows far more easily and execute in just a few seconds: a test of 407K rows was saved as a 56 MB HTML file in under 10 seconds. QuickHash can now create and save such files many times faster than web browsers like Firefox can open them. So we are making progress. The latest v3.3.1 code is committed, but there are some further tweaks to address before final compilation and release.
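For illustration, the reworked save routine now looks conceptually like this. A minimal sketch only, with hypothetical table and column names, and not the verbatim v3.3.1 source: the key points are the dedicated TSQLQuery (so the DBGrid's dataset is never touched with First/Last or DisableControls) and writing each row to a TFileStream as it is fetched, so memory use stays flat regardless of row count.

```pascal
// uses Classes, SysUtils, sqldb
{ Iterate a dedicated TSQLQuery once and write each HTML table row
  straight to a TFileStream. Table and column names are hypothetical. }
procedure SaveFILESTabToHTML(Conn: TSQLConnection; Trans: TSQLTransaction;
  const Filename: string);
var
  Query: TSQLQuery;
  FS: TFileStream;

  procedure WriteStr(const S: string);
  begin
    if Length(S) > 0 then
      FS.WriteBuffer(S[1], Length(S));
  end;

begin
  Query := TSQLQuery.Create(nil);
  FS := TFileStream.Create(Filename, fmCreate);
  try
    Query.DataBase       := Conn;
    Query.Transaction    := Trans;
    Query.UniDirectional := True; // forward-only: fetched rows are not buffered in RAM
    Query.SQL.Text       := 'SELECT FileName, FileHash FROM TBL_FILES';
    Query.Open;
    WriteStr('<html><body><table>' + LineEnding);
    while not Query.EOF do
    begin
      // A real implementation should HTML-escape the field values.
      WriteStr('<tr><td>' + Query.FieldByName('FileName').AsString +
               '</td><td>' + Query.FieldByName('FileHash').AsString +
               '</td></tr>' + LineEnding);
      Query.Next;
    end;
    WriteStr('</table></body></html>' + LineEnding);
    Query.Close;
  finally
    FS.Free;
    Query.Free;
  end;
end;
```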

Fixed in v3.3.1