ironfede/openmcdf

Performance degradation in recent commits

michaelp opened this issue · 11 comments

I am trying to take a look at the performance degradation in the last few commits.
Regardless of the results, I would like to share the initial benchmark that I created.

Introduction
There are several test projects in the openmcdf solution whose names suggest they target performance. I might be missing something, but they appear to be internal tools used by the core maintainers rather than a benchmark suite of any kind. In addition, the lack of a standard format for benchmark results makes it difficult to reproduce results on different machines and to communicate changes, findings, and improvements.

Assumptions:

  1. To get stable results, I focus on a file that resides completely in memory.
  2. My first scenario focuses purely on reader performance.

Goals:

  1. Provide a simple way to point out irregularities.
  2. Establish a framework for justifying performance changes.

The initial scenario itself is very simple:

  • Create, in memory, a file with a single storage and a few streams of 1 MB each
  • Measure reading a stream using preallocated buffers of different sizes (ideally, no additional memory should be allocated and performance should be great); a sketch of such a benchmark follows this list
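
Here is a minimal sketch of what such a BenchmarkDotNet benchmark could look like. It illustrates the scenario rather than being the exact code that produced the tables below; the OpenMcdf calls, including the `CFStream.Read(buffer, position, count)` overload, are assumed from the 2.x API surface used elsewhere in this thread.

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using OpenMcdf;

// Sketch: build a compound file entirely in memory (one storage, a 1 MB stream)
// and read it back through a preallocated buffer whose size is a benchmark parameter.
// Run with BenchmarkDotNet.Running.BenchmarkRunner.Run<StreamReadBenchmark>().
[MemoryDiagnoser]
public class StreamReadBenchmark
{
    private const int TotalStreamSize = 1048576; // 1 MB per stream

    private CompoundFile cf;
    private CFStream stream;
    private byte[] buffer;

    [Params(512, 1024, 4096, 131072, 262144, 524288, 1048576)]
    public int BufferSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var ms = new MemoryStream();

        // Write a file with a single storage containing one 1 MB stream.
        var writer = new CompoundFile();
        writer.RootStorage
              .AddStorage("Storage")
              .AddStream("Stream0")
              .SetData(new byte[TotalStreamSize]);
        writer.Save(ms);
        writer.Close();

        // Reopen it from memory for the read benchmark.
        ms.Position = 0;
        cf = new CompoundFile(ms);
        stream = cf.RootStorage.GetStorage("Storage").GetStream("Stream0");

        buffer = new byte[BufferSize]; // preallocated once, reused for every read
    }

    [Benchmark]
    public void Test()
    {
        long position = 0;
        while (position < TotalStreamSize)
        {
            int read = stream.Read(buffer, position, buffer.Length);
            if (read <= 0)
                break;
            position += read;
        }
    }
}
```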

Here are the starting numbers, straight off master.
Note that every run reads the same amount of data and the results are sorted from fastest to slowest; the last rows are the problematic ones.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.648 (1803/April2018Update/Redstone4)
Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core 2.2.3 (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 64bit RyuJIT
  Core   : .NET Core 2.2.3 (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 64bit RyuJIT

Job=Core  Runtime=Core  
| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |-----------:|----------------:|-----:|------:|-------:|------:|------:|------:|----------:|
| Test | 1048576 | 1048576 | 309.6 us | 6.064 us | 5.672 us | 188.4766 | 93.7500 | - | 1.11 MB |
| Test | 524288 | 1048576 | 333.9 us | 6.485 us | 7.468 us | 195.8008 | 97.6563 | - | 1.12 MB |
| Test | 262144 | 1048576 | 403.8 us | 5.643 us | 5.002 us | 213.8672 | 106.9336 | - | 1.16 MB |
| Test | 131072 | 1048576 | 436.4 us | 2.850 us | 2.666 us | 210.4492 | 104.9805 | - | 1.22 MB |
| Test | 4096 | 1048576 | 5,211.3 us | 58.252 us | 48.643 us | 1000.0000 | 382.8125 | - | 5.42 MB |
| Test | 1024 | 1048576 | 20,646.4 us | 409.656 us | 613.153 us | 3343.7500 | 593.7500 | 125.0000 | 18.42 MB |
| Test | 512 | 1048576 | 39,732.5 us | 791.207 us | 1,385.735 us | 7000.0000 | 500.0000 | 214.2857 | 35.74 MB |

A quick investigation led to the hypothesis that
`private static void EnsureUniqueSectorIndex(int nextSecID, HashSet<int> processedSectors)` in CompoundFile.cs is one of the causes of the slowdown. A quick review of the area shows that a `HashSet` created without a proper capacity will keep growing and reallocating its internal arrays, which in turn causes GC pressure and can slow things down.

To validate my hypothesis, I made a single change:

private static void EnsureUniqueSectorIndex(int nextSecID, HashSet<int> processedSectors)
{
    //if (processedSectors.Contains(nextSecID))
    //{
    //    throw new CFCorruptedFileException("The file is corrupted.");
    //}

    //processedSectors.Add(nextSecID);
}

And the results are way better:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.648 (1803/April2018Update/Redstone4)
Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core 2.2.3 (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 64bit RyuJIT
  Core   : .NET Core 2.2.3 (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 64bit RyuJIT

Job=Core  Runtime=Core  
| Method | BufferSize | TotalStreamSize | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |-----------:|----------------:|-----:|------:|-------:|-------:|------:|------:|------:|----------:|
| Test | 1048576 | 1048576 | 263.8 us | 5.409 us | 6.012 us | 262.6 us | 181.6406 | 90.3320 | - | 1.09 MB |
| Test | 524288 | 1048576 | 290.8 us | 4.712 us | 4.407 us | 291.4 us | 183.1055 | 91.3086 | - | 1.1 MB |
| Test | 262144 | 1048576 | 347.4 us | 6.872 us | 9.407 us | 344.8 us | 190.4297 | 94.7266 | - | 1.11 MB |
| Test | 131072 | 1048576 | 399.2 us | 7.768 us | 13.399 us | 394.5 us | 199.7070 | 99.6094 | - | 1.13 MB |
| Test | 4096 | 1048576 | 3,337.0 us | 66.090 us | 127.332 us | 3,278.9 us | 398.4375 | 199.2188 | - | 2.27 MB |
| Test | 1024 | 1048576 | 12,268.9 us | 199.769 us | 186.864 us | 12,289.2 us | 1000.0000 | 453.1250 | - | 5.83 MB |
| Test | 512 | 1048576 | 23,864.0 us | 464.640 us | 456.339 us | 23,761.8 us | 1750.0000 | 375.0000 | 125.0000 | 10.57 MB |

Conclusion:

  • I would recommend reverting commit [de52b97](https://github.com/ironfede/openmcdf/commit/de52b9783864dceea3ca1c65334863ba63ab65b3). In general, validation should be an external feature that allows the user to check the integrity of the storage layer
  • I would like to push the benchmark into the repository (if @ironfede is OK with it)

One of the reasons for those changes was to avoid problems with the reader getting stuck in an infinite loop and/or going round in circles before throwing an OutOfMemoryException, which doesn't seem like something that should be left to callers?

Possibly it can be done in a better way, though (it seems a shame that HashSet doesn't have a means of reserving capacity until the latest .NET runtime versions).
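
For reference, here is a tiny sketch of the capacity idea on runtimes that do have the `HashSet<T>(int capacity)` constructor (.NET Framework 4.7.2+ / .NET Core 2.0+); `expectedSectorCount` is a placeholder name, not an existing field in the library:

```csharp
// Pre-size the visited-sector set from an upper bound on the chain length so that
// Add() never has to grow and rehash it mid-traversal. 'expectedSectorCount' is a
// placeholder for whatever bound the header or stream length provides.
int expectedSectorCount = 1024;
var processedSectors = new HashSet<int>(expectedSectorCount);
```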

In essence, we are facing the standard FAT filesystem consistency check. I don't remember it by heart, but I am confident we can come up with a solution ( :) this is a standard interview question: find a loop in a linked list).
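
Here is a minimal sketch of what a Floyd (tortoise-and-hare) check could look like over a FAT-style sector chain, assuming a `next` function that performs the FAT lookup and an `endOfChain` terminator value; the names are illustrative, not the library's API:

```csharp
// Floyd's cycle detection over a sector chain: 'slow' advances one link per
// iteration, 'fast' advances two; if the chain loops, they must eventually meet.
// Uses O(1) extra memory instead of a HashSet of visited indices.
static bool ChainHasLoop(int startSector, Func<int, int> next, int endOfChain)
{
    int slow = startSector;
    int fast = startSector;

    while (fast != endOfChain && next(fast) != endOfChain)
    {
        slow = next(slow);        // one step
        fast = next(next(fast));  // two steps

        if (slow == fast)
            return true;          // the walkers met inside a cycle
    }

    return false;                 // reached the end-of-chain marker: no loop
}
```

The trade-off versus the HashSet approach is that the loop is detected without any extra allocation, but each sector link is looked up roughly three times instead of once.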

Benchmark project committed. @michaelp, do you have time to try something different from the HashSet container to check for loops AND keep performance at optimal levels?

@ironfede yes, I will look at it this weekend. I will try Floyd's algorithm for loop detection.

@ironfede
Federico, it would be nice if we could discuss a few general topics, such as:

  • Block management
  • Fragmentation
  • Decoupling page & I/O management from the logic of the file system chains, which will improve performance and unblock concurrency

As you're looking in this area, and #40 is related to the checks anyway:
If I debug the Test_CorruptedSectorChain_Doc test, then it gets to

if (header.DIFATSectorsNumber != 0)

with header.DIFATSectorsNumber set to 268435456, which in #40 was causing it to run out of memory before the existing 'validationCount' check failed.
Is it possible in such cases to detect that the value is invalid before even going into the parse loop (e.g. because there aren't that many sectors available)? (My knowledge of the format is too basic to be sure of the best way to validate this.)
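
Something like the following is the kind of up-front bound check I'm imagining (a sketch only; `sourceStream` and `sectorSize` are placeholder names, not the library's actual fields, and 512 is the fixed size of the compound file header):

```csharp
// A file of N bytes cannot contain more than roughly (N - 512) / sectorSize sectors,
// so a DIFAT sector count beyond that bound is provably bogus before any chain is parsed.
long maxPossibleSectors = (sourceStream.Length - 512) / sectorSize;

if (header.DIFATSectorsNumber > maxPossibleSectors)
    throw new CFCorruptedFileException("DIFAT sector count exceeds what the file size allows.");
```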

@michaelp sorry for the delay in responses, but I'm currently really busy with main work activities. I'm really interested in your points on improving performance and architecture, especially the last one. Do you think a separate branch could work for exploring some ideas?
@Numpsy, your idea could be useful, but it would only work for local header corruption, not for sector chain corruption, where you can't predict block linking without traversing the linked list: this is the more subtle type of corruption and probably the hardest to detect without a lot of allocation in a lookup data structure (=> I'm really interested in a Floyd algorithm implementation as suggested by @michaelp).

@ironfede I am writing some tests and I need a few more evenings to come up with a solution to the current task (just keeping you in the loop). I want to deliver what I've started, and then we can talk.

I do realize it has been 5 years, but did anything ever come out of this?
We are currently looking into a performance issue when creating lots of Storages/Streams, and it seems to point to the EnsureUniqueSectorIndex function.

I have used this in the OpenMcdf.PerfTest project:


using System;
using System.Diagnostics;
using System.IO;
using OpenMcdf;

class Program
{
    static int MAX_STORAGE_COUNT = 500;
    static int MAX_STREAM_COUNT = 50;
    static String fileName = "PerfLoad.cfs";

    static void Main(string[] args)
    {
        var stopWatch = Stopwatch.StartNew();

        // Always recreate the file so the timing measures a full build.
        File.Delete(fileName);
        if (!File.Exists(fileName))
        {
            CreateFile(fileName);
        }

        Console.WriteLine($"Create took {stopWatch.Elapsed}");

        CompoundFile cf = new CompoundFile(fileName);
        stopWatch.Restart();
        CFStream s = cf.RootStorage.GetStream("Test1");
        Console.WriteLine($"Read took {stopWatch.Elapsed}");
        Console.Read();
    }

    private static void CreateFile(String fn)
    {
        CompoundFile cf = new CompoundFile();

        // 50 small streams directly under the root storage...
        for (int j = 0; j < MAX_STREAM_COUNT; j++)
        {
            cf.RootStorage.AddStream("Test" + j.ToString()).SetData(Helpers.GetBuffer(300));
        }

        // ...plus 500 storages, each with 50 small streams of its own.
        for (int i = 0; i < MAX_STORAGE_COUNT; i++)
        {
            var storage = cf.RootStorage.AddStorage("Storage" + i.ToString());

            for (int j = 0; j < MAX_STREAM_COUNT; j++)
            {
                storage.AddStream("Test" + j.ToString()).SetData(Helpers.GetBuffer(300));
            }
        }

        cf.Save(fn);
        cf.Close();
    }
}

Would you accept a PR for an additional Configuration flag in the CFSConfiguration enum that disables the Unique Sector Index check?
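
To make the proposal concrete, the call site might look something like this; `NoEnsureUniqueSectorIndex` is the hypothetical new flag and does not exist in the current enum, while `CFSVersion.Ver_3` and `CFSConfiguration.Default` are existing values:

```csharp
// Hypothetical usage of the proposed flag (the flag itself does not exist yet).
var cf = new CompoundFile(CFSVersion.Ver_3,
    CFSConfiguration.Default | CFSConfiguration.NoEnsureUniqueSectorIndex);
```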

I have done some improvements that help a tiny bit at least...
#132 and #133

My test on my machine looks like this, with validation (properly) enabled versus disabled:

Create with validation took 00:00:39.3890968
Create without validation took 00:00:30.0826849

That's a nice 30% improvement

I wanted to have a look at the earlier suggested algorithm for detecting loops at some point, but then fell over a number of bugs that needed to be fixed first, and never got to it :-(