mbdavid/LiteDB

[QUESTION] Best way to store hashes?

dylanstreb opened this issue · 5 comments

I'm currently storing hashes in a LiteDB database for use as a file cache. I'm planning on expanding it, so I thought I'd revisit the storage format. Right now I'm just using Base64-encoded strings; I figured that switching the storage to raw bytes would be more efficient.

So I wrote a quick test program to find out. I generated 1,000,000 integers, hashed them, and used a simple stopwatch to compare strings and byte[]. I'm using MurmurHash for this; the output is 128 bits. The models are simple:

    public class StringModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public string? Data { get; set; }
    }

    public class ByteModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public byte[]? Data { get; set; }
    }
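For context, a minimal sketch of the insert path described above, reusing the StringModel/ByteModel classes from the snippet. MurmurHash isn't in the BCL, so MD5 stands in here (it also yields a 128-bit digest); the file name, collection names, and hash input are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using LiteDB;

// Stand-in for the MurmurHash step: MD5 also produces 16 bytes.
static IEnumerable<byte[]> HashRange(int count)
{
    using var md5 = MD5.Create();
    for (int i = 0; i < count; i++)
        yield return md5.ComputeHash(BitConverter.GetBytes(i));
}

using var db = new LiteDatabase("hashes.db");

// Same payload, two encodings: Base64 string vs. raw byte[].
var strings = db.GetCollection<StringModel>("strings");
strings.InsertBulk(HashRange(1_000_000).Select((h, i) =>
    new StringModel { Id = i, Data = Convert.ToBase64String(h) }));

var bytes = db.GetCollection<ByteModel>("bytes");
bytes.InsertBulk(HashRange(1_000_000).Select((h, i) =>
    new ByteModel { Id = i, Data = h }));
```

Timing each InsertBulk with a Stopwatch around the two calls reproduces the comparison described in the question.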

The results were not what I expected:

    Took: 18.4086132  byte hashes
    Took: 16.8086536  base64 hashes
    Bytes database file size:   122028032
    Strings database file size: 133980160

This is just from doing an InsertBulk on the models. Inserting bytes took longer, which wasn't what I expected. The database is smaller, so I assume it isn't doing any kind of binary-to-hex conversion for storage, but I would naively assume that the smaller byte records would also insert faster.

Is there a better way to store small binary data in LiteDB? I'm more concerned about speed than file size, so should I leave it as a Base64 string? Or am I setting up the models incorrectly in some way?

v4 had FileStorage available, which is presumably better for storing binary blobs:
https://github.com/mbdavid/LiteDB/wiki/FileStorage

FileStorage appears to be aimed at large files. This is for numerous small blobs, and since they're of fixed, known length, in theory it's possible to optimize for that. Doing it manually, i.e. by splitting a 128-bit hash into two 64-bit ints, doesn't help.
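For reference, the manual split mentioned here can be done with BitConverter; this sketch just shows the layout (native endianness applies to each half), not a performance claim, and the hash bytes are a stand-in:

```csharp
using System;
using System.Linq;

// One possible way to split a 16-byte hash into two longs and back.
byte[] hash = new byte[16];
new Random(42).NextBytes(hash); // stand-in for a real 128-bit hash

long low  = BitConverter.ToInt64(hash, 0);  // bytes 0..7
long high = BitConverter.ToInt64(hash, 8);  // bytes 8..15

// Reassembling gives back the original bytes:
byte[] roundTrip = new byte[16];
BitConverter.GetBytes(low).CopyTo(roundTrip, 0);
BitConverter.GetBytes(high).CopyTo(roundTrip, 8);
// roundTrip.SequenceEqual(hash) == true
```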

Storing the data as a string does seem to be the best option. I'm guessing there are simply more optimizations out there (in C# and/or in LiteDB) for string processing than for byte[] processing, and that's what improves the performance.

Good point. I read "128 KB", not "128 bits". Sorry.

Perhaps I'm testing the wrong way, but I'm getting roughly the same file sizes and insertion times (except for the Base64 string approach) with the following:

using System.Diagnostics;
using LiteDB;

namespace TestLiteDb128
{
    interface IModel
    {
        [BsonIgnore]
        byte[]? Value { get; set; }
    }

    class Base64Model : IModel
    {
        public int Id { get; set; }
        public string? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v == null ? new byte[16] : Convert.FromBase64String(v);

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Convert.ToBase64String(new byte[16])
                                  : Convert.ToBase64String(value);
            }
        }
    }

    class ByteModel : IModel
    {
        public int Id { get; set; }
        public byte[]? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v;
            set
            {
                Debug.Assert(value?.Length == 16);
                v = new byte[16];
                if (value != null)
                    Array.Copy(value, v, value.Length);
            }
        }
    }

    class LongModel : IModel
    {
        public int Id { get; set; }
        public long lv { get; set; }
        public long hv { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get
            {
                var low = BitConverter.GetBytes(lv);
                var high = BitConverter.GetBytes(hv);
                var ret = new byte[16];
                Array.Copy(low, ret, 8);
                Array.Copy(high, 0, ret, 8, 8);
                return ret;
            }

            set
            {
                Debug.Assert(value?.Length == 16);
                if (value == null)
                {
                    lv = 0;
                    hv = 0;
                }
                else
                {
                    lv = BitConverter.ToInt64(value, 0);
                    hv = BitConverter.ToInt64(value, 8);
                }
            }
        }
    }

    class GuidModel : IModel
    {
        public int Id { get; set; }
        public Guid v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v.ToByteArray();

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Guid.Empty : new Guid(value);
            }
        }
    }

    class Program
    {
        static IEnumerable<T> Generate<T>(int count) where T : IModel, new()
        {
            var value = new byte[16];

            for (long i = 0; i < count; i++)
            {
                var low = BitConverter.GetBytes(i);
                Array.Copy(low, value, low.Length);
                T model = new T();
                model.Value = value;
                yield return model;
            }
        }

        static void Test<T>(string filename) where T : IModel, new()
        {
            var stopwatch = new Stopwatch();

            Console.WriteLine($"Generating items for {filename}...");
            stopwatch.Start();
            var items = Generate<T>(1000).ToList();
            stopwatch.Stop();
            Console.WriteLine($"Generated items in {stopwatch.Elapsed}");

            if (File.Exists(filename))
                File.Delete(filename);

            Console.WriteLine($"Filling {filename} ...");
            stopwatch.Reset();
            stopwatch.Start();
            using (var db = new LiteDatabase(filename))
            {
                var col = db.GetCollection<T>();

                foreach(var item in items)
                    col.Insert(item);
            }
            stopwatch.Stop();
            Console.WriteLine($"Filled {filename} in {stopwatch.Elapsed}");
        }

        static void Main(string[] args)
        {
            Test<Base64Model>("Base64Model.db");
            Test<ByteModel>("ByteModel.db");
            Test<LongModel>("LongModel.db");
            Test<GuidModel>("GuidModel.db");
        }
    }
}

(Yes, I know using a proper benchmarking library would have been better, but I was just trying to get a rough feel for the times and sizes.)

For my test using two longs, I made a struct instead of putting the longs directly into the model. I'm guessing that's the difference.
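The struct itself isn't shown; a hypothetical reconstruction (names invented) might look like the following. With LiteDB's default mapper, a nested struct is serialized as a sub-document rather than as two top-level fields, which adds a little per-record overhead and could plausibly account for the difference:

```csharp
// Hypothetical: the two longs wrapped in a struct instead of being
// top-level fields on the model (names invented for illustration).
public struct Hash128
{
    public long Low { get; set; }
    public long High { get; set; }
}

public class StructLongModel
{
    public int Id { get; set; }
    // Serialized by LiteDB as a nested document: { Low: ..., High: ... }
    public Hash128 v { get; set; }
}
```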

GUID I hadn't even considered trying.
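For completeness, a 128-bit hash packs exactly into a Guid, which LiteDB stores as a native 16-byte BSON value. Guid uses a mixed-endian internal layout, but `new Guid(byte[])` and `ToByteArray()` are consistent with each other, so the round trip is byte-for-byte (the hash bytes here are a stand-in):

```csharp
using System;
using System.Linq;

byte[] hash = new byte[16];
new Random(7).NextBytes(hash); // stand-in for a real 128-bit hash

var asGuid = new Guid(hash);        // pack 16 bytes into a Guid
byte[] back = asGuid.ToByteArray(); // unpack

// back.SequenceEqual(hash) == true
```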