felixguendling/cista

Custom serialization of std::map

Closed this issue · 4 comments

Hi Felix,

I want to serialize an ordered map. Since the library doesn't provide such a container, I would use std::map.

using ordered_map = std::map<std::uint32_t, cista::offset::hash_set<int>>;

The documentation doesn't cover this topic, so I'm asking:

  1. I think serializing and deserializing this needs raw mode instead of offset mode (with offset mode only for the mapped value).
    What would the custom serialization and deserialization functions look like?
    It would be a good addition to the documentation to show custom serialization and deserialization of a std::map.

  2. Is it possible to include such a map in a structure and serialize and deserialize the entire structure? E.g.

struct my_data
{
    ordered_map map;
    cista::offset::vector<int> array;
};
  3. If this is possible, what would the serialization and deserialization functions look like?

Regards
Flaviu

I think the same approach that has been used for std::vector<T> would also work for std::map<K, V>:
https://github.com/felixguendling/cista/blob/master/test/std_vector_test.cc

However, this is by no means efficient. You use the memory space of the std::map<K, V> to store a pointer to a contiguous memory region (containing the elements of the map) and the length. When serializing, you copy each element to this memory region (recursively serializing all the elements). At deserialization you need to insert each element from this memory region into the map. So deserialization won't be very fast with this approach.
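For illustration, the copy-based idea could be sketched like this (this is just the concept, not the actual serialize/deserialize hooks from std_vector_test.cc; the helper names and the simplified mapped type are only for illustration):

#include "cista.h"
#include <cstdint>
#include <map>

namespace data = cista::raw;

using ordered_map = std::map<std::uint32_t, int>;  // mapped type simplified
using map_image = data::vector<cista::pair<std::uint32_t, int>>;

// Copy every element into a contiguous, serializable image.
map_image to_image(ordered_map const& m) {
  map_image img;
  for (auto const& [k, v] : m) {
    img.push_back(cista::pair<std::uint32_t, int>{k, v});
  }
  return img;
}

// Rebuild the std::map element by element - this is the slow part.
ordered_map from_image(map_image const& img) {
  ordered_map m;
  for (auto const& [k, v] : img) {
    m.emplace_hint(m.end(), k, v);
  }
  return m;
}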

Another question: why do you need an ordered map? Probably to iterate over the elements in sort order while also needing fast access by map key? Maybe you can achieve something similar with this approach:

namespace data = cista::raw; // can also be cista::offset

template <typename K, typename V>
struct helper {
  data::vector<cista::pair<K, V>*> ordered_;  // for ordered access
  data::hash_map<K, V> random_access_;  // for random access
};

The best solution (performance wise) would probably be to replicate abseil::btree in Cista like I did for cista::hash_map.

Good to know about the std::vector example.
I have to select objects sorted by timestamp within a given time frame. I was thinking of

template <typename K, typename V>
struct helper {
  data::vector<K> ordered_;  // for ordered access
  data::hash_map<K, V> random_access_;  // for random access
};

considering the usual O(1) lookup time of a hash map. But is it O(1)?
Perhaps data::vector<cista::pair<K, V>*> would provide better performance than an additional hash map lookup.
When storing pointers to cista::pair<K, V> elements of a data::hash_map<K, V>, is pointer stability guaranteed if the hash map is changed?
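For context, the time-frame query I have in mind on top of this helper would look roughly like this (just a sketch, assuming the helper above and the usual find()/end() map interface):

#include <algorithm>

// Sketch only: visit all values whose timestamp key falls into [from, to).
template <typename K, typename V, typename Fn>
void for_each_in_time_frame(helper<K, V> const& h, K from, K to, Fn&& fn) {
  // ordered_ is kept sorted, so the time frame is a contiguous key range.
  auto const first = std::lower_bound(h.ordered_.begin(), h.ordered_.end(), from);
  auto const last = std::lower_bound(h.ordered_.begin(), h.ordered_.end(), to);
  for (auto it = first; it != last; ++it) {
    auto const entry = h.random_access_.find(*it);
    if (entry != h.random_access_.end()) {
      fn(*it, entry->second);  // key + mapped value
    }
  }
}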

As a general topic, the advantages of the cista library's serialization and deserialization are known. But what are the downsides? What are the penalties if most of the time operations are performed on cista containers, and serialization and deserialization are only done at application startup and shutdown? It would be useful to be able to compare the overall performance of std containers + slow, naive serialization vs. cista containers + offset serialization. Maybe a table like this for cista containers would bring deeper insights.

Thanks
Flaviu

But is it O(1)?

Yes.

is the pointer stability guaranteed if the hash map is changed?

No. The cista::hash_map is based on the ideas of Google's Abseil hash map. Neither Cista nor Abseil provides pointer/iterator stability when the map is changed (i.e. when non-const member functions are called).

std containers + slow, naive, serialization

This is basically what cereal is doing. You can see the performance in the comparison table.

Maybe a table like this for cista containers will bring deeper insights.

There is no difference in Big-O complexity between Cista and the STL. So the table looks the same.
However, since the Cista hash map is based on the Abseil hash map, it should perform better.

What are the penalties if most of the time operations over cista containers are performed

Basically, this depends on the type of application (data structures, traversal patterns, etc.) you have. Therefore, you need to measure this for your specific application.
For a graph traversal algorithm, this has been done. The results are shown in the existing comparison table. If you want to see the difference regarding traversal ("usage" vs. "serialization" & "deserialization"), the table has a traversal column. Here, you can see that Cista outperforms the std:: data structures (which are used by cereal) in raw mode (112ms vs 125ms) - probably because of better data locality. For offset mode, Cista is slightly slower (132ms vs 125ms) because for each pointer access (operator* and operator->) we need to compute the address (sum of this and the stored offset).
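To make the offset-mode cost concrete, an offset-based pointer conceptually works like this (simplified sketch, not cista's actual offset_ptr implementation):

#include <cstdint>

// The pointer stores the distance from its own address to the target, so
// every dereference costs one extra addition compared to a raw pointer.
template <typename T>
struct offset_ptr_sketch {
  T* get() const {
    return reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(this) + offset_);
  }
  T& operator*() const { return *get(); }
  T* operator->() const { return get(); }
  std::intptr_t offset_;  // relative to the address of this pointer object
};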

Overall: if you use cista::raw mode you get raw performance (same as std or even better). If you use cista::offset mode you have faster deserialization (essentially nothing to do!) but pay a slight performance penalty at usage (~5% slower graph traversal - YMMV for other types of applications!).
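For reference, a minimal offset-mode round trip along the lines of the README usage would look roughly like this (exact calls may differ between cista versions):

#include "cista.h"

namespace data = cista::offset;

struct my_data {
  data::vector<int> array;
};

int main() {
  // Serialize once, e.g. at shutdown.
  my_data obj;
  obj.array.push_back(1);
  auto buf = cista::serialize(obj);

  // "Deserialize" at startup: with offset containers this is essentially
  // only a checked cast, no per-element pointer fix-up.
  auto const* deserialized = cista::deserialize<my_data>(buf);
  return deserialized->array[0] == 1 ? 0 : 1;
}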

Thanks a lot for the detailed response.

I have to check all the aspects of using two cista containers (sorted vector & hash map) vs using a single container with slow serialization.