josiahcarlson/rom

Pipelining commands to the redis server: A ManyToMany use case with massive insertion of data

Closed this issue · 7 comments

Thanks, it is a tricky situation. When you load data into an empty system, it is probably better to deduplicate the values first so that saving does not raise a unique exception (that is a requirement of TRIADB: it works with single instances for each data type domain and collection).
In that case there is also no need to look up existing, already saved values in the system. My question, then, is whether I will significantly reduce round-trips if I call use_rom_session() instead of use_null_session() to save all newly created instances.
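To make the round-trip question concrete, here is a minimal sketch of batching writes through a pipeline. It is not ROM's session API: it assumes a redis-py style client object that exposes pipeline(transaction=False), and the names chunked, bulk_sadd and batch_size are hypothetical helpers introduced only for illustration.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_sadd(client, key, values, batch_size=1000):
    """Add `values` to the set stored at `key`, one round-trip per batch.

    `client` is assumed to be a redis-py style client (hypothetical here).
    Returns the number of round-trips that were made.
    """
    round_trips = 0
    for batch in chunked(values, batch_size):
        # Commands are buffered on the client until execute() is called.
        pipe = client.pipeline(transaction=False)
        for value in batch:
            pipe.sadd(key, value)
        pipe.execute()  # the whole batch is sent in a single round-trip
        round_trips += 1
    return round_trips
```

With a real server this would be called as bulk_sadd(redis.Redis(), "some:key", values); 2500 values with the default batch size would then cost 3 round-trips instead of 2500.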

PS: The current version of TRIADB is based on 3D numerical vector primary keys, so ideally there would be a different mechanism for creating instances on the server side, with no need for a ROM ID primary key. But this is a "change rom" situation, as you say.

Yes, your Python set cache might be useful if its size is not too big; otherwise the membership check will become resource-expensive. My previous benchmarks were based on the TRIADB add_datum (value), add_item, add_row logic with only a get_by lookup, but for bulk insertion a columnar approach seems likely to prove advantageous. For example, index selectivity is an important parameter, and the execution has to be planned and optimized accordingly. I have to run more tests to measure that.
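As a rough sketch of the set cache idea, assuming all values seen so far fit in client memory: the helper name new_values is hypothetical and is not part of ROM or TRIADB.

```python
def new_values(values, seen=None):
    """Return the values not seen before, in first-seen order.

    `seen` is the client-side cache; it is updated in place so that it can
    be reused across several batches.
    """
    seen = set() if seen is None else seen
    fresh = []
    for value in values:
        if value not in seen:
            seen.add(value)
            fresh.append(value)
    return fresh
```

Only the values returned by this function would need to be saved, so no lookup against the server is required for the duplicates.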

petl and pandas are Pythonic ways of reading flat files, sorting columns and returning unique values. But I am now convinced that for TRIADB column-wise loading and processing is absolutely necessary to improve performance, because of missing values and the requirement of having single instance values, i.e. a data type domain, and attributes with a domain, i.e. a collection of items (set), connected in a graph. I will make a demo and explain how all of this works when I release the new version of the TRIADB software.
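A minimal sketch of the column-wise step, using only the standard library rather than petl or pandas: it collects the unique, non-missing values of each column from CSV text. The function name unique_by_column and the choice of the empty string as the missing-value marker are assumptions made for illustration.

```python
import csv
import io


def unique_by_column(text, missing=("",)):
    """Return {column name: unique non-missing values in first-seen order}."""
    reader = csv.DictReader(io.StringIO(text))
    # A dict per column keeps insertion order and gives fast membership checks.
    columns = {name: {} for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = row.get(name)
            if value is None or value in missing:
                continue  # skip missing cells instead of storing them
            columns[name].setdefault(value, None)
    return {name: list(values) for name, values in columns.items()}
```

Each resulting list could then be saved as single instance values for one domain, independently of the other columns.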

PS: The previous implementation of TRIADB in InterSystems Caché was about two times faster at loading data row-wise, but still slow. I guess this had to do with the code I wrote inside the Caché server using their native Caché language. You would probably say that server-side code is the extreme one can go to when using a specific vendor's solution as the back-end datastore/database.

Further to our discussion, you might also find these massive insertion benchmarks, which I ran in both ROM and RediSearch, interesting. I have drawn the conclusion (see this comment) that Redis is rather slow for massive insertion of data when you use it as a data store with a single TCP/IP connection. My previous non-optimized tests with InterSystems Caché and Python confirm that. Redis also allocates a lot of memory in comparison with NumPy data structures.

Yes, memory storage (data representation) and massive insertion of data are related problems. From the developer's point of view, I would expect to load a dataframe (table) of data into a client-server system as fast as possible, and if it is an in-memory data store I would also expect it to fit in a minimal size, including indexes.

You have referred to fast in-memory data structures like NumPy; I recently started studying such software tools. There is already a well-established trend of using Apache Arrow tools for fast in-memory processing and interprocess communication. Apache Arrow acts as a high-performance interface between various systems.

My work starts at the staging area, i.e. management of data resources, data models, mapping between the two, loading data/indexes in memory and then performing queries (filtering) and aggregations. It would be nice to see if we can link the two projects.

You may find this last comment I wrote here relevant. There are many puzzles to solve in the TRIADB associative - semiotic - hypergraph engine. One of them is the fast construction of bitmap secondary indices. I would prefer to demonstrate that to you and others instead of explaining it in words, and to ask for your valuable assistance when I need to solve a very specific implementation task.
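For readers unfamiliar with the term, a bitmap secondary index can be sketched as follows. This is only a toy illustration that uses Python integers as bitmaps; it is not how TRIADB builds its indices and it is not a fast construction method. The names build_bitmap_index and rows_of are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Decode a bitmap into the sorted list of row ids whose bit is set."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows
```

Filtering on two attributes then becomes a bitwise AND of two bitmaps, for example rows_of(colour_index["red"] & size_index["S"]).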

Hi @josiahcarlson, I think I owe you a final reply here, since you are complaining about me in public.

  1. Yes if one is interested in an open-source project he asks questions, suggests features to add, etc...
  2. That is not true, revisit my github issues/responses. Do whatever you like with your software. I have only tried to use ROM in my project.
  3. The links here have been added for the specific issue I had with ROM, and Redis in general, about very slow massive insertion and memory allocation.

And finally, if you look at the issues I opened, four out of seven have been marked with a bug tag. So I think I helped you improve your software this way. You have also received plenty of feedback through the Q&A we exchanged. And yes, you helped me because I helped you too, end of story. I wish you the best of luck with your new venture.