[Question] How to produce a dataset with millions of records in batches
Q-Bug4 opened this issue · 4 comments
Hi, we love using Hollow; it is very nice.
I would like to know whether there is a proper way to produce data in batches. For example, I have 10 million objects to produce, and I want to split them into 10 parts and produce 1 million objects at a time. I need to produce data in batches because my VM does not have enough memory to hold 10 million objects.
I am using `Incremental` with `withNumStatesBetweenSnapshots` so that a snapshot is published only at the beginning and at the end, which makes it run "in batches". But I hit a problem: sometimes `Incremental` did not publish the dataset because some batches did not change it.
I have forked hollow-reference-implementation and added 2 test cases to show what we are looking for. You can check my test cases: ProducerTest
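For context, a minimal sketch of the kind of setup described above, assuming the filesystem publisher and announcer from the reference implementation (the publish directory and the snapshot interval are illustrative, not values from this issue):

```java
import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;

import java.nio.file.Path;
import java.nio.file.Paths;

public class BatchProducerSetup {
    public static HollowProducer.Incremental buildIncrementalProducer() {
        Path publishDir = Paths.get("/tmp/hollow-publish"); // illustrative path

        return HollowProducer
                .withPublisher(new HollowFilesystemPublisher(publishDir))
                .withAnnouncer(new HollowFilesystemAnnouncer(publishDir))
                // Write a full snapshot only every 10 states; the
                // intermediate batch states are published as deltas.
                .withNumStatesBetweenSnapshots(10)
                .buildIncremental();
    }
}
```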
@Q-Bug4 Were you able to find a solution?
@prasoon16081994 @Q-Bug4 Were you guys able to find any solution for this?
@prasoon16081994 @shyam4u Not yet.
We are still using `Incremental` to get close to it. The "dataset not changed" issue can be avoided by checking the version returned after each produce call. Example below:
```java
List<Data> datas = getMillionDatas();
Incremental incremental = getIncremental();
long lastVersion = 0;
for (...) {
    long version = incremental.produce(...);
    // If the version did not change, this batch made no changes to the dataset.
    if (version == lastVersion) {
        throw new RuntimeException("dataset not changed!");
    }
    lastVersion = version;
}
```
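For anyone landing here later, a fleshed-out version of the loop above; a minimal sketch assuming a `Movie` record type, a `loadAllMovies()` loader, and Guava's `Lists.partition` for batching (all three are placeholders for your own types, not Hollow APIs):

```java
import com.google.common.collect.Lists;
import com.netflix.hollow.api.producer.HollowProducer;

import java.util.List;

public class BatchLoop {
    public static void produceInBatches(HollowProducer.Incremental incremental,
                                        List<Movie> allMovies) {
        long lastVersion = 0;
        for (List<Movie> batch : Lists.partition(allMovies, 1_000_000)) {
            long version = incremental.runIncrementalCycle(state -> {
                for (Movie movie : batch) {
                    // Stage each record as an add-or-modify event.
                    state.addOrModify(movie);
                }
            });
            // An unchanged version means this batch did not modify the dataset.
            if (version == lastVersion) {
                throw new RuntimeException("dataset not changed!");
            }
            lastVersion = version;
        }
    }
}
```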
@Q-Bug4
Hey, I finally developed a halfway decent understanding of Hollow.
Isn't it the correct behavior to not write a delta if the dataset didn't change?
Perhaps, if every record is needed, a field that is unique across all records could be added to each record and marked as the primary key.
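If you go that route, marking the unique field as the primary key is just an annotation on the data-model class; a minimal sketch (the `Movie` class and its fields are made up for illustration):

```java
import com.netflix.hollow.core.write.objectmapper.HollowPrimaryKey;

// Records sharing a primary key are treated as the same logical record, so an
// incremental addOrModify event updates the existing record rather than
// adding a duplicate.
@HollowPrimaryKey(fields = "id")
public class Movie {
    long id;      // unique across all records
    String title;

    public Movie(long id, String title) {
        this.id = id;
        this.title = title;
    }
}
```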