flash-oss/medici

FastBalance approaches and solutions

Uzlopak opened this issue · 23 comments

I thought about this issue multiple times, and these were my "simple" solutions.

  1. Account Table
    All the transactions will be summed up in an account document. Potential issue is the writeLocks on heavy used accounts.

  2. Create a balanceSumUp collection (UPD: we have implemented this solution)
    The actual issue of balance is that the operation is O(N) where N is the amount of transactions assigned to an account. So probably the easiest solution would be to make a balance call and store the value into the balanceSumUpTable. So the BalanceSumUp would contain the last transactionId of the balance method, the sum of the balance, and an expiration value, e.g. daily. So what happens is, thay we first search for the balance in the dailyBalanceSumUp Collection. If we find it, we determine the balance and the transaction id we stored. We then do the balance but the _id has to be bigger than the last _id of balanceSumUp. Probably needs an appropriate index. But what happens is that if we have balance of an Account with 1000000 transactions we would not read 1000000 transactions but e.g. the last 1000 of the 1000000 transactions. Thus reducing the Runtime to O(n+1), where n is the amount of additional transactions since the persisted balance. If you set the expires (when the document is automatically deleted by mongo) to 1 day than you would have only once a day the slow down to run the whole balance. Or you set the expires (mongo expires) to 2 days and add a second expires to 1 day, but this expires does not result to a automatically deletion but indicates the freshness of the balance. So you take the last persisted balance, check if it is fresh enough, if so you calculate the balance since then. If it is not fresh, you persist the new balance to the collection, were you update freshness and expires. So you have a potential write conflict when writing to the persisted balance resulting in a retry to persist the balance?! But only once a day.
    Or you don't expire at all and do just the freshness check.

This would not make it necessary to store additional information to the transactions. And it should be still faster than traditional balance.

Ok. I gave it a very deep thought. :) Well, as deep as I can at midnight.

I think your medici_balances idea is the best. (Yeah, let's rename balanceSumUp -> medici_balances.)

Few random thoughts.

  • I would add "account name" to the below list, and possibly the "timestamp of that transaction".

would contain the last transactionId of the balance method, the sum of the balance, and an expiration value

  • Maybe, just maybe, we would not need to create more than one doc in medici_balances.

  • I would not recommend looking up last 1000 of 1000000 transactions

Thinking to start implementation of your idea tomorrow.

I thought about it. The approach seems to be problematic. There is no guarantee that transactions are always sequentially as e.g. the timestamp is generated in JavaScript with local time of node.
Also ObjectIds are created with a timestamp so they are not sorted incremental.

So if one service is for some reason off some minutes, than we have an issue.

So we still risk wrong balances with this approach.

Even if we use the amount of notes/transactions we could get in some issues as it could still result in a wrong balance.

So the partial balances approach is still racy.

Maybe there is no other way than track account changes always to avoid wrong data. On the other hand this can result in heavy database degradation if one account is heavily used.

Also ObjectIds are created with a timestamp so they are not sorted incremental.

AFAIK, in Medici the:

  • datetime is the time when real-world operation happened. For example, if a payment was done yesterday evening but for some operational reasons we are recording it only today morning.
  • timestamp is the time we added the record to the DB. This was added before mongoose had createdAt auto-insertion.

We should not rely on any of the above Date fields as they are not enough granular.

This ordering issue is solvable in at least two ways.

1. Generate _id server side only.

The _id can be generated DB server side if not provided by the client (driver). They are sorted incrementally.
https://docs.mongodb.com/manual/reference/method/db.collection.insertOne/#_id-field
In other words, we can be pretty safe betting on _id being always properly sorted.

Mongoose does not support forceServerObjectId: true driver option (it force overrides to false). Thus, we would have to maintain second connection to the DB which would have forceServerObjectId: true set.

In Entry.commit() we should replace this:

await Promise.all(this.transactions.map((tx) => new transactionModel(tx).save(options)));

With something like this:

const txModels = this.transactions.map((tx) => new transactionModel(tx));
for (const txModel of txModels) {
    const err = txModel.validateSync();
    if (err) throw err;
}

const { insertedIds } = await mongoCollection.insertMany(this.transactions);

2. Utilise $currentDate: {$type: "timestamp" }.

Create new field (e.g. ts: new Timestamp()).

Quoting this page:
https://docs.mongodb.com/manual/reference/operator/update/currentDate/

The $currentDate operator sets the value of a field to the current date, either as a Date or a timestamp.

Timestamps

  • the most significant 32 bits are a time_t value (seconds since the Unix epoch)
  • the least significant 32 bits are an incrementing ordinal for operations within a given second.

Again, mongoose does not support Timestamp type. We would have to maintain second connection to the DB.

Mongoose is the problem

At this point I see mongoose more as a problem. I am considering removing mongoose entirely. Here is why:

  1. It looks like a significant effort in our codebase to migrate mongoose 5->6. Meaning, we won't be able to use latest Medici if its dependency is Mongoose v6. It could be the same for other projects.
  2. To implement proper FastBalance the Mongoose is on our way. (See above.)
  3. There are two features Mongoose helps us with: field data type consistency and defaults values, data validation. Reimplementing all that sounds feasible.

What do you think @Uzlopak ?

Another feature we can try dropping is the approve. We are not using. I don't know anyone using it.

1. Generate _id server side only.

Looks like we don't need to maintain second connection. We can reuse mongoose's driver connection with one option:

transactionModel.collection.insertMany(this.transactions, {
  forceServerObjectId: true,
})

Yes. If you dont provide an _id and with forceServerObjectId it will generate the objectids on the server side. But it works as long as you don't use sharding. When you start using sharding the mongo instances can have issues with different times.

Also if you want to use timestamps like you mention, probably doesnt matter if you just use mixed as schema type for it. As long as you don't want to read it and only want to sort by it, it should be no issue. But the timestamp has to be set by the server.

Do the unit tests run though if you use mongoose v5?

Also be careful. I wrote the stress tests because mongo transactions need atleast one awaited operation, before you can send the rest in a Promise.all. e.g. that's why I store first the journal and then Promise.all the transactions. The other way round would result in transaction errors. If you want to use server side ObjectIds you would need again to store first all the transactions, and then extract the object IDs, assign them to the journal and then store the journal.

Maybe it makes sense to set mongoose to ">= 5" or "*" and remove the package.lock.json from the repo/npm package. Thus making the mongoose version irrelevant. We did not use any specific API from mongoose I think. We should still check our code with monoose 5 but there is no special code used. Only mongoose 6 has some more typings, like the pipeline.stage interfaces, which was only added by me to figure out why it broke from version 6.0 to 6.1.2. with mongoose 6.1.3 that speciifc type error got fixed, so we can remove the PipelineStage interfaces again, as they are not critical and the typings will be anyway checked implicitly in the aggregation method.

I removed the pipelinestage interfaces.

So I think mongoose 5 should work with the unit tests also.

Do the unit tests run though if you use mongoose v5?

No.

Btw, FYI we are not using TypeScript ourselves.

Regarding forceServerObjectId. I read the docs. Looks like forceServerObjectId and $currentDate have exactly the same shortcomings. Moreover, $currentDate have 1000 times higher chances of problems because server-generated _id granularity is based on milliseconds, but Timestamp is based on seconds.

Regarding replicaset issues. Typically only one node is a write-node. Well, at least that's what we have. So, the _id's will be generated always by the same node.

I replaced multiple .save() calls with single insertMany().

=====

Here is how what I did to the .commit() method:

Step 1 - validate all the data using mongoose model.

      const txModels = this.transactions.map((tx) => new transactionModel(tx));
      for (const txModel of txModels) {
        const err = txModel.validateSync();
        if (err) throw err;
      }

Step 2 - bulk insert (single operation!) all the ledge entries.

      const result = await transactionModel.collection.insertMany(this.transactions, {
        forceServerObjectId: true, // This improves ordering of the entries on high load.
        ordered: true, // Ensure items are inserted in the order provided.
        session: options.session, // We must provide either session or writeConcern, but not both.
        writeConcern: options.session ? undefined : { w: 1, j: true }, // Ensure at least ONE node wrote to JOURNAL (disk)
      });
      let insertedIds = Object.values(result.insertedIds);

Step 3 - read generated _ids

      if (insertedIds.length === this.transactions.length && !insertedIds[0]) {
        // Mongo returns `undefined` as the insertedIds when forceServerObjectId=true. Let's re-read it.
        const txs = await transactionModel.collection
          .find({ _journal: this.transactions[0]._journal }, { projection: { _id: 1 }, session: options.session })
          .toArray();
        insertedIds = txs.map((tx) => tx._id as Types.ObjectId);
      }

Step 4 - save the journal.

      // @ts-ignore
      (this.journal._transactions as Types.ObjectId[]).push(...insertedIds);
      await this.journal.save(options);

Sounds fine. We should then document that we don't use mongoose operations so using mongoose hooks in schemas like save middleware won't be hit anymore.

Also the replicaset issue is different from sharding I think. But on the other hand, if it is shard specific it should be no issue.

Why does insertMany does not contain the insertedIds? According to the docs it should contain the insertedIds?

https://docs.mongodb.com/manual/reference/method/db.collection.insertMany/#examples

Also it sounds critical that you check if length of transactions is the same as insertedIds. What if they are not the same, for any reason. It should throw if it is not the same length just to avoid inconsistencies.

Why do you do Object.values on insertedIds? InsertedIds should be already an Array of ObjectIds? Maybe this is the reason that ObjectIds are undefined?
Sorry I can't test stuff on my mobile :).

Ok, I get it.

PS: This server side objectid bug by mongo is actually pretty bad.

Fast balance was implemented.
Also, I've added mongoose v5 support (now both v5 and v6 are supported).
And published: npm i medici@next

Gonna try it on large codebases.

Ok. I've published the npm i medici@next. Actually, 8 of them. But I found a large number of bugs (mostly introduced by my refactorings). All the bugs are fixed now.

The medici v5 is 100% compatible with our codebases (well, except book -> Book module exports rename).

Going to publish the 5.0.0-next.9 as 5.0.0 next week.

Houston, we have a problem.

syncIndexes kills existing indexes. This must not happen! People (us) might have any other arbitrary indexes on "transactions" collection. Moreover, the README recommends which other indexes you better add manually.

The solution I see so far...

The syncIndexes must not use the mongoose model.syncIndexes. Instead, it should drop all indexes which rely on approved property, and recreate them (and only them) without the approved.

Also, it should apply other default indexes.

transactionSchema.index({ _journal: 1 });
transactionSchema.index({ datetime: -1, timestamp: -1 });
transactionSchema.index({ accounts: 1, book: 1, datetime: -1, timestamp: -1 });
transactionSchema.index({ "account_path.0": 1, book: 1 });
transactionSchema.index({ "account_path.0": 1, "account_path.1": 1, book: 1 });
transactionSchema.index({ "account_path.0": 1, "account_path.1": 1, "account_path.2": 1, book: 1 });

Here is the code to retrieve info about all indexes:

db.medici_transactions.getIndexes();

I actually recommend to drop the indexes. If you want to have a transition then use setTransactionSchema and add the old indexes, so they get not dropped. Then mongo will just add new indexes. Then in a week you drop the old indexes.

Our system will stop functioning if we drop any single custom index at any point of time.

Sorry, your recommendation is not possible.

The syncIndexes must be rewritten or removed from the package.

You are not forced to call syncIndexes. And also syncIndexes will not drop any of the old indexes, if you define the old indexes in your custom Schema when using setTransactionSchema.

You are not forced to call syncIndexes.

This is incorrect. I am forced to sync indexes because we dropped approved property which is a part of several indexes.

And also syncIndexes will not drop any of the old indexes, if you define the old indexes in your custom Schema when using setTransactionSchema.

We are not using any custom schema. (Moreover, I believe "custom schema" should be removed from medici later.) So, in our case the syncIndexes does drop all the indexes.

Moreover, mongoose devs do not recommend using their index auto creation in production. Quoting:

When your application starts up, Mongoose automatically calls createIndex for each defined index in your schema. Mongoose will call createIndex for each index sequentially, and emit an 'index' event on the model when all the createIndex calls succeeded or when there was an error. While nice for development, it is recommended this behavior be disabled in production since index creation can cause a significant performance impact. Disable the behavior by setting the autoIndex option of your schema to false, or globally on the connection by setting the option autoIndex to false.

Let's quickly choose which way to go:

a) vivid warning in readme and JSDoc that syncIndexes will destroy any custom indexes. And provide the detailed information on how to properly migrate indexes between v4 and v5.
b) rewrite syncIndexes to intelligently remove old outdated indexes and create new.

The new MongoDB 5.0 "Time series" feature could be what we need: https://docs.mongodb.com/manual/core/timeseries-collections/

Thoughts?

Time series collections efficiently store sequences of measurements over a period of time. Time series data is any data that is collected over time and is uniquely identified by one or more unchanging parameters. The unchanging parameters that identify your time series data is generally your data source's metadata.

When you query time series collections, you operate on one document per measurement. Queries on time series collections take advantage of the optimized internal storage format and return results faster.

The implementation of time series collections uses internal collections that reduce disk usage and improve query efficiency. Time series collections automatically order and index data by time.

I went with the option a) - updated the README.

There is another idea for balance snapshot - read-only views.

The balance query can be stored as a read-only view. So that medici wouldn't need to maintain the medici_balances collection. MonoDB would do in the background it instead.

Released as v5.0.0

npm i medici@latest

At last! :)