apache/shardingsphere

IntervalShardingAlgorithm performance is too bad

Ahoo-Wang opened this issue · 10 comments

Feature Request

When I was about to integrate CosId with IntervalShardingAlgorithm, I reviewed the source code and found the following problems:

  • Ease of use: IntervalShardingAlgorithm first converts the sharding value to a string and then parses it back into a LocalDateTime, so whether the conversion succeeds depends on the configured time-format pattern.
  • Performance: it determines the matching shards by nested traversal over all table nodes, with a LocalDateTime conversion on every check. The performance cost is fatal: for PreciseShardingValue it drops below 7000 ops/s, which is lower than the storage layer (MySQL) itself, so ShardingSphere-JDBC becomes the bottleneck. That is clearly unacceptable.
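To illustrate the performance point, shard resolution does not require string conversion or traversal at all. A minimal sketch (assuming monthly intervals and a hypothetical table-name pattern; this is not the actual CosId implementation) computes the target shard in O(1) from the interval offset:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class IntervalOffsetSketch {
    public static void main(String[] args) {
        // Assumed configuration: monthly intervals starting at a fixed lower bound.
        LocalDateTime lower = LocalDateTime.of(2021, 1, 1, 0, 0);
        LocalDateTime shardingValue = LocalDateTime.of(2021, 12, 14, 22, 0);

        // O(1): compute the interval offset directly instead of iterating
        // every table node and parsing formatted date strings.
        long offset = ChronoUnit.MONTHS.between(lower, shardingValue);
        String suffix = lower.plusMonths(offset).format(DateTimeFormatter.ofPattern("yyyyMM"));
        System.out.println("t_order_" + suffix); // prints t_order_202112
    }
}
```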

Code implementation of benchmark report

I even doubted whether the benchmark itself was implemented incorrectly, but I have done my best to eliminate test noise.

Interval-based-sharding-algorithm-jmh

gradle cosid-shardingsphere:jmh
# JMH version: 1.29
# VM version: JDK 11.0.13, OpenJDK 64-Bit Server VM, 11.0.13+8-LTS
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/work/CosId/cosid-shardingsphere/build/tmp/jmh -Duser.country=CN -Duser.language=zh -Duser.variant
# Blackhole mode: full + dont-inline hint
# Warmup: 1 iterations, 10 s each
# Measurement: 1 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
Benchmark                                                          Mode  Cnt         Score   Error  Units
IntervalShardingAlgorithmBenchmark.cosid_precise_local_date_time  thrpt       66276995.822          ops/s
IntervalShardingAlgorithmBenchmark.cosid_precise_timestamp        thrpt       24841952.001          ops/s
IntervalShardingAlgorithmBenchmark.cosid_range_local_date_time    thrpt        3344013.803          ops/s
IntervalShardingAlgorithmBenchmark.cosid_range_timestamp          thrpt        2846453.298          ops/s
IntervalShardingAlgorithmBenchmark.office_precise_timestamp       thrpt           6286.861          ops/s
IntervalShardingAlgorithmBenchmark.office_range_timestamp         thrpt           2302.986          ops/s

So I re-implemented the time-interval-based sharding algorithm to improve both ease of use and performance.

https://github.com/Ahoo-Wang/CosId/releases/tag/v1.4.5

CosIdIntervalShardingAlgorithm

  • DateIntervalShardingAlgorithm
    • type: COSID_INTERVAL_DATE
  • LocalDateTimeIntervalShardingAlgorithm
    • type: COSID_INTERVAL_LDT
  • TimestampIntervalShardingAlgorithm
    • type: COSID_INTERVAL_TS
  • TimestampOfSecondIntervalShardingAlgorithm
    • type: COSID_INTERVAL_TS_SECOND
  • SnowflakeIntervalShardingAlgorithm
    • type: COSID_INTERVAL_SNOWFLAKE

Is your feature request related to a problem?

NO

Describe the feature you would like.

If you think this implementation is good, I can submit a PR

Thank you for the feature request and the CosId project; it is a good open source project.
I just need some more information before we proceed.

  1. What are the dependencies of CosId? How should we handle it if CosId's Guava version conflicts with ShardingSphere's?
  2. Must the CosId value be a numeric database column type? What about date or varchar types?
  3. Is it possible to merge COSID_INTERVAL_DATE, COSID_INTERVAL_LDT, COSID_INTERVAL_TS and COSID_INTERVAL_TS_SECOND? How about using a properties key to distinguish them?
  4. What is the usage of COSID_INTERVAL_SNOWFLAKE? Why re-implement SNOWFLAKE again?

Thank you very much for your approval and reply.


  1. CosId-Core has no dependencies (I can remove the Guava dependency, or keep its version consistent with ShardingSphere's). However, to solve the SnowflakeId machineId allocation problem, CosId-Redis needs to depend on Redis (io.lettuce:lettuce-core). For the ID segment mode, segment distribution requires a dependency on JDBC (java.sql.*) or Redis (io.lettuce:lettuce-core).
  2. Yes, the distributed IDs provided by CosId only support returning long (we would not choose Date or varchar as a primary key). https://github.com/Ahoo-Wang/CosId/blob/main/cosid-core/src/main/java/me/ahoo/cosid/IdGenerator.java
  3. Yes, I can merge COSID_INTERVAL_DATE, COSID_INTERVAL_LDT, COSID_INTERVAL_TS and COSID_INTERVAL_TS_SECOND. We don't need to distinguish between them; I will dispatch by checking the sharding value's type. https://github.com/Ahoo-Wang/CosId/releases/tag/v1.4.6
  4. CosId implements two types of distributed ID: SnowflakeId and SegmentId.
    But the SnowflakeId algorithm alone is not sufficient. For example, ShardingSphere did not address machineId allocation when implementing SnowflakeId (it offers manual allocation, but that proves inefficient in elastic deployments), whereas CosId provides MachineIdDistributor to solve this problem. There are other features as well.
    Here is a more detailed introduction, such as the optimization of the segment mode:
    • We know the partitioning property of SnowflakeId: the timestamp can be parsed out of a SnowflakeId, which means a SnowflakeId can serve as a time value, so it can drive an INTERVAL sharding algorithm. (When no CreateTime column is available for sharding [a very extreme situation], or when performance requirements are very extreme, using the distributed ID primary key as the query range may be a better choice for the persistence layer.)
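The idea above can be sketched as follows. The bit layout and epoch here are assumptions (the common layout of 41 timestamp bits, 10 machineId bits and 12 sequence bits with the Twitter epoch), not necessarily what CosId or ShardingSphere actually use:

```java
import java.time.Instant;

public class SnowflakeAsTime {
    // Assumed epoch (Twitter's, in milliseconds); CosId's may differ.
    static final long EPOCH = 1288834974657L;

    // Recover the timestamp embedded in a snowflake ID by dropping the
    // assumed 10 machineId bits and 12 sequence bits.
    static Instant timestampOf(long snowflakeId) {
        return Instant.ofEpochMilli(EPOCH + (snowflakeId >>> 22));
    }

    public static void main(String[] args) {
        // Build an ID from a known timestamp (machineId=1, sequence=7)
        // and show the round trip that makes INTERVAL sharding possible.
        long ts = 1639465200000L; // 2021-12-14T07:00:00Z
        long id = ((ts - EPOCH) << 22) | (1L << 12) | 7L;
        System.out.println(timestampOf(id)); // prints 2021-12-14T07:00:00Z
    }
}
```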

Thank you very much for your approval and reply.

  1. CosId-Core has no dependencies (I can remove the Guava dependency, or keep its version consistent with ShardingSphere's). However, to solve the SnowflakeId machineId allocation problem, CosId-Redis needs to depend on Redis (io.lettuce:lettuce-core). For the ID segment mode, segment distribution requires a dependency on JDBC (java.sql.*) or Redis (io.lettuce:lettuce-core).

Redis is not a reg-center component of ShardingSphere. Could you consider using ZooKeeper or Etcd, as in ShardingSphere's cluster mode? Maybe we need to integrate with ShardingSphere deeply.

  1. Yes, the distributed IDs provided by CosId only support returning long (we would not choose Date or varchar as a primary key). https://github.com/Ahoo-Wang/CosId/blob/main/cosid-core/src/main/java/me/ahoo/cosid/IdGenerator.java

OK

  1. Yes, I can merge COSID_INTERVAL_DATE, COSID_INTERVAL_LDT, COSID_INTERVAL_TS and COSID_INTERVAL_TS_SECOND. We don't need to distinguish between them; I will dispatch by checking the sharding value's type. https://github.com/Ahoo-Wang/CosId/releases/tag/v1.4.6

How would you handle using a real timestamp as the business column value? Just use the original interval timestamp?

  1. CosId implements two types of distributed ID: SnowflakeId and SegmentId.
    But the SnowflakeId algorithm alone is not sufficient. For example, ShardingSphere did not address machineId allocation when implementing SnowflakeId (it offers manual allocation, but that proves inefficient in elastic deployments), whereas CosId provides MachineIdDistributor to solve this problem. There are other features as well.
    Here is a more detailed introduction, such as the optimization of the segment mode:

If CosId's snowflake algorithm is good enough, ShardingSphere can use it to replace the original one, but it is better to keep the algorithm type as SNOWFLAKE; it is fine to add the new types COSID_SEGMENT and COSID_SEGMENT_CHAIN.

  • We know the partitioning property of SnowflakeId: the timestamp can be parsed out of a SnowflakeId, which means a SnowflakeId can serve as a time value, so it can drive an INTERVAL sharding algorithm. (When no CreateTime column is available for sharding [a very extreme situation], or when performance requirements are very extreme, using the distributed ID primary key as the query range may be a better choice for the persistence layer.)

Yes, totally agree.

Because the original snowflake algorithm is already implemented in ShardingSphere, it is better to keep the name consistent; for new key generators and sharding algorithms, we can introduce the COSID brand.

The summaries are:

  1. Add 2 new key generators, COSID_SEGMENT and COSID_SEGMENT_CHAIN, and update the original SNOWFLAKE.
  2. Add sharding algorithms for SNOWFLAKE and COSID_TIME_INTERVAL.
  3. Integrate COSID_SEGMENT and COSID_SEGMENT_CHAIN with ShardingSphere deeply, reusing the reg-center of cluster mode.
  1. Add 2 new key generators, COSID_SEGMENT and COSID_SEGMENT_CHAIN, and update the original SNOWFLAKE.

CosId provides a unified interface, IdGeneratorProvider, so there is no need to specify whether the concrete algorithm is SnowflakeId, SegmentId or IdSegmentChain. That is, the user does not need to set TYPE to any specific algorithm; it may be better to define it simply as COSID, and pass an id-name parameter (via Properties) to look up the concrete algorithm from IdGeneratorProvider.
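A minimal sketch of that lookup scheme (the interface and method names here are illustrative, not necessarily CosId's exact API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ProviderSketch {
    // Illustrative interfaces; CosId's real API may differ.
    interface IdGenerator { long generate(); }

    interface IdGeneratorProvider {
        Optional<IdGenerator> get(String idName);
    }

    public static void main(String[] args) {
        // The provider hides whether an id-name maps to SnowflakeId,
        // SegmentId or IdSegmentChain; the key generator only needs
        // TYPE=COSID plus an id-name property.
        Map<String, IdGenerator> registry = new HashMap<>();
        registry.put("order", () -> 42L); // stand-in for a real algorithm
        IdGeneratorProvider provider = idName -> Optional.ofNullable(registry.get(idName));

        String idName = "order"; // would come from the Properties configuration
        long id = provider.get(idName)
                .orElseThrow(() -> new IllegalArgumentException("Unknown id-name: " + idName))
                .generate();
        System.out.println(id); // prints 42
    }
}
```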

If CosId's snowflake algorithm is good enough, ShardingSphere can use it to replace the original one, but it is better to keep the algorithm type as SNOWFLAKE (and update the original SNOWFLAKE).

OK.


  1. Add sharding algorithms for SNOWFLAKE and COSID_TIME_INTERVAL.

OK.


  1. Integrate COSID_SEGMENT and COSID_SEGMENT_CHAIN with ShardingSphere deeply, reusing the reg-center of cluster mode.

Redis is not a reg-center component of ShardingSphere. Could you consider using ZooKeeper or Etcd, as in ShardingSphere's cluster mode? Maybe we need to integrate with ShardingSphere deeply.

OK, I will consider using ZooKeeper to implement SnowflakeId's MachineIdDistributor and SegmentId's IdSegmentDistributor.


How would you handle using a real timestamp as the business column value? Just use the original interval timestamp?

I'm not quite sure what you mean by "real timestamp as business column value" and "original interval timestamp". What is the difference? Could you elaborate?

CosId provides a unified interface, IdGeneratorProvider, so there is no need to specify whether the concrete algorithm is SnowflakeId, SegmentId or IdSegmentChain. That is, the user does not need to set TYPE to any specific algorithm; it may be better to define it simply as COSID, and pass an id-name parameter (via Properties) to look up the concrete algorithm from IdGeneratorProvider.

SNOWFLAKE is a special one; lots of users know the algorithm. It is not a good idea to change the old configuration to a new type.

I'm not quite sure what you mean by "real timestamp as business column value" and "original interval timestamp". What is the difference? Could you elaborate?

For example, the user may just want to use the format yyyy-MM-dd to persist the data.

SNOWFLAKE is a special one; lots of users know the algorithm. It is not a good idea to change the old configuration to a new type.

agree.

I'm not quite sure what you mean by "real timestamp as business column value" and "original interval timestamp". What is the difference? Could you elaborate?

For example, the user may just want to use the format yyyy-MM-dd to persist the data.

Do you mean the java.sql.Date type?
If so, that is already supported by DateIntervalShardingAlgorithm.
If you mean a string type, it is not supported yet, but I can handle it.

Great, we have reached agreement on everything.
We can discuss the coding design in a pull request soon.

Nice!!! Thank you very much for your patience and suggestions.
I will finish part of the coding work and submit a PR within this week.

@Ahoo-Wang We need to integrate ShardingSphere's cluster mode to get an instanceId instead of a work-id, and I have opened issue #14254 to create instanceId for the 3 modes of ShardingSphere; therefore, I will reopen this issue until we finish all the work.

@menghaoranss OK, that's right.