infiniflow/infinity

ROADMAP 2024

writinwaters opened this issue · 10 comments

v0.5.0 (Planning)

Core:

  • Supports product quantization.
  • Supports scalar quantization (u8).
  • Supports binary vectors with Hamming distance (see the sketch after this list).
  • Supports DiskANN index. #1953
  • Supports cluster management, log replication, and failover.
  • Supports result caching and a pagination function. #1903
  • Supports regular expressions on varchar fields. #1986
  • Supports analyzers from RAGFlow. #1973
  • Supports authentication with default roles.
  • Supports system-level data backup and recovery.
  • Supports providing a comment when creating a database / index / table.
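
Infinity's own binary-vector API isn't shown in this thread; purely as an illustration of the metric behind the "binary vectors with Hamming distance" item, here is a minimal Python sketch over bit-packed vectors:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two bit-packed binary vectors (dtype=uint8).

    Each byte holds 8 dimensions; XOR exposes the differing bits and a
    popcount sums them.
    """
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Two 16-dimensional binary vectors packed into 2 bytes each.
a = np.packbits([1, 0, 1, 1, 0, 0, 1, 0,  1, 1, 0, 0, 1, 0, 1, 1])
b = np.packbits([1, 1, 1, 0, 0, 0, 1, 0,  0, 1, 0, 1, 1, 0, 1, 1])
print(hamming_distance(a, b))  # 4
```

XOR plus popcount is why Hamming distance stays cheap on packed binary vectors, which is what makes binary quantization attractive for recall.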

Integration

  • Integrates with RAGFlow #1405

Tools

v0.4.0

Core:

  • Enables the IVF index. #1917
  • Supports filter expressions for match_dense / match_sparse / match_text / fusion respectively. #1803
  • Supports using full-text search as a filter. #1803
  • Supports MinShouldMatch as a full-text filter (see the sketch after this list). #1862
  • Supports date and time types. #1804 #1824
  • Supports the IN operator in filter expressions. #1839
  • Supports locking/unlocking a table to prevent manipulation. #1813
  • Adds/removes columns by locking the table.
  • Supports Korean for full-text search. #1228
  • Supports a highlighter for full-text search. #1861
  • Supports adding a column comment when creating a table. #2038
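
The roadmap doesn't spell out the MinShouldMatch syntax; conceptually, it keeps only documents that match at least n of the query's terms. A minimal sketch of the semantics (not Infinity's implementation):

```python
def min_should_match(doc_terms: set[str], query_terms: list[str], n: int) -> bool:
    """Return True if the document matches at least n of the query terms."""
    return sum(term in doc_terms for term in query_terms) >= n

doc = {"vector", "database", "hybrid", "search"}
query = ["vector", "fusion", "search"]
print(min_should_match(doc, query, 2))  # True: "vector" and "search" match
```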

Integration

  • Supports S3 storage. #1809
  • Refactors file I/O to integrate S3, NFS, and the local filesystem.

API

  • Creates a dedicated embedded Infinity Python module. #1786
  • Supports order-by / sort function. #1944

Tools

  • GUI: list databases / tables, show variables, show configs. #1841

v0.3.0

Core:

  • Virtual file system. #1184
  • Memory optimization for sparse index building. #1436
  • Supports unordered sparse embedding indexes when importing data. #1419
  • Unifies SIMD operations. #1473
  • Supports the bf16 embedding data type. #1579
  • Supports the f16 embedding data type. #1579
  • Supports the int8 embedding data type. #1527
  • Supports multiple vectors per document. #1679
  • Smart full-text query syntax. #1622
  • Uses full checkpoints plus Parquet export/import to support data backup and restore (see the sketch after this list).
  • Supports exporting Parquet files. #1330
  • Supports importing Parquet files. #1330
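
For the Parquet-based backup/restore path above, the export and re-import steps look roughly like this with pyarrow; the table contents here are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table snapshot: column names and data are illustrative only.
table = pa.table({
    "id": [1, 2, 3],
    "body": ["doc one", "doc two", "doc three"],
})

# Export (backup): write the snapshot to a Parquet file.
pq.write_table(table, "backup.parquet")

# Import (restore): read the Parquet file back into memory.
restored = pq.read_table("backup.parquet")
assert restored.equals(table)
```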

v0.2.0

  • Supports sparse vector index. #1174
  • Supports the tensor data type. #1179
  • Supports cosine similarity. #1176
  • Supports a configurable reciprocal rank fusion (RRF) operator (see the sketch after this list). #1177
  • Multi-way recall: supports more than two recall paths. #1178
  • HTTP API: supports GET/SET variables. #1180
  • Embedded Infinity. #1181
  • Exports data to CSV and JSONL file types. #1175
  • Unifies background computation tasks under a single task executor. #1182
  • Integrates late-interaction models, such as ColBERT. #1279
  • Supports building secondary indexes on string columns. #1235
  • Supports Japanese for full-text search. #1137
  • Supports traditional Chinese for full-text search. #1376
  • Supports proximity (NEAR) queries for full-text search. #1346
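
Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every recall list that returned it, where k is the configurable constant (60 is a common default, not necessarily Infinity's). A minimal sketch covering the multi-way case:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # e.g. dense vector recall
text   = ["d1", "d4", "d3"]   # e.g. full-text recall
sparse = ["d1", "d3", "d5"]   # e.g. sparse vector recall
print(rrf([dense, text, sparse]))  # 'd1' and 'd3' rank first
```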

v0.1.0

  • Builds the HNSW index in parallel. #341
  • Supports aggregate operations. #357
  • Supports the order-by (sort) operation. #339
  • Supports the limit operation. #362
  • Supports order by + limit as a top operation. #408
  • Secondary index on structured data types. #360
  • New full-text search. #358
  • Minmax of column data (see the pruning sketch after this list). #448
  • Bloom filter on structured data columns. #467
  • Refactors ColumnVector: reduces serialization times as much as possible. #449
  • Supports a new data type: date. #371
  • Supports a new data type: bool. #394
  • Refactors metadata: provides a clear interface to access metadata instead of traversing the metadata tree. #368
  • Refactors error handling: provides normalized error codes and error messages. #439
  • Segment GC and segment compaction. #466
  • Refactors WAL to use physical logging instead of logical logging. #431
  • Asynchronous index building: data becomes queryable once imported / inserted.
  • Storage cleanup: deprecated index/segment/catalog files need to be cleaned up to save disk space. #635
  • Incremental checkpoint. #438
  • New Python API to show database system values. #495
  • New Python API to explain the query plan. #496
  • HTTP API. #779
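
The minmax and Bloom filter items above both feed block/segment pruning: a block is skipped when its statistics prove it cannot contain a matching row. A toy sketch of the idea (the single-hash Bloom filter and field names are simplifications, not Infinity's on-disk format):

```python
class BlockStats:
    """Per-block statistics for pruning; a toy model, not Infinity's format."""

    def __init__(self, values: list[int], num_bits: int = 64):
        self.min, self.max = min(values), max(values)
        # Tiny Bloom filter: set one bit per value (single hash for brevity).
        self.num_bits = num_bits
        self.bloom = 0
        for v in values:
            self.bloom |= 1 << (hash(v) % num_bits)

    def may_contain(self, v: int) -> bool:
        """False means the block definitely holds no row equal to v."""
        if not (self.min <= v <= self.max):            # minmax pruning
            return False
        return bool(self.bloom & (1 << (hash(v) % self.num_bits)))  # bloom check

stats = BlockStats([3, 7, 42, 99])
print(stats.may_contain(42))   # True: the block must be scanned
print(stats.may_contain(500))  # False: outside [3, 99], block is skipped
```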

Backlog

Core

  • Supports AArch64.
  • Natively supports macOS (M1) and Windows.
  • User management.

Integration

  • Supports NFS.
  • Integrates with LangChain.
  • Integrates with LlamaIndex.
  • Embedding function.

Tools

  • Infinity database backup and restore tools. #1183
  • Monitoring tools.
  • Data migration tool.

CI improvements: post Infinity logs on CI failure; use Ubuntu 20.04 as the base of the dev image.
Fuzz testing of Infinity.

Secordary index on structured data type.
--->
Secondary index on structured data types.

There is a misspelling here: "Secordary" should be "Secondary".

Fixed, thank you.

Compatibility testing

Image tag references (a test-driver sketch follows the list):
  • centos 7, 8: https://hub.docker.com/_/centos/
  • ubuntu 20.04, 22.04, 24.04: https://hub.docker.com/_/ubuntu, https://releases.ubuntu.com/
  • debian 8, 9, 10, 11, 12: https://hub.docker.com/_/debian, https://www.debian.org/releases/
  • opensuse/leap 15.0, 15.1, 15.2, 15.3, 15.4, 15.5: https://hub.docker.com/r/opensuse/leap
  • openeuler/openeuler 20.03, 22.03: https://hub.docker.com/r/openeuler/openeuler
  • openanolis/anolisos 8.6, 23: https://hub.docker.com/r/openanolis/anolisos
  • openkylin/openkylin 1.0: https://hub.docker.com/r/openkylin/openkylin
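
A compatibility run over this matrix could be driven by a small script that starts each image and runs a smoke test. The sketch below is only an assumption about how such a driver might look; the image list is abbreviated and the smoke-test command is a placeholder, not Infinity's actual test entry point:

```python
import subprocess

# Image tags drawn from the compatibility matrix above (abbreviated).
IMAGES = [
    "centos:7", "centos:8",
    "ubuntu:20.04", "ubuntu:22.04", "ubuntu:24.04",
    "debian:11", "debian:12",
    "opensuse/leap:15.5",
    "openeuler/openeuler:22.03",
]

# Placeholder smoke test; a real run would install and start infinity here.
SMOKE_TEST = "cat /etc/os-release"

for image in IMAGES:
    result = subprocess.run(
        ["docker", "run", "--rm", image, "sh", "-c", SMOKE_TEST],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "FAILED"
    print(f"{image}: {status}")
```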

I would like to contribute to this project; which issue would be a good starting point?

@Kelvinyu1117
We do have a couple of issues that might work for contributors new to this project.

  1. Add minmax information to blocks/segments in the current datastore. This information is primarily used for data filtering. (#448)
  2. Implement a bloomfilter for the blocks/segments to enhance point queries. (#467)
  3. Currently, query results are stored in memory in a columnar format. However, the client expects the results in Apache Arrow format. At the moment, the format conversion is executed on the Python client, but this hurts performance, so we plan to convert the results to Apache Arrow format on the server side before sending them to the client (see the sketch after this list).
  4. There are several optimizer rules to implement, such as constant folding and simplification of arithmetic expressions, which are not yet on the roadmap. Feel free to work on them if interested.
  5. We have additional, more complicated tasks not listed here. For instance, the current executor operates with one thread per CPU. We're considering using coroutines to enhance efficiency, but we don't have a solid solution yet. If you have experience in this area, you are very welcome to propose your solution.
  6. We understand you're interested in contributing C++ code. However, if that's not the case, there's also unimplemented Python code, such as test cases and the Python SDK API.
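
For context on item 3, the client-side conversion being described looks roughly like this with pyarrow; the column names and values are invented for illustration:

```python
import pyarrow as pa

# Hypothetical columnar query result: {column_name: list_of_values}.
columnar_result = {
    "id": [1, 2, 3],
    "score": [0.92, 0.87, 0.55],
}

# Client-side conversion (the current approach): build an Arrow table
# from Python lists, paying a per-value conversion cost in Python.
table = pa.Table.from_pydict(columnar_result)
print(table.schema)
```

Moving this step server-side would let the client receive Arrow buffers directly and skip the Python-level conversion, which is the motivation given in item 3.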

Your work is exceptional! I would like to suggest that, given the current landscape, incorporating binary quantization and ColBERT-like ranking is crucial for any vector database.
Apologies for commenting on the roadmap issue instead of creating a separate feature request.


Nice, we will put this request into the v0.2.0 release.

@JinHai-CN Hi, I have experience developing a database using Arrow. Is the issue about converting query results to Arrow format still active? I'd like to take it.

@niebayes #1198 has been created; we can discuss the requirements in that issue.