milvus-io/milvus-sdk-java

Unable to page query full data in the collection

Closed this issue · 6 comments

There are approximately 200,000 to 300,000 entities in one partition of the collection, and I cannot retrieve all of them using offset and limit. How should I solve this problem? I am using Java SDK version 2.2.9. Thank you.

Unfortunately, 2.2 can't do this. You have to split the PK range and run multiple queries.
2.3 supports a query iterator that lets you iterate over the data.
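
A minimal sketch of the range-splitting idea, assuming an int64 primary key with known, non-negative minimum and maximum values. The class name `PkRangeExpressions` and its parameters are hypothetical helpers, not part of the Milvus SDK; the sketch only builds the filter strings, and each one would then be passed to a separate query() call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Milvus SDK class): builds one filter
// expression per primary-key range so each query stays small.
final class PkRangeExpressions {
    // Splits the inclusive range [minId, maxId] into consecutive ranges of
    // at most `step` ids. Assumes non-negative ids.
    static List<String> build(String pkField, long minId, long maxId, long step) {
        if (step <= 0 || minId < 0) {
            throw new IllegalArgumentException("step must be positive and minId non-negative");
        }
        List<String> exprs = new ArrayList<>();
        long lo = minId;
        while (lo <= maxId) {
            // Clamp the upper bound of the last range to maxId.
            long hi = (maxId - lo < step - 1) ? maxId : lo + step - 1;
            exprs.add(pkField + " >= " + lo + " && " + pkField + " <= " + hi);
            if (hi == maxId) {
                break;
            }
            lo = hi + 1;
        }
        return exprs;
    }
}
```

For example, `PkRangeExpressions.build("id", 0, 249, 100)` yields three expressions covering 0 to 99, 100 to 199, and 200 to 249. If the keys are sparse, some ranges may return few or no entities.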

Hello, you said that 2.3 supports query iterators for iterating over the data. Is there a demo on the official website?

https://milvus.io/docs/with_iterators.md#Query-with-iterator

yhmo commented

@duanwenL
The iterator feature is implemented in the Python SDK as a client-side implementation. The Java SDK does not currently have this feature (neither 2.2.13 nor 2.3.1).

There is a hidden server-side configuration for this limit. Add this value to milvus.yaml:

quotaAndLimits:
  limits:
    maxQueryResultWindow: 500000

Restart the service, and then you can query with limit=500000. The result is still ultimately bounded by the RPC transfer size limit.
I suggest querying batch by batch with the following steps:

  1. Call query() with expression="id >= 0" to return all the primary keys. Fetch only the primary key field in this step, so that the size of the returned result stays under the RPC transfer limit.
  2. Now that you have all the primary keys, call query() multiple times to fetch the vectors/scalars batch by batch. For example, call query() with expression="id in [1, 2, 3, 4, 5]", then call query() with expression="id in [6, 7, 8, 9, 10]", and so on.
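
The batching in step 2 can be sketched as follows. This is only an illustration under assumptions: `IdBatchExpressions` is a hypothetical helper, not a Milvus SDK class, and it assumes the primary keys from step 1 are already collected in a `List<Long>`. It builds one "id in [...]" expression per batch; each expression would then go into its own query() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not a Milvus SDK class): turns a list of primary
// keys into "<field> in [...]" expressions, one per batch.
final class IdBatchExpressions {
    static List<String> build(String pkField, List<Long> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> exprs = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            // The last batch may be shorter than batchSize.
            int end = Math.min(start + batchSize, ids.size());
            String joined = ids.subList(start, end).stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(", "));
            exprs.add(pkField + " in [" + joined + "]");
        }
        return exprs;
    }
}
```

The batch size is a tuning choice: it should be small enough that one batch of vectors fits under the RPC transfer limit mentioned above.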

@yhmo
Hello, let me first explain how we use Milvus. We store all articles and their vectors in Milvus. Each vector is computed from the article's features, and those features can change within a few minutes. When they change, we regenerate the vector and update the data in Milvus; our current approach is to delete the old data and then insert it again. The other case is removing stale data from the partition. Our articles expire: an article that has not been updated for more than a few days is considered invalid. So we read all the article data from Milvus and decide, based on the last update time, whether each article is still valid. That is why we use the query() method here. How can we currently retrieve all the data in a partition? I have reviewed the official Milvus documentation and could not find such a method. If there is one, please send me the link. Thank you.

yhmo commented

Requirements:

  • 200k~300k entities in a collection, each entity representing an article
  • some articles may change within a few minutes, and their entities must be updated with a new vector
  • some articles are not updated for more than a few days, and those entities must be found

Recommended solution:

  1. Create a collection with 3 fields: id, vector, timestamp (int64).
  2. Insert the 200k~300k entities with their vectors and the current timestamp value.
  3. Once an article changes, delete the entity by its id and insert a new entity with the new vector and the current timestamp value.
  4. To find the articles that have not changed for more than a few days, run a query with the expression "timestamp < current_timestamp - x days".
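
The expression in step 4 has to be sent as a concrete number, so the cutoff is computed on the client. Below is a minimal sketch, assuming the timestamp field stores Unix epoch seconds (the thread only says int64, so the unit is an assumption). `ExpiryExpression` is a hypothetical helper, not a Milvus SDK class.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not a Milvus SDK class): builds the filter for
// entities whose timestamp is older than a given age.
final class ExpiryExpression {
    // Assumes tsField holds epoch seconds written at insert time.
    static String olderThan(String tsField, Instant now, Duration maxAge) {
        long cutoff = now.minus(maxAge).getEpochSecond();
        return tsField + " < " + cutoff;
    }
}
```

Calling `ExpiryExpression.olderThan("timestamp", Instant.now(), Duration.ofDays(3))` gives an expression that matches articles untouched for more than three days. If the field stores milliseconds instead, use `toEpochMilli()` for the cutoff.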