milvus-io/milvus-sdk-java

GetCollStatResponseWrapper randomly returns 0 size for collections in 2.3.x

Hi,

GetCollStatResponseWrapper intermittently returns a row count of zero for some collections. For others it returns the correct value, so the cause is unclear.

For example, here is the collection in a format compatible with LangChain:

{'collection_name': 'test',
 'auto_id': False,
 'num_shards': 1,
 'description': '',
 'fields': [{'field_id': 100,
   'name': 'id',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 36},
   'is_primary': True},
  {'field_id': 101,
   'name': 'text',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 65535}},
  {'field_id': 102,
   'name': 'metadata',
   'description': '',
   'type': <DataType.JSON: 23>,
   'params': {}},
  {'field_id': 103,
   'name': 'vector',
   'description': '',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 768}}],
 'aliases': [],
 'collection_id': 451819797554279738,
 'consistency_level': 0,
 'properties': {},
 'num_partitions': 1,
 'enable_dynamic_field': True}

The real row count:

[{'count(*)': 27}]

The Java code that returns 0:

R<GetCollectionStatisticsResponse> respCollectionStatistics = milvusClient.getCollectionStatistics(
    GetCollectionStatisticsParam.newBuilder()
        .withCollectionName(name)
        .build());
GetCollStatResponseWrapper wrapperCollectionStatistics =
    new GetCollStatResponseWrapper(respCollectionStatistics.getData());
System.out.println(wrapperCollectionStatistics.getRowCount());

0

I use SDK 2.3.4 which is tied to LangChain4J.

I tried to debug it further, and now I have two identical collections of size 27 (with different names), but wrapperCollectionStatistics returns 0 for one and the correct 27 for the other.

yhmo commented

MilvusClient.getCollectionStatistics() in the Java SDK is equivalent to Collection.num_entities in the Milvus Python SDK. This API returns a raw number of entities. It reads the number from Etcd, where it is the sum of the row counts of all sealed segments.

As we know, when users call insert() to insert entities into a collection, the insert request is passed to Pulsar and consumed asynchronously by the querynode and datanode. The datanode accumulates entities in a memory buffer; once the buffer size exceeds a threshold, the datanode flushes the buffer into a sealed segment. A segment's row count is recorded in Etcd only after the sealed segment has been persisted.

So the number returned by MilvusClient.getCollectionStatistics() is not accurate: entities that are still in the buffer and not yet in a sealed segment are not counted.
To get an accurate number, query "count(*)".
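
To make the behaviour above concrete, here is a toy model in plain Java. It is not Milvus code: the class name, the methods, and the use of a row-based threshold are all assumptions made for illustration (the real threshold is on buffer size). It only shows why a small collection can report 0 from statistics while it actually holds 27 rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Milvus code): rows count toward statistics only once sealed. */
class SegmentCountModel {
    private final int flushThreshold;                       // assumed threshold, in rows
    private final List<Integer> sealedSegments = new ArrayList<>();
    private int bufferedRows = 0;                           // rows still in the memory buffer

    SegmentCountModel(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Buffers rows; seals a segment once the buffer exceeds the threshold. */
    void insert(int rows) {
        bufferedRows += rows;
        if (bufferedRows > flushThreshold) {
            flush();
        }
    }

    /** Turns the current buffer into a sealed segment whose row count is recorded. */
    void flush() {
        if (bufferedRows > 0) {
            sealedSegments.add(bufferedRows);
            bufferedRows = 0;
        }
    }

    /** What a statistics call would see in this model: sealed segments only. */
    long statisticsRowCount() {
        long sum = 0;
        for (int rows : sealedSegments) {
            sum += rows;
        }
        return sum;
    }

    /** All rows in this model: sealed plus still-buffered. */
    long totalRowCount() {
        return statisticsRowCount() + bufferedRows;
    }

    public static void main(String[] args) {
        SegmentCountModel collection = new SegmentCountModel(1000);
        collection.insert(27);
        System.out.println("before flush: " + collection.statisticsRowCount()
                + " of " + collection.totalRowCount());
        collection.flush();
        System.out.println("after flush: " + collection.statisticsRowCount()
                + " of " + collection.totalRowCount());
    }
}
```

In this model, two collections with identical contents report different statistics depending only on whether their buffer has been flushed yet, which matches the symptom in the original report.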

This is an example of using MilvusClientV2 to get the row count.
It is a query request, so the ConsistencyLevel controls data visibility. "ConsistencyLevel.STRONG" means the query waits until all data has been consumed by the querynode.
Note: data that is still in Pulsar cannot be queried.

        QueryResp queryResp = client.query(QueryReq.builder()
                .collectionName(collectionName)
                .filter("")
                .outputFields(Collections.singletonList("count(*)"))
                .consistencyLevel(ConsistencyLevel.STRONG)
                .build());
        List<QueryResp.QueryResult> queryResults = queryResp.getQueryResults();
        return (long)queryResults.get(0).getEntity().get("count(*)");
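
The (long) cast above only works if the value stored under "count(*)" is a Long, and get(0) fails on an empty result. A more defensive extraction could look like the sketch below. CountResult and extractCount are hypothetical names, not part of the Milvus SDK, and the sketch assumes each result row is available as a Map<String, Object>, as the .get("count(*)") call in the snippet suggests.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the Milvus SDK. */
class CountResult {
    /**
     * Reads "count(*)" from the first result row.
     * Accepts any Number, so both Integer and Long values work.
     */
    static long extractCount(List<Map<String, Object>> rows) {
        if (rows == null || rows.isEmpty()) {
            throw new IllegalStateException("count(*) query returned no rows");
        }
        Object value = rows.get(0).get("count(*)");
        if (!(value instanceof Number)) {
            throw new IllegalStateException("unexpected count(*) value: " + value);
        }
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        // Stand-in for one query result row; real rows would come from the query response.
        Map<String, Object> row = new HashMap<>();
        row.put("count(*)", 27L);
        System.out.println(extractCount(Collections.singletonList(row)));
    }
}
```

With the snippet above, the input list would be built from the entity map of each query result; how that mapping is written depends on the SDK version in use.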

Thank you, @yhmo. We’ll proceed with this approach.

Could you also let me know if there are any plans to deprecate MilvusClient.getCollectionStatistics()?

yhmo commented

getCollectionStatistics() is much faster than query("count(*)"): getCollectionStatistics() reads the number directly from Etcd, whereas query() requires the collection to be loaded and iterates over all segments to sum their row counts. Sometimes users only want a raw number and don't intend to load the collection, so I don't think getCollectionStatistics() should be marked as deprecated.

In the Python SDK, Collection.num_entities is not deprecated either:
https://github.com/milvus-io/pymilvus/blob/master/pymilvus/orm/collection.py#L265