Inserting a vector after a search significantly degrades the performance of the next search

Question

HelWireless opened this issue 4 years ago · 4 comments

Environment details

Hardware/Software conditions (OS, CPU, GPU, Memory)
- CentOS 7, GPU Tesla V100, Memory 16 GB
Method of installation (Docker, or from source)
- Docker
Milvus version (v0.3.1, or v0.4.0)
- Milvus 0.7.0, 0.8.0
Milvus configuration (Settings you made in server_config.yaml)
- engine_config:
  - use_blas_threshold: 1100
  - gpu_search_threshold: 1000
- cache_config:
  - cpu_cache_capacity: 20
  - insert_buffer_size: 14
  - cache_insert_data: true
- db_config:
  - preload_collection: '*'
  - auto_flush_interval: 1
- collection info
  - qa_title_vec CollectionInfo(count: 266776, partitions_stat: [PartitionStat(tag: '_default', count: 266776, segments_stat: [SegmentStat(segment_name: '1587548082004630000', count: 266776, index_name: 'IVFSQ8H', data_size: 322265408)])]))
  - qa_doc_sim CollectionInfo(count: 267294, partitions_stat: [PartitionStat(tag: '_default', count: 267294, segments_stat: [SegmentStat(segment_name: '1587542371490349000', count: 267294, index_name: 'IVFSQ8', data_size: 322891152)])]))

Additional context
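I have a new vector. I first search the collection for its top-n most similar vectors, and then insert the new vector into the collection. A short time later, when the next vector arrives and I search the collection for its top-n similar vectors, performance is severely affected: in my tests the search is about 17 times slower. I have seen this on both 0.7.0 and 0.8.0, and I tried both the IVFSQ8H and IVFSQ8 indexes.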

**My code**

from milvus import Milvus, IndexType, MetricType
import time

MILVUS_URI = 'tcp://192.168.0.21:19530'
MILVUS_NPROBE = {"nprobe": 1124}
MILVUS_TOP_K = 1000


class MilvusTools(object):

    def __init__(self):
        self.URI = MILVUS_URI
        self.nprobe = MILVUS_NPROBE
        self.top_k = MILVUS_TOP_K
        self.milvus = self._conn()

    def _conn(self):
        """Connect to Milvus and return the client object."""
        milvus_conn = Milvus()
        milvus_conn.connect(uri=self.URI)
        return milvus_conn

    def get_sim_item_ids(self, collection_name, vec, ids, if_new):
        t1 = time.time()
        status, milvus_res = self.milvus.search(collection_name=collection_name,
                      query_records=[vec], top_k=self.top_k, params=self.nprobe)
        print("+" * 60, "\n", status, "\n", "+" * 60)
        t2 = time.time()
        if status.code == 0:
            sim_article_ids = milvus_res.id_array[0]
            sim_article_scores = milvus_res.distance_array[0]
            # The original compared against the builtin `id`; the intended value is `ids`.
            if ids in sim_article_ids:  # remove the article itself
                index = sim_article_ids.index(ids)
                sim_article_ids.pop(index)
                sim_article_scores.pop(index)
        if if_new:
            # If it is a new vector, insert it into the collection.
            self.milvus.insert(collection_name, records=[vec], ids=[ids])

        t3 = time.time()
        print("search time is ", t2 - t1, "\n insert time is ", t3 - t2)
        return milvus_res

    def _close(self):
        self.milvus.disconnect()


import random

# Generate 20 vectors of 300 dimensions
vectors = [[random.random() for _ in range(300)] for _ in range(20)]

for ids, vector in enumerate(vectors):
    mv = MilvusTools()
    time.sleep(1)
    _ = mv.get_sim_item_ids("qa_title_vec", vector, ids=ids, if_new=True)
    mv._close()
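
The loop above only prints raw timings. To put a single number on the slowdown (the report mentions roughly 17x), a small harness along these lines might help. It is a sketch, not part of the original report: `time_search_then_insert`, `slowdown`, `search_fn`, and `insert_fn` are hypothetical names, and the two callables stand in for whatever client calls are being measured.

```python
import time


def time_search_then_insert(search_fn, insert_fn, vectors, pause=0.0):
    """Run search -> insert for each vector; return per-search latencies in seconds."""
    latencies = []
    for vec_id, vec in enumerate(vectors):
        start = time.perf_counter()
        search_fn(vec)                      # hypothetical: wraps the search call
        latencies.append(time.perf_counter() - start)
        insert_fn(vec_id, vec)              # hypothetical: wraps the insert call
        time.sleep(pause)                   # wait before the next vector arrives
    return latencies


def slowdown(latencies):
    """Ratio of the slowest later search to the first (pre-insert) search."""
    return max(latencies[1:]) / latencies[0]
```

Only the first search runs before any insert, so comparing it with the later ones isolates the effect described in the report.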

Answer 1 · 2020-04-23T09:54:30.000Z

@HelWireless Is the data in the collection randomly generated? I did not see create_index() in your Jupyter notebook. Could you please provide more information to help us reproduce the problem?

Answer 2 · 2020-04-24T02:44:54.000Z

The collections were created beforehand. The data is all real, not randomly generated, and an index was created.
The code used to create the collections is as follows:

from milvus import Milvus, IndexType, MetricType

milvus = Milvus()
URI = 'tcp://192.168.0.21:19530'
milvus.connect(uri=URI)

title_param = {"collection_name": "qa_title_vec", "dimension": 300, "metric_type": MetricType.IP}
doc_param = {"collection_name": "qa_doc_sim", "dimension": 300, "metric_type": MetricType.IP}
milvus.create_collection(title_param)
milvus.create_collection(doc_param)

IVF_SQ8_param = {'nlist': 6384}
IVF_SQ8H_param = {'nlist': 6384}

milvus.create_index('qa_doc_sim', IndexType.IVF_SQ8, IVF_SQ8_param)
milvus.create_index('qa_title_vec', IndexType.IVF_SQ8H, IVF_SQ8H_param)

milvus.describe_collection("qa_title_vec")

Additional collection information:

qa_doc_sim CollectionInfo(count: 267294, partitions_stat: [PartitionStat(tag: '_default', count: 267294, segments_stat: [SegmentStat(segment_name: '1587542371490349000', count: 267294, index_name: 'IVFSQ8', data_size: 322891152)])]))
{count:267294,
dimension:300,
index:IVFSQ8,
index_file_size:1024,
metric_type:IP,
nlist:6384}
qa_title_vec CollectionInfo(count: 266776, partitions_stat: [PartitionStat(tag: '_default', count: 266776, segments_stat: [SegmentStat(segment_name: '1587548082004630000', count: 266776, index_name: 'IVFSQ8H', data_size: 322265408)])]))
{count:266776,
dimension:300,
index:IVFSQ8H,
index_file_size:1024,
metric_type:IP,
nlist:6384}
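
For a sense of how much work each query does with these settings, the posted values (`nlist` 6384, `nprobe` 1124, 266776 vectors in `qa_title_vec`) can be plugged into a rough estimate. This is a back-of-the-envelope sketch that assumes vectors are spread evenly across IVF clusters, which real data rarely satisfies exactly; `ivf_scan_estimate` is a hypothetical helper, not a Milvus API.

```python
def ivf_scan_estimate(count, nlist, nprobe):
    """Return (fraction of clusters probed, approx. vectors scanned per query).

    Assumes every cluster holds the same number of vectors.
    """
    fraction = min(nprobe, nlist) / nlist
    return fraction, count * fraction


fraction, scanned = ivf_scan_estimate(count=266776, nlist=6384, nprobe=1124)
print(f"{fraction:.1%} of clusters, ~{scanned:,.0f} vectors per query")
# prints: 17.6% of clusters, ~46,970 vectors per query
```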

Answer 3 · 2020-06-15T09:09:08.000Z

Have you retested on 0.9.0?

Answer 4 · 2020-06-24T02:23:32.000Z

No, I've solved the problem by separating reads and writes.
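
The thread does not say how reads and writes were separated. One possible reading, sketched here purely as an assumption, is to keep inserts out of the search path: new vectors are queued and written in batches by a separate writer, so a search is never immediately followed by a single-vector insert. `BufferedWriter`, `insert_fn`, and `batch_size` are hypothetical names and are not part of the Milvus client.

```python
import threading


class BufferedWriter:
    """Collects (id, vector) pairs and hands them to insert_fn in batches."""

    def __init__(self, insert_fn, batch_size=100):
        self._insert_fn = insert_fn   # hypothetical callable: insert_fn(vectors, ids)
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._ids = []
        self._vectors = []

    def add(self, vec_id, vector):
        """Queue one vector; flush automatically once the batch is full."""
        with self._lock:
            self._ids.append(vec_id)
            self._vectors.append(vector)
            full = len(self._ids) >= self._batch_size
        if full:
            self.flush()

    def flush(self):
        """Write everything queued so far; return the number of vectors written."""
        with self._lock:
            ids, vectors = self._ids, self._vectors
            self._ids, self._vectors = [], []
        if ids:
            self._insert_fn(vectors, ids)
        return len(ids)
```

With the client from the original report, `insert_fn` could be a thin wrapper around `milvus.insert(collection_name, records=vectors, ids=ids)`, run from a different thread or process than the one serving `search`. Whether this matches what the author actually did is not stated in the thread.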