milvus-io/milvus

Inserting a vector after a search severely degrades the performance of the next search

HelWireless opened this issue · 4 comments

Describe the bug
A clear and concise description of what the bug is.

Steps/Code to reproduce behavior
Follow this guide to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details

  • Hardware/Software conditions (OS, CPU, GPU, Memory)
    • centos7 , GPU Tesla V100 Memory 16g
  • Method of installation (Docker, or from source)
    • Docker
  • Milvus version (v0.3.1, or v0.4.0)
    • Milvus 0.7.0, 0.8.0
  • Milvus configuration (Settings you made in server_config.yaml)
    • engine_config:
      • use_blas_threshold: 1100
      • gpu_search_threshold: 1000
    • cache_config:
      • cpu_cache_capacity: 20
      • insert_buffer_size: 14
      • cache_insert_data: true
    • db_config:
      • preload_collection: '*'
      • auto_flush_interval: 1
    • collection info
      • qa_title_vec CollectionInfo(count: 266776, partitions_stat: [PartitionStat(tag: '_default', count: 266776, segments_stat: [SegmentStat(segment_name: '1587548082004630000', count: 266776, index_name: 'IVFSQ8H', data_size: 322265408)])]))
      • qa_doc_sim CollectionInfo(count: 267294, partitions_stat: [PartitionStat(tag: '_default', count: 267294, segments_stat: [SegmentStat(segment_name: '1587542371490349000', count: 267294, index_name: 'IVFSQ8', data_size: 322891152)])]))

Additional context
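I have a new vector. I first search the collection for its top-n most similar vectors, then insert the new vector into the collection. A short time later the next vector arrives, and when I search the collection for its top-n most similar vectors, performance is severely degraded: in my tests the difference is roughly 17x. I see this on both 0.7.0 and 0.8.0, and I tried both the IVFSQ8H and IVFSQ8 indexes.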

**My code**

from milvus import Milvus, IndexType, MetricType
import time 

MILVUS_URI = 'tcp://192.168.0.21:19530'
MILVUS_NPORBE = {"nprobe":1124} 
MILVUS_TOP_K = 1000

class MilvusTools(object):
    
    def __init__(self):
        self.URI = MILVUS_URI
        self.nprobe = MILVUS_NPORBE
        self.tok_k = MILVUS_TOP_K
        self.milvus = self._conn()

    def _conn(self):
        """
            connect to the milvus and return the object milvus
        :return:
        """
        milvus_conn = Milvus()
        milvus_conn.connect(uri=self.URI)
        return milvus_conn
    
    def get_sim_item_ids(self, collection_name, vec, ids, if_new):
        t1 = time.time()
        Status, milvus_res = self.milvus.search(collection_name=collection_name,
                      query_records=[vec], top_k=self.tok_k, params=self.nprobe)
        print("+"*60, "\n", Status,"\n","+"*60)
        t2 = time.time()
        if Status.code==0:
            sim_article_ids = milvus_res.id_array[0]
            sim_article_scores = milvus_res.distance_array[0]
            if ids in sim_article_ids:  # remove the article itself (was the builtin `id`, which never matched)
                index = sim_article_ids.index(ids)
                sim_article_ids.pop(index)
                sim_article_scores.pop(index)
        if if_new:
            # if it is a new vector, insert it into the collection
            self.milvus.insert(collection_name, records=[vec], ids=[ids])

        t3 = time.time()
        print("search time is ",t2-t1,"\n insert time is ",t3-t2)
        return milvus_res
    
    def _close(self):
        self.milvus.disconnect()


import random
# Generate 20 vectors of 300 dimension
vectors = [[random.random() for _ in range(300)] for _ in range(20)]

for ids, vector in enumerate(vectors):
    mv = MilvusTools()
    time.sleep(1)
    _ = mv.get_sim_item_ids("qa_title_vec", vector , ids=ids, if_new=True)    
    mv._close()
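
To put a number on the slowdown, the two access patterns (search only, and search followed by an insert) can be timed side by side. The following is a minimal sketch and not code from this issue: `time_searches`, `compare_search_latency`, `search_fn`, and `insert_fn` are hypothetical names, and the callables are assumed to wrap `milvus.search` and `milvus.insert` for the collection under test.

```python
import statistics
import time


def time_searches(vectors, search_fn, insert_fn=None, pause=0.0):
    """Return the latency in seconds of each search call.

    If insert_fn is given, every vector is inserted right after it has been
    searched, mirroring the search-then-insert pattern described above.
    search_fn(vec) and insert_fn(vec, vec_id) are hypothetical wrappers.
    """
    latencies = []
    for vec_id, vec in enumerate(vectors):
        start = time.perf_counter()
        search_fn(vec)
        latencies.append(time.perf_counter() - start)  # search time only
        if insert_fn is not None:
            insert_fn(vec, vec_id)
        if pause:
            time.sleep(pause)  # optional gap between consecutive requests
    return latencies


def compare_search_latency(vectors, search_fn, insert_fn, pause=0.0):
    """Median search latency without inserts, then with interleaved inserts."""
    baseline = statistics.median(time_searches(vectors, search_fn, pause=pause))
    interleaved = statistics.median(
        time_searches(vectors, search_fn, insert_fn, pause=pause)
    )
    return baseline, interleaved
```

The baseline pass runs first so that it is not affected by the vectors inserted during the second pass.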

Screenshots
image

image

@HelWireless Is the data in the collection randomly generated? I did not see create_index() in your Jupyter notebook. Could you please provide more information to help us reproduce the problem?


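The collection was created earlier. It contains real data, not randomly generated data, and an index has been created on it.
The code used to create the collection is as follows: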

from milvus import Milvus, IndexType, MetricType
milvus = Milvus()
URI = 'tcp://192.168.0.21:19530'
milvus.connect(uri=URI)

title_param = {"collection_name":"qa_title_vec","dimension":300,"metric_type":MetricType.IP}
doc_param = {"collection_name":"qa_doc_sim","dimension":300,"metric_type":MetricType.IP}
milvus.create_collection(title_param)
milvus.create_collection(doc_param)



IVF_SQ8_param = {'nlist': 6384}
IVF_SQ8H_param = {'nlist': 6384}

milvus.create_index('qa_doc_sim', IndexType.IVF_SQ8, IVF_SQ8_param)
milvus.create_index('qa_title_vec', IndexType.IVF_SQ8H, IVF_SQ8H_param)

milvus.describe_collection("qa_title_vec")

Additional collection information:

  • qa_doc_sim CollectionInfo(count: 267294, partitions_stat: [PartitionStat(tag: '_default', count: 267294, segments_stat: [SegmentStat(segment_name: '1587542371490349000', count: 267294, index_name: 'IVFSQ8', data_size: 322891152)])]))
    {count:267294,
    dimension:300,
    index:IVFSQ8,
    index_file_size:1024,
    metric_type:IP,
    nlist:6384}

  • qa_title_vec CollectionInfo(count: 266776, partitions_stat: [PartitionStat(tag: '_default', count: 266776, segments_stat: [SegmentStat(segment_name: '1587548082004630000', count: 266776, index_name: 'IVFSQ8H', data_size: 322265408)])]))
    {count:266776,
    dimension:300,
    index:IVFSQ8H,
    index_file_size:1024,
    metric_type:IP,
    nlist:6384}

Have you retested on 0.9.0?

No, I've solved that problem by separating reads from writes.
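
The thread does not show how reads and writes were separated. One possible reading is to keep inserts off the search path by queueing new vectors and writing them in batches. The sketch below illustrates that idea under stated assumptions: `BufferedInserter`, `insert_fn`, and `batch_size` are hypothetical names, and `insert_fn` is assumed to wrap a call such as `milvus.insert(collection_name, records=vectors, ids=ids)`.

```python
import threading


class BufferedInserter:
    """Queues new vectors and writes them in batches, off the search path.

    insert_fn is a hypothetical callable taking (vectors, ids); it is not
    part of the Milvus API and would wrap the real insert call.
    """

    def __init__(self, insert_fn, batch_size=1000):
        self._insert_fn = insert_fn
        self._batch_size = batch_size
        self._vectors = []
        self._ids = []
        self._lock = threading.Lock()

    def add(self, vec, vec_id):
        """Queue one vector; flush once the buffer reaches batch_size."""
        with self._lock:
            self._vectors.append(vec)
            self._ids.append(vec_id)
            full = len(self._vectors) >= self._batch_size
        if full:
            self.flush()

    def flush(self):
        """Send all queued vectors in one call; return how many were sent."""
        with self._lock:
            vectors, ids = self._vectors, self._ids
            self._vectors, self._ids = [], []
        if vectors:
            self._insert_fn(vectors, ids)
        return len(vectors)
```

With this layout the search code only calls `add()`, and `flush()` can be run from a separate worker or at a quiet time, so a search is not immediately followed by a single-vector insert.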