yorkie-team/codepair

Add Semantic Search Feature

Opened this issue · 4 comments

What would you like to be added:

I propose adding a Semantic Search feature so that documents can be searched and retrieved by meaning rather than by keyword matching alone. This would give users more relevant search results than traditional keyword matching. The conceptual architecture and workflow are illustrated in the attached images.

Key Decisions Needed:

  1. When to save/update documents in the Vector Store?

    • Options:
      • Every time a document is updated
      • Periodically through a Cron Job
      • After a set duration without updates (e.g., 10 minutes)
      • Initially embed large documents, then embed smaller updates, with periodic consolidation.
  2. How should existing documents be loaded into the Vector Store when the feature is deployed?

  3. Chunking Strategy:

    • Different chunking methods have advantages and disadvantages, including:
      • Parent-Child Chunking
      • Fixed Chunking
      • Other strategies
  4. Embedding Model:

    • What model should we use for embedding?
    • Relying on commercial models such as OpenAI's may be costly, because documents need to be embedded frequently.
    • Options like Ollama or smaller models may be sufficient and are worth exploring.
  5. Vector Store Considerations:

    • Recommendations for potential Vector Stores:
      • Milvus (29k)
      • Weaviate (10k)
      • Chroma (14k)
      • Faiss (30k)
    • The store should offer a feature such as Namespace so that data can be separated by Workspace.
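
To make a few of the options above concrete, here is a rough sketch (not a proposed implementation) of three of them: fixed chunking with overlap, re-embedding only after a document has been idle for a set duration, and keeping vectors separated by Workspace. Every name here (`chunkFixed`, `IdleTracker`, `WorkspaceIndex`) is hypothetical and exists neither in CodePair nor in any of the listed Vector Stores. The in-memory index only stands in for a real store's namespace feature, and vectors are assumed to come from whichever embedding model is chosen.

```typescript
// Hypothetical sketch: none of these names exist in CodePair or in a vector store SDK.

/** Fixed chunking: windows of `size` characters, each overlapping the previous by `overlap`. */
function chunkFixed(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

/** Tracks the last update time per document and reports documents idle for `idleMs`. */
class IdleTracker {
  private lastUpdate = new Map<string, number>();
  private idleMs: number;

  constructor(idleMs: number) {
    this.idleMs = idleMs;
  }

  /** Record that a document changed at time `now` (milliseconds). */
  touch(docId: string, now: number): void {
    this.lastUpdate.set(docId, now);
  }

  /** Return and forget every document that has not changed for at least `idleMs`. */
  takeDue(now: number): string[] {
    const due: string[] = [];
    this.lastUpdate.forEach((ts, docId) => {
      if (now - ts >= this.idleMs) due.push(docId);
    });
    for (const docId of due) this.lastUpdate.delete(docId);
    return due;
  }
}

type Entry = { id: string; vector: number[] };

/** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

/** In-memory stand-in for a namespaced vector store: one entry list per Workspace. */
class WorkspaceIndex {
  private byWorkspace = new Map<string, Entry[]>();

  /** Insert an entry, replacing any existing entry with the same id in that Workspace. */
  upsert(workspaceId: string, entry: Entry): void {
    const kept = (this.byWorkspace.get(workspaceId) ?? []).filter((e) => e.id !== entry.id);
    kept.push(entry);
    this.byWorkspace.set(workspaceId, kept);
  }

  /** Return the ids of the `topK` most similar entries within one Workspace only. */
  search(workspaceId: string, query: number[], topK: number): string[] {
    const entries = this.byWorkspace.get(workspaceId) ?? [];
    return entries
      .map((e) => ({ id: e.id, score: cosine(query, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK)
      .map((r) => r.id);
  }
}
```

For example, `chunkFixed("abcdefghij", 4, 1)` yields `["abcd", "defg", "ghij"]`. A periodic job could call `takeDue(Date.now())` and re-embed only the returned documents, which combines the "Cron Job" and "set duration without updates" options. Whether a given Vector Store offers an equivalent of the per-Workspace separation shown here still needs to be checked against its documentation.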

Why is this needed:

Integrating a Semantic Search feature will significantly enhance user experience by providing more relevant and efficient search capabilities.

Additional Information:

  • Relevant references must be gathered for informed decision-making.

image

This feature can be useful for resolving this issue: yorkie-team/yorkie#1002

@sihyeong671 Could you check this comment (yorkie-team/yorkie#1002 (comment))? A document event webhook would be useful for implementing semantic search.