xhluca/bm25s

Capability Inquiry: Retrieving Specific JSON Records Based on Text

RakshitKhajuria opened this issue · 4 comments

Hi, I am considering using the BM25 library for a project where I need to efficiently retrieve JSON records based on textual content matches. My data is a set of JSON records, each with several fields.

Use Case

When I input a query, such as "mountain cycling", I want to retrieve the top K JSON records that best match this query based on the content of the 'chunk' field.

Example JSON record

    {
        "chunk_id": 1,
        "chunk": "mountain cycling",
        "vocabulary_id": "SPORTS001",
        "vocabulary_name": "Global Sports Vocabulary",
        "concept_code": "MTCYCL001",
        "concept_name": "Mountain Cycling",
        "domain": "Outdoor Sports",
        "validity": true,
        "source": "Sports Encyclopedia"
    },

Questions

  1. Does the BM25 library support indexing and retrieving directly from JSON structures like the ones provided above, particularly focusing on a specific field for text matching?

  2. Setup Advice: If direct JSON handling is supported, could you provide guidance or documentation on how to set up the library for this specific use case?

Am I doing this correctly?

    import bm25s

    # json_data is the list of records parsed from the JSON file
    corpus_tokens = bm25s.tokenize([item['chunk'] for item in json_data])
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)

    query = "mountain cycling"
    query_tokens = bm25s.tokenize(query)

    # Perform the retrieval; results holds indices into json_data
    results, scores = retriever.retrieve(query_tokens, k=min(100, len(json_data)))
    print("Results:", results)
    print("Scores:", scores)

I was able to do it, so I'm closing this.

To answer your original question, bm25s does not provide utility for indexing json files. However, the built-in json library should be good for what you have in mind.
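
For anyone landing here later, this is a minimal sketch of that approach, not an official bm25s recipe. It assumes the records are stored as a JSON array and that the text to match lives in the 'chunk' field; the helper names (`load_records`, `field_texts`, `top_k_records`) are made up for illustration. The idea is to parse with the standard json module, index only the chosen field, and use the document indices returned by `retrieve` to look the full records back up.

```python
import json


def load_records(json_text):
    """Parse a JSON array of records using the standard-library json module."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def field_texts(records, field="chunk"):
    """Collect the value of one field, in record order, for indexing."""
    return [record[field] for record in records]


def top_k_records(records, query, k=3, field="chunk"):
    """Return (record, score) pairs for the k best matches on `field`."""
    import bm25s  # third-party; imported here so the helpers above need only json

    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(field_texts(records, field)))

    # Cap k at the corpus size so we never ask for more documents than exist.
    k = min(k, len(records))
    doc_ids, scores = retriever.retrieve(bm25s.tokenize(query), k=k)

    # Both arrays have shape (n_queries, k); row 0 belongs to our single query.
    # Each document index is a position in `records`, so it maps straight back.
    return [(records[int(i)], float(s)) for i, s in zip(doc_ids[0], scores[0])]
```

With the records loaded, something like `top_k_records(records, "mountain cycling", k=5)` should give back whole JSON records paired with their scores, rather than bare indices.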

Thank you for replying. I was able to get the results. 😊