Capability Inquiry: Retrieving Specific JSON Records Based on Text
RakshitKhajuria opened this issue · 4 comments
Hi, I am considering using the BM25 library for a project where I need to efficiently retrieve JSON records based on textual content matches. My data is a collection of JSON records, each with several fields.
Use Case
When I input a query, such as "mountain cycling", I want to retrieve the top K JSON records that best match this query based on the content of the 'chunk' field.
Example JSON record
{
"chunk_id": 1,
"chunk": "mountain cycling",
"vocabulary_id": "SPORTS001",
"vocabulary_name": "Global Sports Vocabulary",
"concept_code": "MTCYCL001",
"concept_name": "Mountain Cycling",
"domain": "Outdoor Sports",
"validity": true,
"source": "Sports Encyclopedia"
},
Questions
- Does the BM25 library support indexing and retrieving directly from JSON structures like the one above, particularly focusing on a specific field for text matching?
- Setup Advice: If direct JSON handling is supported, could you provide guidance or documentation on how to set up the library for this specific use case?
Am I doing this correctly?
import bm25s

# json_data: list of dicts already loaded from the JSON file
corpus_tokens = bm25s.tokenize([item['chunk'] for item in json_data])
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
query = "mountain cycling"
query_tokens = bm25s.tokenize(query)
# Perform the retrieval; results holds indices into json_data
results, scores = retriever.retrieve(query_tokens, k=100)
print("Results:", results)
print("Scores:", scores)
I was able to do it, so I am closing this.
To answer your original question, bm25s does not provide a utility for indexing JSON files. However, the built-in json library should be good for what you have in mind.
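For anyone landing here later, this is a minimal sketch of that approach, not an official bm25s recipe: parse the records with the built-in json module, index only the 'chunk' field, and map the returned indices back to the full records. The helper names (load_records, field_texts, top_k_records) and the two extra sample records are made up for illustration, and it assumes that retrieve() returns document indices when no corpus is passed.

```python
import json

# Made-up sample data; only the first record mirrors the one in this issue.
SAMPLE_JSON = """
[
  {"chunk_id": 1, "chunk": "mountain cycling", "concept_code": "MTCYCL001"},
  {"chunk_id": 2, "chunk": "road cycling", "concept_code": "EXAMPLE002"},
  {"chunk_id": 3, "chunk": "trail running", "concept_code": "EXAMPLE003"}
]
"""


def load_records(json_text):
    """Parse a JSON array of records (dicts) from a string."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def field_texts(records, field="chunk"):
    """Return the text of `field` for each record, in record order."""
    return [str(record[field]) for record in records]


def top_k_records(records, query, k=2, field="chunk"):
    """Return (record, score) pairs ranked by BM25 on `field` (hypothetical helper)."""
    import bm25s  # imported here so the JSON helpers work without bm25s installed

    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(field_texts(records, field)))
    k = min(k, len(records))  # keep k within the corpus size
    doc_ids, scores = retriever.retrieve(bm25s.tokenize(query), k=k)
    # Row 0 holds the hits for the single query; ids are positions in `records`.
    return [(records[int(i)], float(s)) for i, s in zip(doc_ids[0], scores[0])]
```

Calling top_k_records(load_records(SAMPLE_JSON), "mountain cycling") would then give back whole records rather than bare indices, so fields such as concept_code stay attached to each hit.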
Thank you for replying. I was able to get the results.