Teddy-XiongGZ/MedRAG

Pre-embedded corpuses error - empty json

Closed this issue · 2 comments

Following the update that makes the code download pre-embedded corpora (a great change!), I get an error when trying to run the README example:

medrag = MedRAG(llm_name=LL_NAME, rag=True, 
                retriever_name="MedCPT",
                corpus_name="Textbooks", corpus_cache=True)

Output (error):

No sentence-transformers model found with name ncbi/MedCPT-Query-Encoder. Creating a new one with CLS pooling.
Initializing the document extracter...
  0%|                                                                                                         | 0/18 [00:00<?, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File <timed exec>:1

File /mnt/d/Research2/MedRAG/src/medrag.py:84, in MedRAG.__init__(self, llm_name, rag, follow_up, retriever_name, corpus_name, db_dir, cache_dir, corpus_cache, HNSW)
     82 self.docExt = None
     83 if rag:
---> 84     self.retrieval_system = RetrievalSystem(self.retriever_name, self.corpus_name, self.db_dir, cache=corpus_cache, HNSW=HNSW)
     85 else:
     86     self.retrieval_system = None

File /mnt/d/Research2/MedRAG/src/utils.py:249, in RetrievalSystem.__init__(self, retriever_name, corpus_name, db_dir, HNSW, cache)
    247 self.cache = cache
    248 if self.cache:
--> 249     self.docExt = DocExtracter(cache=True, corpus_name=self.corpus_name, db_dir=db_dir)
    250 else:
    251     self.docExt = None

File /mnt/d/Research2/MedRAG/src/utils.py:350, in DocExtracter.__init__(self, db_dir, cache, corpus_name)
    348     continue
    349 for i, line in enumerate(open(os.path.join(self.db_dir, corpus, "chunk", fname)).read().strip().split('\n')):
--> 350     item = json.loads(line)
    351     _ = item.pop("contents", None)
    352     # assert item["id"] not in self.dict

File ~/anaconda3/envs/Medrag/lib/python3.11/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:
    348     cls = JSONDecoder

File ~/anaconda3/envs/Medrag/lib/python3.11/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    332 def decode(self, s, _w=WHITESPACE.match):
    333     """Return the Python representation of ``s`` (a ``str`` instance
    334     containing a JSON document).
    335 
    336     """
--> 337     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338     end = _w(s, end).end()
    339     if end != len(s):

File ~/anaconda3/envs/Medrag/lib/python3.11/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

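This is the error `json.loads` raises when it is given an empty string, so one of the chunk files is apparently empty or does not contain JSON.
Setting HNSW=True or HNSW=False, or using RRF-2 as the retriever, doesn't change anything.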
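To narrow down which files trigger this, something like the sketch below could be run against a corpus's chunk directory. It is only a diagnostic idea, not part of MedRAG: the function name `find_unparseable_chunks` is made up for illustration, and the directory layout (`<db_dir>/<corpus>/chunk`) is inferred from the `os.path.join` call in the traceback above.

```python
import json
from pathlib import Path


def find_unparseable_chunks(chunk_dir):
    """Return (filename, reason) pairs for chunk files json.loads would reject.

    Hypothetical helper: it mirrors the loader in the traceback, which reads
    each file, strips it, and parses it line by line as JSON.
    """
    problems = []
    for path in sorted(Path(chunk_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if not text:
            # An empty file yields a single empty "line", which is exactly
            # what produces "Expecting value: line 1 column 1 (char 0)".
            problems.append((path.name, "empty file"))
            continue
        first_line = text.split("\n")[0]
        try:
            json.loads(first_line)
        except json.JSONDecodeError as err:
            problems.append((path.name, f"invalid JSON on first line: {err.msg}"))
    return problems
```

For example, `find_unparseable_chunks("corpus/textbooks/chunk")` would list the offending files, assuming that path matches your `db_dir` and corpus folder name (both are guesses here).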

Environment: WSL2. MedCorp was already downloaded, but had only been used (and cached) with BM25.

The issue may be specific to "Textbooks": RAG runs OK with StatPearls, but fails with corpus_name="MedText" or "Textbooks".

It looks like the issue is caused by the absence of git-lfs. Was git-lfs installed on your machine when the chunks were downloaded?
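One way to check this without guessing: when a repository is cloned without git-lfs, each LFS-tracked file is left as a small text pointer whose first line is `version https://git-lfs.github.com/spec/v1`, and such a file is not valid JSON. The sketch below looks for those pointers under a corpus directory. The helper names are made up for illustration and are not part of MedRAG.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def list_lfs_pointers(root):
    """Return the sorted paths (as strings) of all LFS pointer files under root."""
    return sorted(
        str(p) for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

If `list_lfs_pointers("corpus/textbooks")` (path assumed, adjust to your `db_dir`) returns anything, the real files were never fetched. Installing git-lfs, running `git lfs install`, and then `git lfs pull` inside the cloned corpus folder would normally replace the pointers with the actual chunk files, assuming the corpus was obtained with `git clone`.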