Pre-embedded corpuses error - empty json
Closed this issue · 2 comments
ddofer commented
Following the update that makes the code download pre-embedded corpuses (a great change!), I get an error when trying to run the README example:
medrag = MedRAG(llm_name=LL_NAME, rag=True,
retriever_name="MedCPT",
corpus_name="Textbooks", corpus_cache=True)
Output (error):
No sentence-transformers model found with name ncbi/MedCPT-Query-Encoder. Creating a new one with CLS pooling.
Initializing the document extracter...
0%| | 0/18 [00:00<?, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File <timed exec>:1
File /mnt/d/Research2/MedRAG/src/medrag.py:84, in MedRAG.__init__(self, llm_name, rag, follow_up, retriever_name, corpus_name, db_dir, cache_dir, corpus_cache, HNSW)
82 self.docExt = None
83 if rag:
---> 84 self.retrieval_system = RetrievalSystem(self.retriever_name, self.corpus_name, self.db_dir, cache=corpus_cache, HNSW=HNSW)
85 else:
86 self.retrieval_system = None
File /mnt/d/Research2/MedRAG/src/utils.py:249, in RetrievalSystem.__init__(self, retriever_name, corpus_name, db_dir, HNSW, cache)
247 self.cache = cache
248 if self.cache:
--> 249 self.docExt = DocExtracter(cache=True, corpus_name=self.corpus_name, db_dir=db_dir)
250 else:
251 self.docExt = None
File /mnt/d/Research2/MedRAG/src/utils.py:350, in DocExtracter.__init__(self, db_dir, cache, corpus_name)
348 continue
349 for i, line in enumerate(open(os.path.join(self.db_dir, corpus, "chunk", fname)).read().strip().split('\n')):
--> 350 item = json.loads(line)
351 _ = item.pop("contents", None)
352 # assert item["id"] not in self.dict
File ~/anaconda3/envs/Medrag/lib/python3.11/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
341 s = s.decode(detect_encoding(s), 'surrogatepass')
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
348 cls = JSONDecoder
File ~/anaconda3/envs/Medrag/lib/python3.11/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
332 def decode(self, s, _w=WHITESPACE.match):
333 """Return the Python representation of ``s`` (a ``str`` instance
334 containing a JSON document).
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):
File ~/anaconda3/envs/Medrag/lib/python3.11/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The error looks like the kind that results from an empty (JSON) file.
Setting HNSW=True or HNSW=False, or using RRF-2, doesn't change the result.
Environment: WSL2. MedCorp was already downloaded (but previously used/cached only with BM25).
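For reference, here is a minimal sketch of why an empty chunk file produces this exact message. It mimics the `read().strip().split('\n')` loop shown in the traceback; the helper name `first_decode_error` is made up for illustration and is not part of MedRAG. Note that the same message appears for any line that does not start with a JSON value, so it does not by itself prove the file is empty.

```python
import json

def first_decode_error(text):
    """Mimic the loop in the traceback: strip, split on newlines, parse each line.

    Returns (line_index, error_message) for the first line that fails to
    parse, or None if every line is valid JSON.
    """
    for i, line in enumerate(text.strip().split("\n")):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            return i, str(err)
    return None

# An empty file still yields one (empty) "line" after split, which fails to parse.
empty_result = first_decode_error("")
# A well-formed JSONL line parses without error.
valid_result = first_decode_error('{"id": "a_0", "title": "t"}\n')

print(empty_result)
print(valid_result)
```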
ddofer commented
The issue may be specific to "Textbooks": RAG runs OK with StatPearls, but fails with corpus_name="MedText" or "Textbooks".
Teddy-XiongGZ commented
It looks like the issue is caused by the absence of git-lfs. Was git-lfs installed on your machine when you downloaded the chunks?
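One way to check this without re-downloading is sketched below. When a repository is cloned without git-lfs, each LFS-tracked file is left as a small text pointer whose first line starts with `version https://git-lfs.github.com/spec/v1`, and that line is not valid JSON. The function name `find_suspect_chunks` is hypothetical, not part of MedRAG. The directory layout `<db_dir>/<corpus>/chunk` is taken from the traceback above.

```python
import os

# First line of a Git LFS pointer file (per the Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def find_suspect_chunks(chunk_dir):
    """Return (filename, reason) pairs for chunk files that are empty
    or that look like Git LFS pointers instead of real JSONL data."""
    suspects = []
    for fname in sorted(os.listdir(chunk_dir)):
        path = os.path.join(chunk_dir, fname)
        if not os.path.isfile(path):
            continue
        # Only the first few bytes are needed to classify the file.
        with open(path, "rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if not head.strip():
            suspects.append((fname, "empty"))
        elif head.startswith(LFS_POINTER_PREFIX):
            suspects.append((fname, "git-lfs pointer"))
    return suspects
```

Calling `find_suspect_chunks(os.path.join(db_dir, "textbooks", "chunk"))`, with `db_dir` and the corpus folder name adjusted to your setup, should return an empty list for a healthy download. If it reports pointer files, installing git-lfs and running `git lfs pull` inside the cloned corpus directory should replace them with the real chunks.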