[Feature Request] build index on a sequence of json/jsonl files
MarshtompCS opened this issue · 3 comments
MarshtompCS commented
When a corpus contains a very large number of documents, it is usually split into multiple files. I wonder if it would be possible to support passing a sequence of json/jsonl files to build the index.
MarshtompCS commented
I have implemented this by passing a generator that iterates over multiple files when building the index.
AmenRa commented
Would you mind posting your solution for other people?
Thank you.
MarshtompCS commented
> Would you mind posting your solution for other people? Thank you.
Sure! The solution is like:
- Define an iterator over multiple files:

```python
import json

def many_files_line_iterator(files_list, callback=None):
    for file in files_list:
        with open(file, "r") as fn:
            for line in fn:
                record = json.loads(line)
                if callback:
                    yield callback(record)
                else:
                    yield record
```
- Pass this iterator as the `collection`:

```python
files_list = ["path_to_jsonl_0", "path_to_jsonl_1"]
search_engine.index(many_files_line_iterator(files_list, callback=None))
```
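The `callback` parameter can also be used to normalize each record before indexing. A minimal self-contained sketch, assuming each JSONL line holds `id` and `text` fields (the field names and the `keep_id_and_text` helper are hypothetical, not part of the library's API):

```python
import json
import os
import tempfile

def many_files_line_iterator(files_list, callback=None):
    # Same iterator as above: stream records from each file line by line.
    for file in files_list:
        with open(file, "r") as fn:
            for line in fn:
                record = json.loads(line)
                yield callback(record) if callback else record

# Hypothetical callback: keep only the fields the indexer needs.
def keep_id_and_text(record):
    return {"id": record["id"], "text": record["text"]}

# Small demo with two temporary JSONL files.
paths = []
for i in range(2):
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w") as f:
        f.write(json.dumps({"id": i, "text": f"doc {i}", "extra": "dropped"}) + "\n")
    paths.append(path)

records = list(many_files_line_iterator(paths, callback=keep_id_and_text))
print(records)  # [{'id': 0, 'text': 'doc 0'}, {'id': 1, 'text': 'doc 1'}]

for path in paths:
    os.remove(path)
```

Since the generator yields one record at a time, only a single line is held in memory, so the corpus never has to fit in RAM regardless of how many files it spans.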