AmenRa/retriv

[Feature Request] build index on a sequence of json/jsonl files

MarshtompCS opened this issue · 3 comments

When a corpus contains a very large number of documents, we usually split it into multiple files. Would it be possible to support passing a sequence of json/jsonl files to build an index?

I have implemented this by passing a generator that iterates over multiple files when building the index.

AmenRa commented

Would you mind posting your solution for other people?
Thank you.


Sure! The solution is like:

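  1. Define an iterator over multiple files
import json

def many_files_line_iterator(files_list, callback=None):
    # Yield one parsed JSON document per line, going through the files in order.
    for file in files_list:
        with open(file, "r") as fn:
            # Iterate over the file object so lines are streamed, not loaded all at once.
            for line in fn:
                line = json.loads(line)
                if callback:
                    yield callback(line)
                else:
                    yield line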

  2. Pass this iterator as the collection
files_list = ["path_to_jsonl_0", "path_to_jsonl_1"]
search_engine.index(many_files_line_iterator(files_list, callback=None))
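For anyone who wants to try the idea without a real corpus, here is a minimal, self-contained sketch of the same approach. It is not part of retriv: `iter_jsonl_files` and `to_id_text` are hypothetical names, the raw field names `doc_id` and `body` are made up for illustration, and the sketch assumes the indexer wants documents with `id` and `text` fields. It writes two tiny shard files to a temporary directory and reads them back through one generator.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl_files(paths, callback=None):
    """Lazily yield one parsed document per non-empty line, file by file."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:  # tolerate blank lines, e.g. at the end of a file
                    continue
                doc = json.loads(line)
                yield callback(doc) if callback else doc


def to_id_text(doc):
    # Hypothetical mapping: assumes the raw records use "doc_id" and "body".
    return {"id": doc["doc_id"], "text": doc["body"]}


# Build two small shard files so the sketch runs on its own.
shard_dir = Path(tempfile.mkdtemp())
shards = []
for i in range(2):
    shard = shard_dir / f"shard_{i}.jsonl"
    with open(shard, "w", encoding="utf-8") as fh:
        for j in range(2):
            record = {"doc_id": f"{i}-{j}", "body": f"document {j} of shard {i}"}
            fh.write(json.dumps(record) + "\n")
    shards.append(shard)

# Materialized here only to inspect the result; an indexer would consume it lazily.
docs = list(iter_jsonl_files(shards, callback=to_id_text))
```

In real use you would skip the `list(...)` call and hand the generator straight to the indexing call from step 2, for example with `sorted(shard_dir.glob("*.jsonl"))` as the file list. Keep in mind that a generator can only be consumed once, so create a fresh one if you need to index again.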