[Feature Request] build index on a sequence of json/jsonl files

Question

MarshtompCS opened this issue a year ago · 3 comments

When a corpus contains a very large number of documents, we usually split it into multiple files. Would it be possible to support passing a sequence of json/jsonl files to build the index?

Answer 1 · 2023-10-07T22:29:15.000Z

I have implemented this by passing a generator that iterates over multiple files to build the index.

Answer 2 · 2023-10-09T06:46:23.000Z

Would you mind posting your solution for other people?
Thank you.

Answer 3 · 2023-10-09T14:06:55.000Z

Sure! The solution is like:

Define an iterator over multiple files:

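import json

def many_files_line_iterator(files_list, callback=None):
    # Lazily yield one parsed JSON document per line, file by file.
    for file in files_list:
        with open(file, "r") as fn:
            # Iterate over the file object so lines are streamed,
            # not loaded into memory all at once.
            for line in fn:
                line = json.loads(line)
                if callback:
                    yield callback(line)
                else:
                    yield line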

Pass this iterator as the collection:

files_list = ["path_to_jsonl_0", "path_to_jsonl_1"]
search_engine.index(many_files_line_iterator(files_list, callback=None))
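For anyone who wants to try the idea without an index at hand, here is a minimal, self-contained sketch. The temporary file names, the `write_jsonl` helper, the `rename_fields` callback, and the `doc_id`/`body`/`id`/`text` field names are all made up for illustration, and `list(...)` only stands in for the consumer (e.g. `search_engine.index(...)`). It also skips blank lines, which the version above does not.

```python
import json
import os
import tempfile


def many_files_line_iterator(files_list, callback=None):
    """Lazily yield one parsed JSON object per non-empty line, file by file."""
    for file in files_list:
        with open(file, "r", encoding="utf-8") as fn:
            for line in fn:
                if not line.strip():
                    continue  # skip blank lines, e.g. a stray trailing newline
                doc = json.loads(line)
                yield callback(doc) if callback else doc


def write_jsonl(path, docs):
    """Hypothetical helper: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")


def rename_fields(doc):
    """Hypothetical callback: map made-up field names to 'id' and 'text'."""
    return {"id": doc["doc_id"], "text": doc["body"]}


# Create two small jsonl shards in a temporary directory.
tmp_dir = tempfile.mkdtemp()
files_list = [os.path.join(tmp_dir, f"part_{i}.jsonl") for i in range(2)]
write_jsonl(files_list[0], [{"doc_id": "a", "body": "first"},
                            {"doc_id": "b", "body": "second"}])
write_jsonl(files_list[1], [{"doc_id": "c", "body": "third"}])

# list() stands in for the real consumer of the generator.
docs = list(many_files_line_iterator(files_list, callback=rename_fields))
raw_docs = list(many_files_line_iterator(files_list))
print(docs)
```

Keep in mind that a generator can be consumed only once, so create a fresh one each time the collection has to be read again.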