Parsing a json variable in memory instead of getting it from file

Question

arcontechnologies opened this issue 3 months ago · 4 comments

arcontechnologies commented 3 months ago

Hi,
As far as I can see, there is no way to parse JSON directly from a variable in memory. Am I wrong?
Is there an example out there?
Any insight?

Answer 1 · 2024-07-06T15:23:36.000Z

@arcontechnologies - there are several entry points into the Parser. The recommended (easiest and most scalable) approach is to use library.add_files(), which takes care of collating files by type, writing directly to the DB, and organizing all of the assets into a library collection. Alternatively, you can call the parsers directly and parse in memory; the output is a list of dictionaries, which could then be pushed into a DB separately.

JSON/JSONL files - by default, the parser will look only for a "text" attribute and output that attribute as the core content. If you want to configure different behavior, then look at the parse_json_config method.

You can call:

output = Parser().parse_text("/path/to/files", write_to_db=False)

output = Parser().parse_one_text("/path/to_files", "my_file.jsonl")

output = Parser().parse_json_config -> see the example for more details to build a configured mapping

output = TextParser().jsonl_file_handler("/path/to_files", "my_file.jsonl")

You can similarly parse (and keep in memory) a single PDF, DOCX, PPTX, XLSX, WAV, TXT, CSV, TSV, etc.
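For data that already lives in a variable, one possible workaround for the file-based entry points listed above is to write the records to a temporary JSONL file first. The sketch below uses only the Python standard library; the helper name write_records_to_jsonl and the file name in_memory_records.jsonl are hypothetical, and it assumes each record carries a "text" attribute, since that is what the parser looks for by default.

```python
import json
import os
import tempfile


def write_records_to_jsonl(records, folder=None):
    """Write a list of dicts to a JSONL file, one JSON object per line.

    Hypothetical helper, not part of any library. Returns (folder, file_name)
    so the pair can be handed to a file-based parser call.
    """
    folder = folder or tempfile.mkdtemp()
    file_name = "in_memory_records.jsonl"  # hypothetical file name
    path = os.path.join(folder, file_name)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return folder, file_name


# Example records; the "text" key is the attribute the parser reads by default.
records = [{"text": "First email body."}, {"text": "Second email body."}]
folder, file_name = write_records_to_jsonl(records)
```

The returned folder and file name could then be passed to one of the file-based calls above in place of "/path/to_files" and "my_file.jsonl".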

Let me know if you have a specific use case in mind....

Answer 2 · 2024-07-06T19:14:29.000Z

@doberst Actually, my use case is the following:

I'm classifying emails in Outlook (with a spaCy classifier) and storing them in Meilisearch (a search engine similar to Elasticsearch).
Based on your YouTube video about the ReRanker, I'm doing the following:
I search for emails matching a specific query, and the response comes back as a JSON variable.
That is where my question comes in: the JSON variable is in memory, and I was expecting to parse it directly in memory and from there use the rest of your script to pass the ReRanker output to the LLM.

It seems that I have to use files to get things done. Am I wrong?

Answer 3 · 2024-07-08T15:04:31.000Z

@arcontechnologies - based on what you describe, I think you can skip the parsing step altogether. Provided that your JSON loads in memory as a list of dictionaries (e.g., json.load(open("my_json_file.json", "r"))), and that each dictionary in the list has a key called "text", you can pass that JSON output in memory directly to the reranker inference, and it will automatically create 'text chunks' from the text found in the "text" key.
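As a rough sketch of that shaping step for a search response that is already in memory: the helper below turns a response into a list of dictionaries, each with a "text" key. The function name hits_to_text_dicts and the "body" field are hypothetical (use whichever field holds the email content), and it assumes the response wraps its results in a "hits" list, as Meilisearch search responses do. The reranker call itself is left out because its exact signature is not shown in this thread.

```python
import json


def hits_to_text_dicts(response, text_field="body"):
    """Return a list of dicts that each carry a "text" key.

    Hypothetical helper. `response` may be a JSON string, a dict with a
    "hits" list, or a plain list of dicts. `text_field` names the field
    holding the email content ("body" is an assumption).
    """
    if isinstance(response, str):
        response = json.loads(response)  # parse directly from memory, no file
    hits = response.get("hits", []) if isinstance(response, dict) else response
    entries = []
    for hit in hits:
        text = hit.get(text_field)
        if text:  # skip hits with missing or empty content
            entry = dict(hit)  # keep the other fields as metadata
            entry["text"] = text
            entries.append(entry)
    return entries


# Example in-memory response, shaped like a search result.
raw = (
    '{"hits": ['
    '{"id": 1, "subject": "Invoice", "body": "Please find the invoice attached."},'
    '{"id": 2, "subject": "Empty", "body": ""}'
    "]}"
)
chunks = hits_to_text_dicts(raw)
```

The resulting list is the kind of input described above: a list of dictionaries with a "text" key, built without touching the file system.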

Answer 4 · 2024-07-08T15:18:05.000Z

@doberst Thanks for your feedback 👍