GoogleCloudPlatform/terraform-genai-doc-summarization

Evaluate whether to replace or refactor text truncation function

Closed this issue · 3 comments

The utils.truncate_complete_text function uses a heuristic to extract the abstract and conclusion from an OCR result. This approach has several issues: it cannot handle corner cases, and it neither captures every abstract and conclusion nor captures only those sections.

I recommend replacing or refactoring this module in one of the following ways:

  1. Replace the heuristic string manipulation code with a call to an in-memory NLP or LLM model.
  2. Investigate improvements to Document AI templating to get better results from OCR.
  3. Use regex (ugh) to better isolate the abstract and conclusion.
  4. Other?
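
To make option 3 concrete, here is a minimal sketch of what a regex-based extraction could look like. It is not the existing utils.truncate_complete_text implementation. The function names (extract_section, abstract_and_conclusion), the list of heading words, and the assumption that OCR output puts each section heading on its own line are all hypothetical and would need checking against real Document AI output.

```python
import re

# Assumed heading vocabulary; real papers and OCR output may use other names.
_KNOWN_HEADINGS = (
    r"abstract|introduction|conclusions?|references|acknowledge?ments?|keywords"
)


def _heading_regex(names: str) -> re.Pattern:
    """Match a line holding only an optionally numbered heading, e.g. '5. Conclusions'."""
    return re.compile(
        rf"^[ \t]*(?:\d+\.?[ \t]+)?(?:{names})[ \t]*[:.]?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


def extract_section(text: str, names: str) -> str:
    """Return the text between the first heading matching `names` and the
    next known heading, or an empty string if no such heading is found."""
    start = _heading_regex(names).search(text)
    if start is None:
        return ""
    # Look for the next known heading after the one we just matched.
    end = _heading_regex(_KNOWN_HEADINGS).search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


def abstract_and_conclusion(text: str, fallback_chars: int = 2000) -> str:
    """Join the abstract and conclusion; fall back to simple truncation
    (an assumed default of 2000 characters) when neither is found."""
    parts = [
        extract_section(text, "abstract"),
        extract_section(text, "conclusions?"),
    ]
    found = [p for p in parts if p]
    return "\n\n".join(found) if found else text[:fallback_chars]


if __name__ == "__main__":
    sample = (
        "A Paper Title\n"
        "Abstract\n"
        "We study X.\n"
        "1 Introduction\n"
        "Body text.\n"
        "5. Conclusions\n"
        "X works.\n"
        "References\n"
        "[1] A. Author.\n"
    )
    print(abstract_and_conclusion(sample))
```

This still shares the weakness described above: headings that are run into the body text, split across lines, or named differently will be missed, which is why the fallback path exists.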

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

I would like to work on this!
