LangStream/langstream

Add an example application about how to properly deal with stale documents on the vector database

eolivelli opened this issue · 1 comments

None of the example applications that we currently have show how to deal with these two common issues:

Shorter pages

When you re-index a website, the new version of a page may be shorter and therefore produce fewer chunks.
You can overwrite the chunks with lower ids, but the old chunks with higher ids remain in the database.
We need to show how to remove these stale chunks.
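To make the idea concrete, here is a minimal sketch of the cleanup step. It is not LangStream code: the in-memory dict stands in for a vector database table, and `Store` and `reindex_page` are hypothetical names. It assumes chunks are keyed by (document name, chunk id) with ids starting at 0.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for a vector database table,
# keyed by (document name, chunk id).
Store = Dict[Tuple[str, int], str]


def reindex_page(store: Store, name: str, chunks: List[str]) -> int:
    """Upsert the new chunks of a page, then delete leftover higher-id chunks.

    Returns the number of stale chunks that were removed.
    """
    # Overwrite (or create) chunks 0 .. len(chunks) - 1.
    for chunk_id, text in enumerate(chunks):
        store[(name, chunk_id)] = text

    # Any chunk of this page with id >= len(chunks) comes from an older,
    # longer version of the page and is now stale.
    stale = [key for key in store if key[0] == name and key[1] >= len(chunks)]
    for key in stale:
        del store[key]
    return len(stale)


store: Store = {}
reindex_page(store, "docs/intro.html", ["a", "b", "c"])      # first crawl: 3 chunks
removed = reindex_page(store, "docs/intro.html", ["a2", "b2"])  # page got shorter
```

Against a real database the second step would be a single delete by document name with a "chunk id greater than or equal to N" condition, provided the store supports range deletes on the chunk id.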

Pages that disappeared

This is trickier. When you know that you are re-indexing the whole corpus of documents (for instance a whole website), you should drop the documents that are no longer available. Otherwise you risk keeping outdated documents or duplicate content (for example, when a page has been renamed).
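One possible approach, sketched below under the same assumptions (an in-memory dict keyed by (document name, chunk id) in place of a real vector database, and a hypothetical `drop_missing_documents` helper): once a full crawl has finished, compare the set of indexed document names with the set of names seen during the crawl, and delete every chunk of the documents that were not seen.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical in-memory stand-in for a vector database table,
# keyed by (document name, chunk id).
Store = Dict[Tuple[str, int], str]


def drop_missing_documents(store: Store, crawled_names: Iterable[str]) -> Set[str]:
    """Delete all chunks of documents that the latest full crawl did not see.

    Only safe to call after a complete crawl: with a partial crawl,
    documents that still exist would be dropped as well.
    Returns the names of the documents that were removed.
    """
    seen = set(crawled_names)
    indexed = {name for name, _ in store}
    missing = indexed - seen

    # Collect the keys first, since a dict cannot be mutated while iterating it.
    for key in [k for k in store if k[0] in missing]:
        del store[key]
    return missing


store: Store = {
    ("docs/intro.html", 0): "intro, part 1",
    ("docs/intro.html", 1): "intro, part 2",
    ("docs/old-name.html", 0): "page that was later renamed",
}
# The latest full crawl found the renamed page under a new name only.
dropped = drop_missing_documents(store, ["docs/intro.html", "docs/new-name.html"])
```

In this sketch a renamed page is handled like a deleted page plus a new page: the chunks under the old name are dropped, and the chunks under the new name are written by the normal indexing path.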

The first part has been delivered in the 0.3.0 release