karlicoss/promnesia

real time indexing

karlicoss opened this issue · 6 comments

E.g. something inotify based. That would make the implementation quite a bit more complex than it is at the moment.
Also, since many exports are periodic by nature, it won't be truly realtime unless the underlying exports are realtime.
Still, it could at least detect changes to source files, etc.
It would also work well in conjunction with Grasp.
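
A minimal sketch of the file-watching part, using the third-party watchdog library (a cross-platform wrapper over inotify). The paths and the small-config setup are assumptions for illustration, not anything Promnesia ships today:

import subprocess
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class ReindexHandler(FileSystemEventHandler):
    # naive: reindexes on every filesystem event; a real version would debounce
    def on_any_event(self, event):
        subprocess.run(['promnesia', 'index', '--config', '/path/to/small/config'])

observer = Observer()
observer.schedule(ReindexHandler(), '/path/to/notes', recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()

This only reacts to changes in the source files themselves, so as noted above it can only be as fresh as the underlying exports.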

Might need to be careful about closing libmagic #124 (comment)

Relevant: I've implemented 'almost realtime' indexing recently:

INDEX_POLICY = os.environ.get('PROMNESIA_INDEX_POLICY', 'overwrite_all')

E.g. you can have a separate config file with only your text notes (which should be indexed very quickly). Then if you run

PROMNESIA_INDEX_POLICY=update promnesia index --config /path/to/small/config, it will merge the results into the main database.

That means you can run it very often (e.g. every five minutes), or potentially combine it with entr to achieve 'realtime' indexing.
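
For example, something like this should work with entr (the note paths and config location here are placeholders):

find ~/notes -name '*.md' | entr -s 'PROMNESIA_INDEX_POLICY=update promnesia index --config /path/to/small/config'

entr reruns the command whenever one of the watched files changes, which gets close to realtime for local note files.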

The last comment here needs to make it into the main docs.

Even better, a new option could be added like promnesia index --update, so that the above preserves existing items in the server's database:

promnesia index --update --config <small-config> --secrets <secret-file>

But what about de-duplication? Are there any issues with updates?

Yep, good idea to pass it in the cmdline args! It was somewhat experimental at first, so I made it an env variable, but it seems to work pretty well (apart from one minor race condition I might need to fix first).
Maybe it even makes sense to make --update mode the default? I guess the worst that could happen is some stale entries remaining in the database -- then if the user notices them, they can do a full reindex manually.

Regarding deduplication -- not sure what you mean?
This is how it works at the moment:

policy_update = update_policy_active()
if not policy_update:
    # default ('overwrite_all') policy: write into a fresh temporary database
    engine = create_engine(f'sqlite:///{tpath}')
else:
    # 'update' policy: write into the existing database
    engine = create_engine(f'sqlite:///{db_path}')
binder = NTBinder.make(DbVisit)
meta = MetaData(engine)
table = Table('visits', meta, *binder.columns)
meta.create_all()
# data sources whose old visits have already been cleared during this run
cleared: Set[str] = set()
with engine.begin() as conn:
    for chunk in chunked(vit_ok(), n=_CHUNK_BY):
        srcs = set(v.src or '' for v in chunk)
        new = srcs.difference(cleared)
        for src in new:
            # delete stale visits for this source before inserting the fresh ones
            conn.execute(table.delete().where(table.c.src == src))
            cleared.add(src)
        # ...the chunk itself is inserted after this (omitted from the excerpt)

So it clears all the entries corresponding to the data source first, and then inserts them. Hopefully that shouldn't result in duplication!

Hmm, seems that it was closed automatically by GitHub -- we don't really have realtime indexing yet, so I'll reopen.

Perhaps for actual 'realtime' this would need proper HPI support.
E.g. an HPI module exposes a generator or something, which Promnesia can poll (presumably in a loop over all Promnesia sources).
Not sure how easy it'll be to make it asynchronous enough though, and it's also going to be tricky to 'expire' stale Visits, but it could work well for incremental/synthetic sources (which are typically the most expensive computationally).
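
A rough sketch of what that polling could look like; everything here is hypothetical (visits_stream, poll_sources, and the Visit shape are illustrations, not existing HPI or Promnesia APIs):

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterator, List
import time

@dataclass
class Visit:
    url: str
    dt: datetime
    src: str

# hypothetical HPI-side module: yields only visits newer than 'since'
def visits_stream(since: datetime) -> Iterator[Visit]:
    # a real module would read from the underlying export incrementally
    yield from []

# hypothetical Promnesia-side poller: loops over all sources periodically
def poll_sources(sources: List[Callable[[datetime], Iterator[Visit]]],
                 interval_s: float = 300.0) -> None:
    last_run = datetime.min
    while True:
        now = datetime.now()
        for stream in sources:
            for visit in stream(last_run):
                # merge into the database here, e.g. with the same
                # delete-then-insert-per-source logic as in the snippet above
                print(visit)
        last_run = now
        time.sleep(interval_s)

poll_sources([visits_stream])

The open questions from above still apply: this polls rather than pushes, and it has no way to expire Visits that have disappeared from the source.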