mindflowai/mindflow

Need more explanations on how Indexing works

AlexNetman opened this issue · 4 comments

  1. I hit a bug when trying to index a folder that is not under git. Please fix it. (Workaround: run git init in the folder.)
  2. It is not clear how the indexing scope works. I cannot really tell whether the model still has my files in mind or not. What kind of experiment can we run to check that? It did not work well for me: I indexed a few JS files and asked a question, and it answered using general Python knowledge instead.
  3. I want to manage separate indexes (contexts), one for each project. How big can they be? How long are they stored?

1 - Yes, indexing may not work as well outside of git repositories for now. We recommend using it inside git repositories for the time being, but if this workaround works for you, then it works for me!
2 - Hmm, I am not sure exactly what is going on, since I do not know the commands you were running, but I suspect that you are not passing the path(s) to the chat command. If no path is specified, it is basically ChatGPT in the command line. Running the chat command with a question and one or more paths will automatically index all of the files within those paths. To query them later, you need to pass the path(s) again, or some files contained within them. It will reuse the index created previously, but it will only use the portion of the index that falls within the paths given in the query.
3 - Currently, a small amount of data is stored in a JSON file for each file you index. It would add up to a large amount of data if you index many files, but it probably works out to 50-500 characters per file depending on the file's size, plus some metadata about the file. Entries are stored indefinitely, but you can view or delete portions of your index with mf inspect *path* and mf delete *path*.
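
To make the path scoping from point 2 concrete, here is a minimal Python sketch. It is not MindFlow's actual code: the `index` dictionary, its contents, and the `scoped_entries` helper are all hypothetical, and it only illustrates the idea that a query sees the entries at or under the paths passed with it.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory index: file path -> stored summary text.
# (Assumption for illustration; the real on-disk JSON layout is not shown here.)
index = {
    "project/src/app.js": "summary of app.js",
    "project/src/utils.js": "summary of utils.js",
    "project/docs/readme.md": "summary of readme.md",
}


def scoped_entries(index, query_paths):
    """Return only the index entries at or under one of query_paths."""
    scopes = [PurePosixPath(p) for p in query_paths]
    selected = {}
    for file_path, summary in index.items():
        path = PurePosixPath(file_path)
        # Keep the entry if it is a scope itself or sits below a scope directory.
        if any(path == scope or scope in path.parents for scope in scopes):
            selected[file_path] = summary
    return selected


# Passing a directory selects every indexed file below it.
in_scope = scoped_entries(index, ["project/src"])

# Passing no paths selects nothing, so no file context would be used.
no_scope = scoped_entries(index, [])
```

In this sketch, leaving the paths out yields an empty selection, which matches the behavior described above: without a path, the question is answered without any of your files as context.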

I hope this answers your questions. If you have any more please ask. We'll be working on improving documentation moving forward. As a heads up though, we are looking to rework the indexing and querying mechanism soon, so it should be much cheaper and faster, but will be done differently.

  1. For what purpose is that JSON stored? How is it used? I was expecting to have the full file in the context.
  2. If I have several files indexed in a folder and I change one of them, can I re-index only that file?

The file text is what is ultimately used as context. When querying, however, we use a vector embedding similarity approach. We found that we got much better results when comparing the query's embedding against an embedding of a summary of each file, rather than against the file text/code directly. Because the summaries are slow and sometimes costly to generate, we save them in that JSON file. What is ultimately given to ChatGPT is not the summary but the text of the file, so we don't need to save the file text itself.
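
As a rough illustration of that flow, the Python sketch below ranks files by the similarity between the query and each stored summary, then returns the file text (not the summary) as context. Everything in it is an assumption made for the example: `embed` is a toy bag-of-words stand-in for a real embedding model, and the file names, summaries, and `select_context` helper are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical file contents, read from disk at query time (not stored in the index).
files = {
    "auth.js": "function login(user, pass) { /* ... */ }",
    "cart.js": "function addItem(cart, item) { /* ... */ }",
}

# Hypothetical saved summaries: the slow, costly part that gets cached.
summaries = {
    "auth.js": "handles user login and password checks",
    "cart.js": "adds and removes items in the shopping cart",
}
summary_vectors = {path: embed(text) for path, text in summaries.items()}


def select_context(query, top_k=1):
    """Rank files by query-to-summary similarity, then return the file text."""
    query_vector = embed(query)
    ranked = sorted(
        summary_vectors,
        key=lambda path: cosine(query_vector, summary_vectors[path]),
        reverse=True,
    )
    return [(path, files[path]) for path in ranked[:top_k]]


context = select_context("how does user login work")
```

The summaries are only used for ranking. What ends up in `context` is the original file text, which is why only the summaries need to be saved.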

To your second point, yes. This is how it should work. It will only re-index that changed file.
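
One common way to get this behavior is to store a content hash next to each summary and regenerate the summary only when the hash changes. The Python sketch below shows that idea. It is an assumption about how change detection could work, not a description of MindFlow's implementation, and `reindex`, `content_hash`, and `fake_summarize` are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Hash of the file text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reindex(index, files, summarize):
    """Update index in place and return the paths that were (re)summarized."""
    updated = []
    for path, text in files.items():
        digest = content_hash(text)
        entry = index.get(path)
        # Only pay for a new summary when the file is new or its text changed.
        if entry is None or entry["hash"] != digest:
            index[path] = {"hash": digest, "summary": summarize(text)}
            updated.append(path)
    return updated


def fake_summarize(text):
    """Placeholder for the slow, costly summary call (assumption for the demo)."""
    return text[:20]


index = {}
files = {"a.js": "let a = 1;", "b.js": "let b = 2;"}

first_pass = reindex(index, files, fake_summarize)   # both files are new

files["b.js"] = "let b = 3;"                         # edit one file
second_pass = reindex(index, files, fake_summarize)  # only b.js is redone
```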