How do you get the diagnosis knowledge?

Question

Closed this issue a year ago · 3 comments

Dear authors,
I really appreciate your efforts in contributing to this wonderful repo. I have some questions about the acquisition of diagnosis knowledge.

I see you have a file, diagnosis_code.txt, which contains code to test different root causes. How did you obtain this code? Was it curated from your own DB maintenance experience or obtained from online resources?
Does it also depend on the specific software/DB you want to diagnose, since some metric names can differ?
Do you have any suggestions for how I could obtain such a knowledge base? I'm thinking about applying your software to our service.

Thanks so much for your consideration!

Answer 1 · 2023-10-13T12:18:24.000Z

Thank you so much for your interest in d-bot.

Q1. The code comes from the experience of our DBA friends. Currently we still rely on well-prepared code blocks or maintenance documents (around 100 pages, which we will also release after removing sensitive info). Obtaining knowledge from public websites by hand is dirty and hard work, and as far as we know much valuable maintenance knowledge is not publicly available.

Q2. Yes. Apart from a limited amount of general knowledge (e.g., the overall diagnosis steps), the specific knowledge (e.g., metric/view/column names) depends heavily on the target software/DB product and is hard to share across products. If you hope to use d-bot on a system other than PostgreSQL (which we currently support), you need to check and replace the knowledge, monitoring exporter, and tool modules.
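
One way to picture the split described in Q2, as a minimal sketch and not d-bot's actual layout: the generic diagnosis steps refer to logical metric names, and a per-product table supplies the concrete query. The names `METRIC_QUERIES` and `resolve_metric` are hypothetical; the two queries are ordinary PostgreSQL and MySQL statements used only for illustration.

```python
# Hypothetical sketch: product-specific knowledge kept behind logical metric names.
# METRIC_QUERIES and resolve_metric are illustrative names, not part of d-bot.

METRIC_QUERIES = {
    "postgresql": {
        "active_connections": "SELECT count(*) FROM pg_stat_activity",
    },
    "mysql": {
        "active_connections": "SHOW GLOBAL STATUS LIKE 'Threads_connected'",
    },
}


def resolve_metric(product: str, metric: str) -> str:
    """Return the product-specific query for a logical metric name."""
    try:
        return METRIC_QUERIES[product][metric]
    except KeyError:
        # Missing product or metric: the knowledge has to be written first.
        raise LookupError(f"no knowledge for {metric!r} on {product!r}") from None
```

With a layout like this, moving to another product would mean adding a new entry to the table (plus the matching exporter and tools) while leaving the generic steps untouched.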

Q3. To build such a knowledge base, documents and extraction scripts are the most important ingredients. We will release a powerful extraction script this week.

All in all, diagnosis is a task where "the devil is in the details". We will try to provide a more elegant solution for Q2 and Q3, and hopefully learn more from you :)

Answer 2 · 2023-10-13T13:15:37.000Z

Thank you for your answers and suggestions.

I agree that it is really hard to manually curate such knowledge. Some companies have well-organized diagnosis knowledge bases, but they are not open-sourced and also depend on their specific systems.

I also have some thoughts from diagnosing systems that I hope to discuss with you.
A big problem I have met is that we already have many algorithms to detect anomalies, localize root causes, etc., to help failure diagnosis, but they are also designed for specific systems. I think LLM agents might be a way to recover from failures automatically: use an LLM to take a close look at the system behavior and reason like SREs. It is really like what SREs/DBAs do.

Although I'm not sure how "reliable" the final mitigation suggestions provided by agents could be, I think the diagnosis can be automated, since using some functions to test the system and obtain system metrics does not intrusively affect the system. The period before the suggested mitigations are implemented is therefore safe, and humans can check the mitigation suggestions before implementing them.
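
The separation described above can be sketched as follows, under stated assumptions: diagnosis only reads metrics, and a suggested mitigation is a plain record that is not executed until a human approves it. All names here (`Suggestion`, `diagnose`, `apply_if_approved`) and the threshold of 500 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Suggestion:
    root_cause: str
    mitigation: str
    approved: bool = False  # meant to be set by a human reviewer, not by the agent


def diagnose(read_metrics: Callable[[], Dict[str, float]]) -> List[Suggestion]:
    """Read metrics only; return suggestions without changing the system."""
    metrics = read_metrics()
    suggestions = []
    # Illustrative rule with an arbitrary threshold.
    if metrics.get("active_connections", 0) > 500:
        suggestions.append(
            Suggestion("too many connections", "lower the connection limit or add pooling")
        )
    return suggestions


def apply_if_approved(s: Suggestion, execute: Callable[[str], None]) -> bool:
    """Run the mitigation only after approval; report whether it ran."""
    if not s.approved:
        return False
    execute(s.mitigation)
    return True
```

The point of the design is that `diagnose` never receives anything capable of changing the system, so running it automatically carries little risk, while every change goes through the approval flag.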

I think your work is really promising. I'm looking forward to your new release!

Answer 3 · 2023-11-02T09:58:54.000Z

A basic doc2knowledge is available (https://github.com/TsinghuaDatabaseGroup/DB-GPT#-customize).