
🧐 Knowledge QA LLM


📣 We're looking for front-end development engineers interested in knowledge-base QA with LLMs to help us separate the front end from the back end in the current implementation.

  • Question answering based on a local knowledge base + LLM.
  • Advantages:
    • The whole project is modular and does not depend on the langchain library; each component can easily be replaced, and the code is simple and easy to understand.
    • Apart from the large language model interface, which must be deployed separately, every other part can run on CPU.
    • Supports documents in common formats, including txt, md, pdf, docx, pptx and xlsx. Other document types can also be supported through customization (see the loader sketch after this list).
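
As a concrete illustration, a custom loader can be as small as a function that turns a file into a list of sentences. The sketch below is a minimal, hypothetical example: the dict-based registry and the function names are assumptions, not the actual mechanism in knowledge_qa_llm/file_loader.

    # Hypothetical custom loader; the real registration mechanism in
    # knowledge_qa_llm/file_loader may differ.
    from pathlib import Path
    from typing import Callable, Dict, List

    def load_txt(path: str) -> List[str]:
        """Read a plain-text file and split it into non-empty lines."""
        text = Path(path).read_text(encoding="utf-8")
        return [line.strip() for line in text.splitlines() if line.strip()]

    # Illustrative registry mapping file extensions to loader functions.
    LOADERS: Dict[str, Callable[[str], List[str]]] = {".txt": load_txt}

    def extract_text(path: str) -> List[str]:
        suffix = Path(path).suffix.lower()
        if suffix not in LOADERS:
            raise ValueError(f"Unsupported file type: {suffix}")
        return LOADERS[suffix](path)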

Architecture

  • Parse the document and store it in the database
    flowchart LR
    
    A([Documents]) --ExtractText--> B([Sentences])
    B --Embeddings--> C([Embeddings])
    C --Store--> D[(Database)]
    
  • Retrieve and answer questions
    flowchart LR
    E([Query]) --Embedding--> F([Embeddings]) --> H[(Database)] --Search--> G([Context])
    E --> I([Prompt])
    G --> I --> J([LLM]) --> K([Answer])
    
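Put together, the two flows amount to: embed and store sentences at ingest time, then embed the query, retrieve the most similar sentences, and hand them to the LLM inside a prompt. The sketch below is a minimal stand-in under assumed interfaces: encoder is expected to expose encode(), llm to be a callable from prompt to answer, and TinyVectorStore replaces the real on-disk database; none of these names are the project's actual API.

    # Minimal stand-in for both architecture flows; every name here is
    # illustrative rather than the project's real API.
    import numpy as np

    class TinyVectorStore:
        """In-memory substitute for the on-disk database in assets/db."""

        def __init__(self):
            self.vectors = []    # one 1-D embedding per sentence
            self.sentences = []

        def add(self, vecs, sentences):
            # Documents --ExtractText--> Sentences --Embeddings--> Store
            self.vectors.extend(vecs)
            self.sentences.extend(sentences)

        def search(self, query_vec, top_k=3):
            # Cosine similarity between the query and stored embeddings.
            mat = np.asarray(self.vectors)
            sims = mat @ query_vec / (
                np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec)
            )
            return [self.sentences[i] for i in np.argsort(sims)[::-1][:top_k]]

    def answer(query, encoder, store, llm):
        # Query --Embedding--> Search --> Context --> Prompt --> LLM --> Answer
        context = "\n".join(store.search(encoder.encode([query])[0]))
        prompt = f"Answer using the context.\nContext:\n{context}\nQuestion: {query}"
        return llm(prompt)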

Installation

  1. Clone the repo to a local directory.
    git clone https://github.com/RapidAI/Knowledge-QA-LLM.git
  2. Install the requirements.
    cd Knowledge-QA-LLM
    pip install -r requirements.txt
  3. Download the moka-ai/m3e-small model and put it under the assets/models/m3e-small directory. This model is used to vectorize the text content.
  4. Deploy the chatglm2-6b interface separately; for starting the interface, refer to ChatGLM2-6B API. For how it is called, refer to knowledge_qa_llm/llm/chatglm2_6b.py.
  5. Write the URL of the deployed LLM API into the llm_api_url field of the configuration file knowledge_qa_llm/config.yaml. Quick sanity checks for steps 3-5 follow this list.
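
To verify step 3, the downloaded model can be loaded with the sentence-transformers library; this is only a sanity check, and the project's own encoder module may load it differently.

    # Sanity check of the embedding model via sentence-transformers;
    # the project's encoder module may wrap the model differently.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("assets/models/m3e-small")
    vecs = model.encode(["QA based on a local knowledge base"])
    print(vecs.shape)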
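
For step 4, the reference api.py in the ChatGLM2-6B repo serves a JSON endpoint that takes a prompt and a history; a minimal call might look like the sketch below, but verify the URL and payload against your own deployment and against knowledge_qa_llm/llm/chatglm2_6b.py.

    # Minimal sketch of calling the separately deployed ChatGLM2-6B API;
    # the URL is an assumed placeholder and the payload should be checked
    # against your deployment.
    import requests

    resp = requests.post(
        "http://localhost:8000",  # assumed value of llm_api_url
        json={"prompt": "Hello", "history": []},
        timeout=60,
    )
    print(resp.json()["response"])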
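
For step 5, the snippet below reads the value back to confirm the configuration is in place; it assumes llm_api_url is a top-level key in config.yaml.

    # Read back the configured llm_api_url; assumes it is a top-level key.
    import yaml

    with open("knowledge_qa_llm/config.yaml", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    print(config["llm_api_url"])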

Usage

  1. Run

    streamlit run webui.py
  2. UI Demo

  3. CLI Demo

    python cli.py

📂 File structure

.
├── assets
│ ├── db                # vector database storage
│ ├── models            # embedding models (e.g. m3e-small)
│ └── raw_upload_files
├── knowledge_qa_llm
│ ├── __init__.py
│ ├── config.yaml       # configuration file
│ ├── file_loader       # handles documents in various formats
│ ├── encoder           # extracts embeddings
│ ├── llm               # LLM interface; the model is deployed separately and called over its API
│ ├── utils
│ └── vector_utils      # embedding storage and search
├── LICENSE
├── README.md
├── requirements.txt
├── tests
├── cli.py
└── webui.py            # UI implementation based on streamlit

Changelog

  • 2023-08-29 v0.0.8 update:
    • Fixed missing embedding_extract
    • Fixed default parameters of LLM
  • 2023-08-11 v0.0.7 update:
    • Optimized the layout, removed the plugin option, and moved the embedding-model option to the home page.
    • Translated the UI tips into English for easier communication.
    • Added the project logo: 🧐
    • Updated the CLI module code.
  • 2023-08-05 v0.0.6 update:
    • Adapted more LLM APIs, including online APIs such as ERNIE-Bot-Turbo.
    • Added a status indicator for embedding extraction.
  • 2023-08-04 v0.0.5 update:
    • Fixed duplicate data being inserted into the database.
  • 2023-07-29 v0.0.4 update:
    • Reorganized the UI based on streamlit==1.25.0.
    • Optimized the code.
    • Recorded a GIF demo of the UI.
  • 2023-07-28 v0.0.3 update:
    • Finished the file_loader part.
  • 2023-07-25 v0.0.2 update:
    • Standardized the existing directory structure to be more compact and extracted some variables into config.yaml.
    • Improved the documentation.

Contributing

  • Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
  • Please make sure to update tests as appropriate.