GenAI Project Template and Notes (periodically updated)

This repository maintains a limited selection of code, resources and articles related to the field of GenAI and its application to chatbots (with a focus on RAG-type architectures). The general purpose is to test the solutions proposed in this field using LLMs. Any contribution is welcome.

We organize our notes and tests following the GenAI application development model proposed in the course taught by DeepLearning.AI. The template has 6 steps:

Define use case
Choose an existing model or pre-train your own
Adapt and align model
1. prompt engineering
2. fine-tuning
3. align with human feedback
Evaluate
Optimize
Deploy

RAG-Paradigms

arxiv

Tools and resources:

Use case: GenAI for chatbot in the Finance Sector with RAG:
1. As a linguistic object, financial statements are characterized by a unique blend of features. They consist of formal, technical language with a heavy reliance on specialized financial and accounting terminology. The structure is highly standardized and regulated, ensuring a consistent format across various documents. The language is predominantly objective, focusing on quantitative data and factual information. It's also legally cautious, often including disclaimers and cautionary statements. Narrative elements are present, especially in sections like Management’s Discussion and Analysis (MD&A), providing qualitative insights. The use of passive voice is common, emphasizing actions and results over the entities performing them. Additionally, these documents feature a mix of concise yet comprehensive descriptions, ensuring clarity and specificity. Speculative language is used carefully in forward-looking statements, indicating projections and expectations. Companies may also be cautious in revealing sensitive data that could advantage competitors. Therefore, while financial statements provide key financial data, the presentation is often calibrated to serve both transparency and corporate strategy.
2. Data characteristics and format? Rich format documents!
  1. Tables: Langchain_1, Microsoft tabletransformer, RAG-Table,
    1. Table_reasoning RegHNT, github
    2. Table_reasoning UniRPG, github
  2. Stream_response: langchain_1, langchain
  3. Multimodality: Langchain, RAG-table-image,
3. Legal compliance requirements? High-risk AI according to EU_AI_Act, Consumer_Financial_Protection_Bureau_US_on_Chatbots
4. Solutions (well, almost):
  1. RAG-ADVANCED_Azure-AISearch-OpenAI, RAG-SIMPLE_Azure_AISearch-OpenAI
    1. Test-OpenAI-Chat: Playground
  2. SECInsights
  3. OpenAI-RAG
  4. Azure-GPT-RAG - youtube:globant!
  5. OpenAI-Langchain-Redis:FinTemplate
  6. OpenAI-Agents-Financial, colab
Existing models:
1. GPT-Playground
2. LLama2-chat
3. FinMA
4. FinGPT
5. Mistral-7b
6. Mistral-8x7b-SMoe, mistral-on-colab, 2, 3, 4
7. finBert
8. FinanceConnect-13b
9. LLM360
10. Phi-2
11. ...
12. Private Models: BloombergGPT, interesting info (e.g. Training Chronicles)
Adapt and Align (AA):
1. Aggregations and Math:
  1. LLM-Compiler, llama-api
2. AA:Prompt:
  1. MedPrompt
  2. PROMPT_GUIDE
  3. OPENAI_PROMPT_GUIDE
3. AA:FineTune:
  1. AdaptLLMstoDomains
  2. ft_llama2_LoRA: summarization and NER.
  3. ...
  - Datasets:
4. AA: Align and HF
  1. Pearl
  2. DPO
Evaluation
1. Promptbench
2. TruLens, 2
3. Langchain-Huggingface, or Langchain, or Langchain
4. Promptfoo
Benchmarks
1. FinanceBench, github, whitepaper
2. FinQA, github
3. TAT-QA, github
4. ConFIRM, github
5. FLANG-FLUE, huggingface,
Optimize
1. FineTuning_OpenAI
Deploy
1. Azure-AISearch-OpenAI, for creating dataset see ConFIRM,github

Scripts Tested

All tests run in a VM on the Google Cloud free tier: 24 vCPU, 84G RAM, 100G disk, Ubuntu 22. I wrote an installation script that sets up a non-secure IDE. When the installation has finished, create the Python environment as follows:

a. python3 -m venv ~/.genai0
b. source ~/.genai0/bin/activate
c. python3 -m pip install --upgrade pip
d. pip install -r requirements.txt
e. Set the interpreter in the project settings: type and select "Python: Select Interpreter", then choose the interpreter from your .genai0 virtual environment. It should be something like /root/.genai0/bin/python3.
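
To confirm that the IDE picked up the right interpreter after steps a to e, a small check like the one below can be run from the project. It is only a sketch: the helper names `in_virtualenv` and `uses_env` are made up for this note, and the default path simply mirrors the `~/.genai0` directory used above.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the installation it was created from.
    return sys.prefix != sys.base_prefix


def uses_env(env_dir: str = "~/.genai0") -> bool:
    """Return True when the running interpreter lives in env_dir (hypothetical helper)."""
    return Path(sys.prefix).resolve() == Path(env_dir).expanduser().resolve()


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
    print("inside ~/.genai0:", uses_env())
```

If the last line prints `False`, the IDE is most likely still using the system interpreter and step e should be repeated.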

Open-Source models tested
Private Model tested:
1. gpt-3.5-turbo-trulens-eval

Querying Strategies and VectorDB

QS:
2. Basics 3. Advanced
1. SubQuestionQueryEngine for complex questions
2. Small-to-big retrieval for improved precision
3. Metadata filtering, also for improved precision
4. Hybrid search including traditional search engine techniques: IMPORTANT
5. Recursive Retrieval for complex documents: RecursiveRetriever
6. Text to SQL
7. Multi-document agents that can combine all of these techniques
8. Ensembles: EnsembleRetriever
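
As an illustration of the hybrid search idea in item 4, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), one common way to combine result lists. It is an assumption for illustration only, not the method of any tool linked here; the function name, the document ids, and the two result lists are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two independent retrievers.
keyword_hits = ["10-K_p12", "10-K_p40", "10-Q_p3"]
vector_hits = ["10-K_p40", "10-K_p7", "10-K_p12"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers rise to the top, which is the usual reason for adding keyword search next to embeddings when queries contain exact terms such as tickers or account names.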
VDB (for performance comparison: vectorview, ANN-Benchmarks)
1. Qdrant, llamaQdrant, performance_evaluation
2. Azure_AI-Search_docu, vector-search, code
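
When comparing vector databases, retrieval quality is commonly reported as recall against an exact (brute-force) search, alongside speed. The sketch below shows that measurement on toy data. Everything in it is hypothetical: the vectors, the ids, and the `approx` list, which stands in for whatever a database under test would return.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force search: ids of the k vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda vid: (-cosine(query, vectors[vid]), vid))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data: four 2-d embeddings keyed by chunk id.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.05]

truth = exact_top_k(query, vectors, k=2)
approx = ["a", "c"]  # placeholder for the ids returned by the database under test
print(truth, recall_at_k(approx, truth))
```

Brute force is only practical on small samples, so in practice the ground truth would be computed once on a subset of queries and reused across databases.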
DB configuration:
1. RAG in Azure AI Search, video