ChatData 🔍 📖

We are constantly improving LangChain's self-query retriever. Some of the features are not merged yet.

Yet another chat-with-documents app, but one that supports queries over millions of files with MyScale and LangChain.

Introduction 📖

Overview

ChatData is a robust chat-with-documents application designed to extract information and provide answers by querying the MyScale free knowledge base or your uploaded documents.

Powered by the Retrieval Augmented Generation (RAG) framework, ChatData leverages millions of Wikipedia pages and arXiv papers as its external knowledge base, with MyScale managing all data hosting tasks. Simply input your questions in natural language, and ChatData takes care of generating SQL, querying the data, and presenting the results.

Enhancing your chat experience, ChatData introduces three key features. Let's delve into each of them in detail.

Feature 1: Retriever Type

MyScale works closely with LangChain to provide a simple interface for building complex queries with LLMs.

Self-querying retriever: MyScale augments LangChain's Self-Querying Retriever so that the LLM can use more data types, for instance timestamps and arrays of strings, to build filters for the query.

Vector SQL: SQL is powerful and can be used to construct complex search queries. Vector Structured Query Language (Vector SQL) is designed to teach LLMs how to query SQL vector databases. Besides the general data types and functions, Vector SQL contains extra functions like DISTANCE(column, query_vector) and NeuralArray(entity), which extend standard SQL for vector search.
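To make the role of NeuralArray(entity) more concrete, here is a minimal sketch of how such a placeholder could be expanded into a literal vector before the query is sent to the database. This is an illustration under assumptions, not the actual parser: `toy_embed` and `expand_neural_array` are hypothetical names, and a real setup would call an embedding model instead of the toy function.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for a real embedding model (assumption for illustration).
def toy_embed(text: str) -> List[float]:
    return [float(len(text)), float(sum(map(ord, text)) % 7)]

# Matches NeuralArray(<entity>) where the entity contains no parentheses.
_NEURAL_ARRAY = re.compile(r"NeuralArray\((.*?)\)")

def expand_neural_array(vector_sql: str, embed: Callable[[str], List[float]]) -> str:
    """Replace every NeuralArray(entity) with a literal array of floats."""
    def _sub(match: re.Match) -> str:
        entity = match.group(1).strip().strip("'\"")
        vector = embed(entity)
        return "[" + ",".join(repr(x) for x in vector) + "]"
    return _NEURAL_ARRAY.sub(_sub, vector_sql)

# A Vector SQL query as an LLM might write it (column names taken from this README).
vector_sql = (
    "SELECT id, DISTANCE(vector, NeuralArray(image segmentation)) AS dist "
    "FROM default.ChatArXiv ORDER BY dist LIMIT 5"
)
plain_sql = expand_neural_array(vector_sql, toy_embed)
print(plain_sql)
```

The point of the sketch is the division of labor: the LLM only names the entity to search for, and the embedding step is filled in afterwards.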

Feature 2: Session Management

To enhance your experience and seamlessly continue interactions with existing sessions, ChatData has introduced the Session Management feature. You can easily customize your session ID and modify your prompt to guide ChatData in addressing your queries. With just a few clicks, you can enjoy smooth and personalized session interactions.

Feature 3: Building Your Own Knowledge Base

In addition to tapping into ChatData's external knowledge base powered by MyScale for answers, you also have the option to upload your own files and establish a personalized knowledge base. We've implemented the Unstructured API for this purpose, ensuring that only processed texts from your documents are stored, prioritizing your data privacy.

In conclusion, with ChatData you can navigate vast amounts of data and access precisely what you need. Whether you're a researcher, a student, or a knowledge enthusiast, ChatData empowers you to explore academic papers and research documents like never before. Unlock the true potential of information retrieval with ChatData and discover a world of knowledge at your fingertips.

➡️ Dive in and experience ChatData on Hugging Face🤗

Data schema

Database credentials:

MYSCALE_HOST = "msc-950b9f1f.us-east-1.aws.myscale.com"
MYSCALE_PORT = 443
MYSCALE_USER = "chatdata"
MYSCALE_PASSWORD = "myscale_rocks"

[NEW] Table `wiki.Wikipedia`

ChatData also gives you access to Wikipedia, a large knowledge base that contains about 36 million paragraphs from 5 million wiki pages. The knowledge base is a snapshot from December 2022.

You can query from this table with the public account here.

CREATE TABLE wiki.Wikipedia (
    -- Record ID
    `id` String, 
    -- Title of the page containing this paragraph
    `title` String, 
    -- Paragraph text
    `text` String,
    -- Page URL
    `url` String,
    -- Wiki page ID
    `wiki_id` UInt64,
    -- View statistics
    `views` Float32,
    -- Paragraph ID
    `paragraph_id` UInt64,
    -- Language ID
    `langs` UInt32, 
    -- Feature vector of this paragraph
    `emb` Array(Float32), 
    -- Vector Index
    VECTOR INDEX emb_idx emb TYPE MSTG('metric_type=Cosine'), 
    CONSTRAINT emb_len CHECK length(emb) = 768) 
ENGINE = ReplacingMergeTree ORDER BY id SETTINGS index_granularity = 8192
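As a rough illustration of how this schema can be searched, the sketch below builds a nearest-neighbor query over the `emb` column. It assumes MyScale's `distance(column, query_vector)` function and assumes that you already have a 768-dimensional query embedding from the same model that produced `emb`; `build_wiki_search_sql` is a hypothetical helper name, not part of ChatData.

```python
from typing import Sequence

EMB_DIM = 768  # matches the emb_len constraint in the table definition

def build_wiki_search_sql(query_emb: Sequence[float], top_k: int = 5) -> str:
    """Build a vector search query against wiki.Wikipedia (sketch only)."""
    if len(query_emb) != EMB_DIM:
        raise ValueError(f"expected {EMB_DIM} dimensions, got {len(query_emb)}")
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Serialize the embedding as an array literal.
    vec = "[" + ",".join(f"{float(x):.6f}" for x in query_emb) + "]"
    # The index uses the Cosine metric, so a smaller distance means a closer match.
    return (
        f"SELECT title, url, text, distance(emb, {vec}) AS dist "
        f"FROM wiki.Wikipedia ORDER BY dist ASC LIMIT {int(top_k)}"
    )
```

The resulting string can be passed to any client connected with the public credentials above, for example through `clickhouse_connect`.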

Table `default.ChatArXiv`

ChatData brings millions of papers into your knowledge base. We imported 2.2 million papers with metadata info, which contains:

id: paper's arXiv ID
abstract: paper's abstract, used as the ranking criterion (with InstructXL)
vector: column that contains the vector array in Array(Float32)
metadata: LangChain VectorStore compatible columns
1. metadata.authors: paper's authors as a list of strings
2. metadata.abstract: paper's abstract, used as the ranking criterion (with InstructXL)
3. metadata.titles: paper's title
4. metadata.categories: paper's categories as a list of strings like ["cs.CV"]
5. metadata.pubdate: paper's date of publication as an ISO 8601 formatted string
6. metadata.primary_category: paper's primary category as a string defined by arXiv
7. metadata.comment: additional comments on the paper

The columns below are native columns in MyScale and can only be used through SQLDatabase:

authors: paper's authors in list of strings
titles: paper's title
categories: paper's categories in list of strings like ["cs.CV"]
pubdate: paper's date of publication in Date32 data type (faster)
primary_category: paper's primary category in strings defined by arXiv
comment: some additional comment to the paper
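The native columns lend themselves to ordinary SQL filters. The following sketch builds such a filter on `categories` and `pubdate`, using the column names listed above; `build_arxiv_filter_sql` is a hypothetical helper, and `has` and `toDate32` are standard ClickHouse functions that are assumed to be available in MyScale.

```python
from datetime import date

def build_arxiv_filter_sql(category: str, published_after: date, limit: int = 10) -> str:
    """Build a metadata filter query on default.ChatArXiv (sketch only)."""
    # Accept only arXiv-style category names such as "cs.CV" or "cond-mat.stat-mech".
    if not category.replace(".", "").replace("-", "").isalnum():
        raise ValueError(f"unexpected category name: {category!r}")
    return (
        "SELECT id, titles, pubdate FROM default.ChatArXiv "
        f"WHERE has(categories, '{category}') "
        f"AND pubdate >= toDate32('{published_after.isoformat()}') "
        f"ORDER BY pubdate DESC LIMIT {int(limit)}"
    )

print(build_arxiv_filter_sql("cs.CV", date(2023, 1, 1), limit=5))
```

Because `pubdate` is stored as Date32, the comparison runs on a native date type rather than on strings.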

For the overall table schema, please refer to the table creation section in docs/self-query.md.

If you want to use this database with langchain.chains.sql_database.base.SQLDatabaseChain or langchain.retrievers.SQLDatabaseRetriever, please follow the guides in the data preparation section and the chain creation section of docs/vector-sql.md.

Where can I get the arXiv data?

From parquet files on S3

Or use the MyScale database directly as a service... for FREE ✨

import clickhouse_connect

client = clickhouse_connect.get_client(
    host='msc-950b9f1f.us-east-1.aws.myscale.com',
    port=443,
    username='chatdata',
    password='myscale_rocks'
)
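Once connected, a quick way to check access is to read a few rows. The helper below is a small sketch: `peek_rows` and `ALLOWED_TABLES` are hypothetical names, the column names follow the schemas described in this README, and it relies on `client.query(...)` returning an object with a `result_rows` attribute, as `clickhouse_connect` does.

```python
from typing import Any, List, Tuple

# Tables described in this README, with the columns to preview from each.
ALLOWED_TABLES = {
    "default.ChatArXiv": "id, titles",
    "wiki.Wikipedia": "id, title",
}

def peek_rows(client: Any, table: str, limit: int = 3) -> List[Tuple]:
    """Return the first few rows of a known table using an existing client."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    columns = ALLOWED_TABLES[table]
    result = client.query(f"SELECT {columns} FROM {table} LIMIT {int(limit)}")
    return list(result.result_rows)
```

With the `client` created above, `peek_rows(client, "default.ChatArXiv")` would return a short list of tuples, one per paper.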

Monthly Updates 🔥 (November-2023)

🚀 Upload your documents and chat with your own knowledge bases with MyScale!
💬 Chat with RAG-enabled agents on both ArXiv and Wikipedia knowledge base!
📖 Wikipedia is available as a knowledge base!! Feel FREE 💰 to ask across 36 million paragraphs under 5 million titles! 💫
🤖 LLMs are now capable of writing Vector SQL - an extended SQL with vector search! Vector SQL allows you to access MyScale faster and stronger! This will be added to LangChain soon! (PR 7454)
🌏 Customized Retrieval QA Chain that gives you more information on each PDF and answers questions in your native language!
🔧 Our contribution to LangChain that helps self-query retrievers filter with more types and functions
🌟 We just opened a FREE pod hosting data for arXiv papers. Anyone can try their own SQL with vector search!!! Feel the power when SQL meets vector search! See how to access the pod here.
📚 We collected about 2 million papers on arXiv! We are collecting more and we need your advice!
More coming...

How to build your own app from scratch 🧱

Quickstart

Enter directory app/

cd app/

Create a virtual environment

python3 -m venv venv
source venv/bin/activate

Install dependencies

python3 -m pip install -r requirements.txt

Run the app!

# copy the example secrets file, then fill in your OpenAI key in .streamlit/secrets.toml
cp .streamlit/secrets.example.toml .streamlit/secrets.toml
# start the app
python3 -m streamlit run app.py

With LangChain SQLDatabaseRetrievers

Read the full article

With LangChain Self-Query Retrievers