Create tutorials
gabrielchua opened this issue · 8 comments
Jupyter Notebook style
Hi @gabrielchua, can you guide me a little on this? I would love to work on it.
Hey @alhridoy,
Thanks for offering.
How about starting with this code snippet? You can make a starter notebook with some examples:
from ragxplorer import RAGxplorer
client = RAGxplorer(embedding_model="thenlper/gte-large")
client.load_pdf("presentation.pdf", verbose=True)
client.visualize_query("What are the top revenue drivers for Microsoft?")
Then you can try changing the following:
- When initializing the RAGxplorer object, you can change the embedding_model argument to different embedding models from HuggingFace (e.g. BAAI/bge-large-en) or OpenAI (e.g. the new text-embedding-3-small).
- The visualize_query method has the following arguments:
  - retrieval_method, which can be naive (default), HyDE or multi_qns
  - top_k, which is an int (recommend 3 to 10) and defaults to 5
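The variations listed above could be organized in a starter notebook roughly as follows. This is only a sketch: the helper names (build_experiments, make_ragxplorer_client, run_experiments) are hypothetical and not part of ragxplorer, and it assumes that visualize_query accepts retrieval_method and top_k as keyword arguments and can be called repeatedly on the same client. The OpenAI model is left out because it would presumably need an API key.

```python
from itertools import product

# Values taken from the suggestions in this thread; adjust as needed.
EMBEDDING_MODELS = ["thenlper/gte-large", "BAAI/bge-large-en"]
RETRIEVAL_METHODS = ["naive", "HyDE", "multi_qns"]
TOP_K_VALUES = [3, 5, 10]


def build_experiments(models=EMBEDDING_MODELS,
                      methods=RETRIEVAL_METHODS,
                      top_ks=TOP_K_VALUES):
    """Return one settings dict per (model, method, top_k) combination."""
    return [
        {"embedding_model": m, "retrieval_method": r, "top_k": k}
        for m, r, k in product(models, methods, top_ks)
    ]


def make_ragxplorer_client(embedding_model):
    """Create a real client (hypothetical helper).

    The import is done lazily so the settings grid can be built
    even where ragxplorer is not installed.
    """
    from ragxplorer import RAGxplorer
    return RAGxplorer(embedding_model=embedding_model)


def run_experiments(experiments, pdf_path, query,
                    make_client=make_ragxplorer_client):
    """Load the PDF once per embedding model, then visualize each setting.

    Assumption: visualize_query takes retrieval_method and top_k as
    keyword arguments. Whatever it returns (possibly None) is collected.
    """
    clients = {}
    results = []
    for exp in experiments:
        model = exp["embedding_model"]
        if model not in clients:
            client = make_client(model)
            client.load_pdf(pdf_path, verbose=True)
            clients[model] = client
        output = clients[model].visualize_query(
            query,
            retrieval_method=exp["retrieval_method"],
            top_k=exp["top_k"],
        )
        results.append((exp, output))
    return results
```

In a notebook, something like run_experiments(build_experiments(), "presentation.pdf", "What are the top revenue drivers for Microsoft?") would step through every combination. Trimming the lists first keeps the run short, since each embedding model has to embed the whole PDF.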
Feel free to ping here if you run into any issues.
Hi @gabrielchua, when I write import plotly.graph_objs as go, the following error appears. I also installed plotly manually, but that does not fix the problem.
ModuleNotFoundError Traceback (most recent call last)
Cell In[20], line 1
----> 1 from ragxplorer import RAGxplorer
File ~/Desktop/projects/RAGxplorer/ragxplorer/__init__.py:7
1 """
2 __init__.py
3
4 Initializes the ragxplorer package and exposes the main classes and functions.
5 """
----> 7 from .ragxplorer import RAGxplorer
9 __all__ = ['RAGxplorer']
File ~/Desktop/projects/RAGxplorer/ragxplorer/ragxplorer.py:19
11 import pandas as pd
13 from chromadb.utils.embedding_functions import (
14 SentenceTransformerEmbeddingFunction,
15 OpenAIEmbeddingFunction,
16 HuggingFaceEmbeddingFunction
17 )
---> 19 import plotly.graph_objs as go
21 from .rag import (
22 build_vector_database,
23 get_doc_embeddings,
24 get_docs,
25 query_chroma
26 )
28 from .projections import (
29 set_up_umap,
30 get_projections,
31 prepare_projections_df,
32 plot_embeddings
33 )
ModuleNotFoundError: No module named 'plotly'
Any idea?
Hi @alhridoy
Are you using a virtual environment? Could you run pip freeze and provide the results?
Hi @gabrielchua, thanks. Yes, I used a virtual environment. Here is the result of pip freeze:
absl-py==2.0.0
abstract_singleton==1.0.1
affine==2.4.0
aiofiles==23.2.1
aiohttp==3.8.4
aiosignal==1.3.1
anthropic==0.3.6
anyio==3.7.1
appdirs==1.4.4
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asciitree==0.3.3
asgiref==3.7.2
astor==0.8.1
asttokens==2.4.1
astunparse==1.6.3
async-generator==1.10
async-lru==2.0.4
async-timeout==4.0.2
asynctest==0.13.0
attr==0.3.2
attrs==23.1.0
auto_gpt_plugin_template==0.0.3
autoflake==2.1.1
autopep8==2.0.2
Babel==2.14.0
backcall==0.2.0
backoff==2.2.1
bcrypt==4.0.1
beautifulsoup4==4.12.2
bert4keras==0.11.4
black==23.3.0
bleach==6.1.0
blessed==1.20.0
blis==0.7.9
boltons==21.0.0
bracex==2.4
cachetools==5.3.0
camel-converter==3.0.3
catalogue==2.0.8
certifi==2023.11.17
cffi==1.15.1
cfgv==3.3.1
channels==4.0.0
chardet==5.1.0
charset-normalizer==3.1.0
chroma-hnswlib==0.7.3
chromadb==0.4.15
click==8.1.3
click-option-group==0.5.6
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
comm==0.2.1
confection==0.0.4
contourpy==1.1.1
coverage==7.2.3
crcmod==1.7
cryptography==40.0.2
cssselect==1.2.0
cymem==2.0.7
dataclasses-json==0.5.14
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.6
distro==1.8.0
Django==4.2.2
dnspython==2.3.0
docker==6.0.1
docopt==0.6.2
docutils==0.18.1
duckduckgo-search==2.8.6
earthengine-api==0.1.374
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl
et-xmlfile==1.1.0
exceptiongroup==1.1.1
executing==2.0.0
expecttest==0.1.6
face==22.0.0
faiss-cpu==1.7.4
fastapi==0.104.1
fastavro==1.8.4
fasteners==0.19
fastjsonschema==2.18.1
filelock==3.12.0
fiona==1.9.5
fire==0.4.0
flake8==6.0.0
flatbuffers==23.5.26
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.9.2
geopandas==0.14.0
ghp-import==2.1.0
git-python==1.0.3
gitdb==4.0.10
GitPython==3.1.31
glom==22.1.0
google-api-core==2.11.0
google-api-python-client==2.86.0
google-auth==2.17.3
google-auth-httplib2==0.1.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.59.0
greenlet==3.0.0
grpcio==1.59.0
grpcio-tools==1.59.0
gTTS==2.3.1
h11==0.14.0
h2==4.1.0
h5py==3.8.0
hpack==4.0.0
httpcore==0.17.0
httplib2==0.22.0
httptools==0.6.1
httpx==0.24.0
huggingface-hub==0.16.4
humanfriendly==10.0
hyperframe==6.0.1
hypothesis==6.88.1
identify==2.5.22
idna==3.4
imagesize==1.4.1
immutabledict==3.0.0
importlib-metadata==6.8.0
importlib-resources==6.1.0
iniconfig==2.0.0
inquirer==3.1.3
ipykernel==6.29.0
ipython==8.20.0
ipywidgets==8.1.1
isodate==0.6.1
isoduration==20.11.0
isort==5.12.0
jedi==0.19.1
Jinja2==3.1.2
joblib==1.3.2
json5==0.9.14
jsonpointer==2.4
jsonschema==4.19.1
jsonschema-spec==0.2.4
jsonschema-specifications==2023.7.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.0.11
jupyterlab-widgets==3.0.9
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.2
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
kubernetes==28.1.0
lancedb==0.1.16
langchain==0.0.231
langchainplus-sdk==0.0.20
langcodes==3.3.0
lazy-object-proxy==1.9.0
libcst==1.0.1
litellm==0.1.824
locket==1.0.0
loguru==0.6.0
lxml==4.9.2
Markdown==3.3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.2
marshmallow==3.20.1
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
meilisearch==0.21.0
mergedeep==1.3.4
-e git+https://github.com/geekan/metagpt@ee4d59cd396813be5e5fb674f9c7a40184ad86c9#egg=metagpt
mistune==3.0.2
mkdocs==1.4.2
monotonic==1.6
more-itertools==10.1.0
mpmath==1.3.0
multidict==6.0.4
murmurhash==1.0.9
mypy-extensions==1.0.0
nbclient==0.9.0
nbconvert==7.14.2
nbformat==5.9.2
nest-asyncio==1.5.8
networkx==3.1
nltk==3.8.1
nodeenv==1.7.0
notebook==7.0.7
notebook_shim==0.2.3
numcodecs==0.12.0
numexpr==2.8.7
numpy==1.24.3
oauthlib==3.2.2
objsize==0.6.1
onnxruntime==1.16.1
open-interpreter==0.1.7
openai==0.28.1
openapi-core==0.18.1
openapi-python-client==0.13.4
openapi-schema-pydantic==1.2.4
openapi-schema-validator==0.6.2
openapi-spec-validator==0.6.0
openpyxl==3.1.2
opentelemetry-api==1.20.0
opentelemetry-exporter-otlp-proto-common==1.20.0
opentelemetry-exporter-otlp-proto-grpc==1.20.0
opentelemetry-proto==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
optree==0.9.2
orjson==3.9.8
outcome==1.2.0
overrides==7.4.0
packaging==23.1
pandas==2.0.3
pandocfilters==1.5.1
parse==1.19.1
parso==0.8.3
pathable==0.4.3
pathspec==0.11.1
pathy==0.10.1
peewee==3.17.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.5.0
pinecone-client==2.2.1
platformdirs==3.2.0
playsound==1.2.2
plotly==5.18.0
pluggy==1.0.0
portalocker==2.8.2
posthog==3.0.2
prance==23.6.21.0
pre-commit==3.2.2
preshed==3.0.8
prometheus-client==0.19.0
prompt-toolkit==3.0.43
proto-plus==1.22.3
protobuf==4.22.3
psutil==5.9.5
ptyprocess==0.7.0
pulsar-client==3.3.0
pure-eval==0.2.2
py==1.11.0
Do you mind creating a new virtualenv and just running pip install ragxplorer and pip install jupyterlab?
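The pip freeze output above lists plotly==5.18.0, yet the import fails. One possible explanation, which this thread does not confirm, is that the notebook kernel runs a different Python interpreter from the one pip installed into. A minimal sketch for checking that from inside a notebook cell, using only the standard library (the helper names check_modules and describe_environment are hypothetical):

```python
import importlib.util
import sys


def check_modules(names):
    """Map each module name to where it would be imported from.

    The value is None when the current interpreter cannot find the module.
    """
    found = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            found[name] = None
        else:
            # Namespace packages have no single file origin.
            found[name] = spec.origin or "(found, no file origin)"
    return found


def describe_environment(names=("plotly", "ragxplorer")):
    """Summarize which interpreter is running and what it can import."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "modules": check_modules(names),
    }
```

Calling describe_environment() once in the notebook and once in a python session started from the terminal where pip was run makes the two easy to compare. If the executable paths differ, the kernel is not using the environment that has plotly installed.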
I'd like to help too. Could somebody who's been working on this (@alhridoy ?) already share their ipynb so that we don't have to 'relearn' what they have already learned?
Thanks for adding the ipynb! Just a few points/requests:
- is there a way to specify we want to use a GPU? (it's VERY slow now)
- could you also add a code snippet on how to feed it a text column from, for instance, a pandas dataframe that contains multiple rows representing multiple documents? (E.g. I use GROBID to parse out things like headers, footers and footnotes, and then get a 'text' column with the full text of a PDF.) Or elements from a database?
- we now see 'retrieved', 'chunks' and 'original query' in the viz, and can hover over these - but could we also make the 'fill' of those text boxes transparent, so that we can still see where the other nodes are that we may want to explore? And could we also see the file name (for instance) in that box that the displayed text is coming from?
- could we also add an explanation of what exactly these 'retrieved' and 'chunks' categories are?
- finally - is there a way to ALSO use the LLM to provide an actual answer to the query (as opposed to just the viz)?
Just suggestions!