Tom, Henrique, Michel, Corentin
Oct. - Nov. 2022
https://artefactory.github.io/redis-team-THM/
https://thm-cli.community.saturnenterprise.io/api/docs
This demo showcases the vector similarity search feature of Redis Enterprise.
RediSearch enables developers to add documents and their embedding indexes to the database, turning Redis into a vector database that can be used for modern data-driven web applications.
See Architecture for how it works and User Workflow for how it can be used.
- Documentation
- History of Changes
- Machine Setup
- Architecture
- User Workflow
- Running The Application
- Benchmarks
- Basic Demo | GitHub
- Redis Vector Similarity Search
- Huggingface Tokenizers + Models
- Cornell University - arXiv dataset, the arxiv-metadata-oai-snapshot.json file is used
- FastAPI, pydantic, redis-om
- redis, see Vector database and JSON storage
- 1/11 - Added a multi-category classifier, a Question Answering engine and a CLI HTTP client to the backend
- 31/10 - Draft blog posts and CLI ETL tool
- 30/10 - Refactored RedisVentures/redis-arXiv-search project
- 27/10 - Set up Redis Cloud Enterprise and Saturn Cloud accounts and organized within the team
- 15/10 - Added a blog based on Pelican
- 15/10 - Added CI/CD script
- 15/10 - Forked from RedisVentures/redis-arXiv-search
brew install yarn redis
pip install -r backend/requirements.txt
pip install -r scripts/requirements.txt
The user searches the Redis database through a REST API exposed by an HTTP server.
We wrote a small interactive CLI client tool that performs calls to the HTTP Server and returns papers matching the user queries.
writes pickle and loads index
+-------------------+ +----------------+
| | | |
| Redis +<-----+ ETL CLI |
| | | |
+--------+----------+ +----------------+
^
| reads search index
+--------+----------+
| |
| FastAPI |
| |
+--------+----------+
^
| calls backend
+--------+----------+ +---------------------+
| | | |
| THM CLI +----->+ arxiv.org |
| | | wolframalpha.com |
+-------------------+ +---------------------+
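The two Redis-facing arrows in the diagram (the ETL CLI loading the index and FastAPI reading it) can be illustrated with a small, dependency-free sketch. It assumes the embeddings are stored as raw float32 byte blobs, which is what a FLOAT32 vector field expects, and it reproduces in plain Python the cosine-similarity ranking that a nearest-neighbour query performs conceptually. The function names, the toy 3-dimensional vectors, and the paper ids are hypothetical; in the real application Redis performs the search itself.

```python
import math
import struct


def to_float32_bytes(vector):
    """Pack a list of floats into a little-endian float32 byte string."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Inverse of to_float32_bytes: 4 bytes per component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (paper_id, score) pairs most similar to the query vector."""
    scored = [
        (paper_id, cosine_similarity(query, from_float32_bytes(blob)))
        for paper_id, blob in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: real embeddings have hundreds of dimensions.
index = {
    "paper-a": to_float32_bytes([1.0, 0.0, 0.0]),
    "paper-b": to_float32_bytes([0.0, 1.0, 0.0]),
    "paper-c": to_float32_bytes([0.9, 0.1, 0.0]),
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for paper_id, score in results:
    print(paper_id, round(score, 3))
```

A brute-force scan like this is only practical for tiny collections; delegating the ranking to a vector index inside Redis is what lets the API stay a thin layer.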
researcher uses the THM CLI while writing research
This CLI tool is a quick assistant for a researcher's daily activities and helps improve their efficiency.
It can be used alongside a text editor and a browser, and helps with:
- building a bibliography in Markdown or BibTeX format,
- checking the PDF papers on the arXiv website,
- checking scientific facts on the Wolfram Alpha website.
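As an illustration of the bibliography step, here is a minimal sketch of how a search result could be rendered as a Markdown list item or a BibTeX entry. The dictionary keys (`paper_id`, `title`, `authors`, `year`) and both function names are assumptions made for this example, not the actual response schema or code of the THM CLI, and the paper shown is a placeholder.

```python
def to_markdown(paper):
    """Render one paper as a Markdown list item linking to its arXiv page."""
    url = f"https://arxiv.org/abs/{paper['paper_id']}"
    authors = ", ".join(paper["authors"])
    return f"- [{paper['title']}]({url}), {authors} ({paper['year']})"


def to_bibtex(paper):
    """Render one paper as a BibTeX @misc entry keyed by its arXiv id."""
    fields = [
        ("title", paper["title"]),
        ("author", " and ".join(paper["authors"])),
        ("year", str(paper["year"])),
        ("eprint", paper["paper_id"]),
        ("archivePrefix", "arXiv"),
        ("url", f"https://arxiv.org/abs/{paper['paper_id']}"),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{arxiv:{paper['paper_id']},\n{body}\n}}"


# Placeholder record with hypothetical field names.
paper = {
    "paper_id": "0000.00000",
    "title": "Example Paper Title",
    "authors": ["Ada Lovelace", "Alan Turing"],
    "year": 2022,
}

print(to_markdown(paper))
print(to_bibtex(paper))
```

Keeping both formatters as pure functions over the same record makes it easy to paste either output straight into a text editor.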
graph TD;
welcome_message-->choose_activity;
welcome_message-->configure_parameters;
choose_activity-->search_keywords;
search_keywords-->Search_API;
choose_activity-->search_similar_to;
search_similar_to-->Search_API;
choose_activity-->fetch_paper_details;
fetch_paper_details-->Search_API;
choose_activity-->ask_open_question;
ask_open_question-->HuggingFacePipeline;
choose_activity-->find_formula;
find_formula-->Wolfram_Alpha_API;
Set up Redis Enterprise Cloud, then:
cd backend/
./start.sh
open http://0.0.0.0:8080/api/docs
cd scripts/
pip install -r requirements.txt
bash retrain_model.sh
cd scripts/
pip install -r requirements.txt
./thm-cli.py
cd scripts/
pip install -r requirements.txt
./pipeline.sh
# To preview files locally
pelican blog/content && pelican --listen
# To publish on GitHub pages
make publish_blog
The project uses the UKPLab/sentence-transformers library to compute dense vector representations for sentences found in Cornell's arXiv corpus.
We found the following NLP models from the community-built leaderboard interesting:
- sentence-transformers/all-mpnet-base-v2 has embeddings of size 768 and relatively good performance
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/all-MiniLM-L12-v2 has embeddings of size 384 and is interesting for development because inference is faster
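The embedding size also drives storage cost. The arithmetic below is a rough sketch that assumes each component is stored as a 4-byte float32 and uses a hypothetical collection of 100,000 papers; it counts raw vector bytes only and ignores any index overhead.

```python
BYTES_PER_FLOAT32 = 4


def vector_bytes(dimensions):
    """Raw size in bytes of one float32 embedding."""
    return dimensions * BYTES_PER_FLOAT32


def collection_megabytes(dimensions, n_papers):
    """Raw size in MB (10^6 bytes) of n_papers embeddings, without index overhead."""
    return vector_bytes(dimensions) * n_papers / 1_000_000


N_PAPERS = 100_000  # hypothetical collection size

for name, dims in [("all-mpnet-base-v2", 768), ("all-MiniLM-L12-v2", 384)]:
    print(
        f"{name}: {vector_bytes(dims)} bytes per vector, "
        f"{collection_megabytes(dims, N_PAPERS):.1f} MB for {N_PAPERS} papers"
    )
```

Under these assumptions the 384-dimensional model needs half the vector storage of the 768-dimensional one, which adds to its appeal during development.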
We also used transformers.AutoModelForSequenceClassification for the problem of multi-category classification.
For the problem of Question Answering we used distilbert-base-cased-distilled-squad.
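One common way to turn a sequence-classification model's raw outputs into several categories per paper is to apply a sigmoid to each logit and keep the labels above a threshold. The sketch below assumes that multi-label setup; the logits, the 0.5 threshold, and the `predict_categories` helper are hypothetical and are not taken from the project's code.

```python
import math


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_categories(logits, labels, threshold=0.5):
    """Keep every label whose probability reaches the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    return [
        (label, round(p, 3))
        for label, p in zip(labels, probabilities)
        if p >= threshold
    ]


# arXiv category names with made-up logits for a single paper.
labels = ["cs.CL", "cs.LG", "math.ST"]
logits = [2.0, 0.3, -1.5]

predicted = predict_categories(logits, labels)
print(predicted)
```

Unlike a softmax, the per-label sigmoid lets a paper belong to several categories at once, which matches how arXiv papers are cross-listed.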
graph TD;
sentence-transformers/all-MiniLM-L12-v2-->THM_API;
transformers.AutoModelForSequenceClassification-->THM_API;
THM_API-->THM_CLI;
distilbert-base-cased-distilled-squad-->THM_CLI;
See our blog for the benchmarks we ran to evaluate the full solution.
Changes and improvements are welcome! Feel free to fork and open a pull request into main.