Based on the guide at https://neo4j.com/blog/import-10m-stack-overflow-questions/. Note that this part is I/O intensive due to the CSV format, so storing the data on a fast SSD is recommended.
- Download up-to-date StackOverflow data dumps from the Internet Archive (https://archive.org/download/stackexchange).
- Extract the files (e.g., with WinRAR)
- Generate the CSV files
Note that there are roughly 60 million posts, so this can take more than an hour. The CSV files are saved in a new folder called csvs.
python create_csvs.py <path to extracted xml files>
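The conversion can be sketched roughly as follows (this is an illustration, not the actual create_csvs.py): with tens of millions of rows, the dump should be streamed with `iterparse` rather than loaded whole. The attribute names follow the Stack Exchange dump schema; the field selection here is an assumption.

```python
import csv
import io
import xml.etree.ElementTree as ET

def posts_xml_to_csv(xml_file, csv_file, fields=("Id", "PostTypeId", "Title", "Body")):
    """Stream an XML dump file and write selected attributes as CSV rows."""
    writer = csv.writer(csv_file)
    writer.writerow(fields)
    rows = 0
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            writer.writerow(elem.get(f, "") for f in fields)
            rows += 1
            elem.clear()  # free memory for the processed element
    return rows

# Tiny inline example; the real script would read the extracted Posts.xml.
sample = io.StringIO('<posts><row Id="1" PostTypeId="1" Title="Q" Body="text"/></posts>')
out = io.StringIO()
print(posts_xml_to_csv(sample, out))  # prints the number of rows written
```

Calling `elem.clear()` after each row is what keeps memory flat across the full 60M-post dump.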
- Clean the data with
This can take up to 20 minutes.
python cleaning.py <path to csv files generated in previous step>
Download the 2015 version from https://example-data.neo4j.org/files/stackoverflow/stackoverflow-2015-08-22.csv.tar. While this is much faster than Option 1, these posts are old and truncated, so we recommend using the up-to-date posts from Option 1.
Although embeddings can be generated directly from Neo4j, generating them externally offers more flexibility and potentially better performance.
Generate embeddings for each post by running
python generate_embeddings.py <path to folder containing posts.csv>
Note that depending on your system this can take up to 10 hours.
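The core of such a script might look like the sketch below: read posts.csv, encode the text in batches, and append the vector as a '|'-delimited column (matching the `--array-delimiter='|'` used at import time). The column names and the model are assumptions; the real generate_embeddings.py may differ.

```python
import csv

def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def embed_posts(fin, fout, model, text_col="body", batch_size=256):
    """Copy a posts CSV, adding an 'embedding' column encoded in batches."""
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=list(reader.fieldnames) + ["embedding"])
    writer.writeheader()
    for batch in batched(reader, batch_size):
        vectors = model.encode([row[text_col] for row in batch])
        for row, vec in zip(batch, vectors):
            # '|'-delimited floats, so neo4j-admin import can read the array
            row["embedding"] = "|".join(f"{x:.6f}" for x in vec)
            writer.writerow(row)

# Hypothetical usage with sentence-transformers (model choice is an assumption):
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   with open("posts.csv") as fin, open("posts_with_embeddings.csv", "w") as fout:
#       embed_posts(fin, fout, model)
```

Batching is what dominates the runtime here; on a GPU, larger `batch_size` values usually cut the quoted 10 hours considerably.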
Set up the following directory structure in your project folder and copy the relevant CSV files to db/import:
project_folder/
├── docker-compose.yml
└── db/
├── data/
├── conf/
├── logs/
├── plugins/
└── import/
├── posts_with_embeddings.csv
├── users.csv
├── tags.csv
├── posts_rel.csv
├── posts_answers.csv
├── tags_posts_rel.csv
└── users_posts_rel.csv
- Ensure that you have Docker and Docker Compose installed on your system.
- Check and configure the settings in docker-compose.yml (e.g., port numbers). Don't forget to replace the password with a secure one before deploying in a production environment.
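For orientation, a minimal docker-compose.yml might look like the sketch below. The image tag, port mappings (7888 on the host for the Neo4j Browser), and password are assumptions to adjust for your setup; the volume paths mirror the directory structure above.

```yaml
services:
  neo4j:
    image: neo4j:5
    ports:
      - "7888:7474"   # Neo4j Browser on the host
      - "7687:7687"   # Bolt protocol
    environment:
      - NEO4J_AUTH=neo4j/changeme-secure-password
    volumes:
      - ./db/data:/data
      - ./db/logs:/logs
      - ./db/conf:/conf
      - ./db/plugins:/plugins
      - ./db/import:/var/lib/neo4j/import
```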
- Build and start the Neo4j container from the project folder in detached mode:
docker-compose up -d
Once the container is up and running, run
docker-compose stop
to stop the container. This is necessary since we can't import into an active database.
Then import the StackOverflow data into the Neo4j database with
docker-compose run --rm neo4j neo4j-admin database import full --id-type string --array-delimiter='|' --nodes=Post=/var/lib/neo4j/import/posts_with_embeddings.csv --nodes=User=/var/lib/neo4j/import/users.csv --nodes=Tag=/var/lib/neo4j/import/tags.csv --relationships=PARENT_OF=/var/lib/neo4j/import/posts_rel.csv --relationships=ANSWER=/var/lib/neo4j/import/posts_answers.csv --relationships=HAS_TAG=/var/lib/neo4j/import/tags_posts_rel.csv --relationships=POSTED=/var/lib/neo4j/import/users_posts_rel.csv --overwrite-destination=true --verbose
Note that this can take up to 30 minutes. Once it is done, restart the container with
docker-compose restart
Open a web browser and navigate to http://localhost:7888 to access your Neo4j database and inspect the schema.
If you are experiencing issues, check the Neo4j logs for detailed error messages:
docker-compose logs neo4j
Execute the following Cypher command to create a vector index on the post embeddings
CREATE VECTOR INDEX post_embedding_index IF NOT EXISTS
FOR (p:Post)
ON p.embedding
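If the index defaults don't match your data, you can add an OPTIONS clause specifying the vector dimension and similarity function. The helper below builds the full statement to paste into the Neo4j Browser; 384 dimensions is only an assumption (it fits, e.g., all-MiniLM-L6-v2) and must match whatever model generated the embeddings.

```python
def vector_index_cypher(dimensions, similarity="cosine"):
    """Build the CREATE VECTOR INDEX statement with an explicit OPTIONS clause."""
    return (
        "CREATE VECTOR INDEX post_embedding_index IF NOT EXISTS\n"
        "FOR (p:Post)\n"
        "ON p.embedding\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )

print(vector_index_cypher(384))
```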
- Install the Python bindings for llama.cpp to access its API through LangChain
pip install llama-cpp-python
- Download an LLM model from https://huggingface.co/models. It should be in the .gguf format; otherwise you will need to convert it (see https://github.com/ggerganov/llama.cpp for converters)
Run a vLLM server in the background (requires Linux); below is an example:
python -m vllm.entrypoints.openai.api_server \
--model microsoft/Phi-3-mini-128k-instruct \
--max-model-len 4096 \
--dtype bfloat16 \
--gpu-memory-utilization 0.90 \
--port=8002 \
--trust-remote-code \
--disable-log-stats
Depending on your system you might want to change the parameters.
If you are using llama.cpp run:
python hybridsearch.py llamacpp <path_to_your_model_file>
If you are using vLLM run:
python hybridsearch.py vllm <name_of_the_model>
Once the graph RAG model has started, you can use a browser to access the Gradio interface it generates at http://localhost:7860.
Once a question is submitted, the script searches the posts in the database, comparing their embeddings with that of the question using two methods:
- Vector search for the most similar posts using LangChain's similarity_search_with_score() method for Neo4j, returning the most similar posts directly
- Vector search using a native Neo4j vector query, then a graph traversal to fetch the accepted answers to the posts most similar to the question
The LLM is given two prompts: one without context and one with the context provided by the posts returned by the search.
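The second (native) retrieval path can be sketched as a Cypher query against the vector index, then a hop across the graph. The helper below builds that query; the direction of the ANSWER relationship and the `title`/`body` property names are assumptions about the imported schema, not confirmed by the source.

```python
def accepted_answer_query(index_name="post_embedding_index", k=5):
    """Build a Cypher query: vector search over posts, then walk to answers."""
    return (
        f"CALL db.index.vector.queryNodes('{index_name}', {k}, $embedding) "
        "YIELD node, score "
        "MATCH (answer:Post)-[:ANSWER]->(node) "
        "RETURN node.title AS question, answer.body AS answer, score "
        "ORDER BY score DESC"
    )

# Hypothetical usage with the official driver, passing the question's
# embedding as the $embedding parameter:
#   driver.execute_query(accepted_answer_query(), embedding=question_vector)
print(accepted_answer_query())
```

`db.index.vector.queryNodes` is Neo4j's built-in procedure for querying a vector index; the surrounding MATCH is what makes this "graph" RAG rather than plain vector retrieval.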