This is the official repository for OpenResearcher.
Note: This repository is actively maintained and regularly updated to provide the latest features and improvements.
Welcome to OpenResearcher, an advanced Scientific Research Assistant designed to provide a helpful answer to a research query.
With access to the arXiv corpus, OpenResearcher is can provide you with the latest scientific insights.
Explore the frontiers of science with OpenResearcherβwhere answers await.
We release the benchmarking results on various RAG-related system as a leaderboard.
Models | Correctness | Richness | Relevance | ||||||
---|---|---|---|---|---|---|---|---|---|
(Compared to Perplexity) | Win | Tie | Lose | Win | Tie | Lose | Win | Tie | Lose |
iAsk.Ai | 2 | 16 | 12 | 12 | 6 | 12 | 2 | 8 | 20 |
You.com | 3 | 21 | 6 | 9 | 5 | 16 | 4 | 13 | 13 |
Phind | 2 | 26 | 2 | 15 | 7 | 8 | 5 | 13 | 12 |
Naive RAG | 1 | 22 | 7 | 14 | 8 | 8 | 5 | 16 | 9 |
OpenResearcher | 10 | 13 | 7 | 25 | 4 | 1 | 15 | 13 | 2 |
We used human experts to evaluate the responses from various RAG systems. If one answer was significantly better than another, it was judged as a win for the former and a lose for the latter. If the two answers were of similar quality, it was judged as a tie.
Models | Richness | Relevance | ||||
---|---|---|---|---|---|---|
(Compared to Perplexity) | Win | Tie | Lose | Win | Tie | Lose |
iAsk.Ai | 42 | 0 | 67 | 38 | 0 | 71 |
You.com | 15 | 0 | 94 | 16 | 0 | 93 |
Phind | 52 | 1 | 56 | 54 | 0 | 55 |
Naive RAG | 41 | 1 | 67 | 57 | 0 | 52 |
OpenResearcher | 62 | 2 | 45 | 74 | 0 | 35 |
GPT-4 Preference Results compared with Perplexity AI outcome.
To begin using OpenResearcher, you need to install the required dependencies. You can do this by running the following command:
git clone https://github.com/GAIR-NLP/OpenResearcher.git
conda create -n openresearcher python=3.10
conda activate openresearcher
cd OpenResearcher
pip install -r requirements.txt
First, download the latest Qdrant image from Dockerhub:
docker pull qdrant/qdrant
Then, run the service:
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
For more Qdrant installation details, you can follow this link.
OpenResearcher currently supports API models from OpenAI, Deepseek, and Aliyun, as well as most huggingface models supported by vllm.
Modify the API and base URL values in the config.py file located in the root directory to use large language model service platforms that support the OpenAI interface
For example, if you use Deepseek as API provider, and then modify the following value in config.py
::
...
openai_api_base_url = "https://api.deepseek.com/v1"
openai_api_key = "api key here"
...
Please use vllm to setup the API server for open source LLMs. For example, use the following command to deploy a Llama 3 70B hosted on HuggingFace:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 8 \
--dtype auto \
--api-key sk-dummy \
--gpu-memory-utilization 0.9 \
--port 5000
Then we can initialize the chat-llm with config.py
:
...
openai_api_base_url = "http://localhost:5000/v1"
openai_api_key = "sk-dummy"
...
We currently support Bing Search in OpenResearcher. Modify the following value in config.py
:
...
bing_search_key = "api key here"
bing_search_end_point = "https://api.bing.microsoft.com/"
...
1. Download arXiv data (html file) and metadata into the /data
β arXiv data refers to https://info.arxiv.org/help/bulk_data/index.html
β Metadata refers to https://www.kaggle.com/datasets/Cornell-University/arxiv
The directory of data
is formatted as follows:
- data/
- 2401/ # pub date
- 2401.00001/ # paper id
- doc.html # paper content
- 2401.00002/
- doc.html
- 2402/
...
-arxiv-metadata-oai-snapshot.jsonl # metadata
2. Parse the html data
CUDA_VISIBLE_DEVICES=0 python -um connector.html_parsing --target_dir /path/to/target/directory --start_index 0 --end_index -1 \
--meta_data_path /path/to/metadata/file
Parameter explaination:
β target_dir: process the 'target_dir' papers
β start_index,end_index: papers in directory from 'start_index' to 'end_index' will be processed
β meta_data_path: metadata saved path
First run the Qdrant retriever server:
python -um utils.async_qdrant_retriever
Then run the Elastic Search retriever server:
python -um utils.async_elasticsearch_retriever
Then you can run the OpenResearcher system by following command:
CUDA_VISIBLE_DEVICES=0 streamlit run ui_app.py
If this work is helpful, please kindly cite as:
@misc{zheng2024openresearcherunleashingaiaccelerated,
title={OpenResearcher: Unleashing AI for Accelerated Scientific Research},
author={Yuxiang Zheng and Shichao Sun and Lin Qiu and Dongyu Ru and Cheng Jiayang and Xuefeng Li and Jifan Lin and Binjie Wang and Yun Luo and Renjie Pan and Yang Xu and Qingkai Min and Zizhao Zhang and Yiwen Wang and Wenjie Li and Pengfei Liu},
year={2024},
eprint={2408.06941},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.06941},
}