Embed Arxiv is a tool that suggests relevant scientific articles based on a user's interests. The project downloads metadata from arXiv, generates vector embeddings for the articles with an embedding model, and uses cosine similarity to recommend similar articles.
- Metadata Download: collect metadata for scientific articles from arXiv.
- Embedding Generation: use a pre-trained embedding model to generate vector representations of the articles.
- Cosine Similarity Calculation: compute cosine similarity between article vectors to find and recommend relevant articles, using Milvus as the vector database (see the sketch below).
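At its core, the recommendation step is a nearest-neighbour search by cosine similarity over the article embeddings. Below is a minimal NumPy sketch of that idea; the repository delegates this to Milvus at scale, and the array shapes and query vector here are purely illustrative:

```python
import numpy as np

# Illustrative data: rows are article embeddings, L2-normalised so that
# cosine similarity reduces to a dot product.
article_embeddings = np.random.rand(1000, 768).astype(np.float32)
article_embeddings /= np.linalg.norm(article_embeddings, axis=1, keepdims=True)

# Embedding of the article (or interest description) to recommend against.
query = np.random.rand(768).astype(np.float32)
query /= np.linalg.norm(query)

# Cosine similarity of the query against every article in one matrix-vector product.
scores = article_embeddings @ query

# Indices of the top-5 most similar articles, best first.
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```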
Tested on Linux with CUDA 12; the requirements and environment files target the same setup.
Satisfy the prerequisites with your preferred package manager:
`pip install -r requirements.txt`
or
`conda env create -f environment.yml`
Run the files in the following order:
1. `download_arxiv_metadata.py`: download the metadata from Kaggle.
2. `process_metadata.py`: convert the JSON metadata to a Parquet file for efficient computation.
3. `split_metadata_by_year.py`: split the metadata by year for batched computation and a multiprocessing speedup.
4. `embed_abstract.py`: embed the paper abstracts using the `Alibaba-NLP/gte-base-en-v1.5` model (a sketch of this step follows the list).
5. `combine_abstract_embeddings.py`: join all the split files into one.
6. `transform_abstract_embeddings_milvus.py`: generate files that Milvus understands.

Additionally, `embed_all_mxbai_embed_large_v1.py` embeds the title, abstract, and full-text of each article.
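For orientation, here is a minimal sketch of what the abstract-embedding step looks like, assuming a per-year Parquet split with an `abstract` column; the file and column names are illustrative, and `embed_abstract.py` remains the authoritative implementation:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load one of the per-year metadata splits (file and column names are illustrative).
df = pd.read_parquet("metadata_2023.parquet")

# gte-base-en-v1.5 needs trust_remote_code for its custom model definition.
model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

# Encode abstracts in batches; normalised vectors make cosine similarity a dot product.
embeddings = model.encode(
    df["abstract"].tolist(),
    batch_size=64,
    normalize_embeddings=True,
    show_progress_bar=True,
)

df["embedding"] = list(embeddings)
df.to_parquet("embeddings_2023.parquet")
```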
Contributions are welcome! Please fork the repository and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- Embed title and full-text articles created from mitanshu7/scientific_dataset_arxiv using Alibaba-NLP/gte-base-en-v1.5 (Max Tokens=8192, Embedding dimensions=768).
- Embed title, abstract, and full-text articles created from mitanshu7/scientific_dataset_arxiv using Alibaba-NLP/gte-large-en-v1.5 (Max Tokens=8192, Embedding dimensions=1024) for its larger embedding dimensions.
- Create a website for recommending scientific articles using these embeddings.
- Test environment creation.
- Find the abstract-embedded dataset (covering papers up to mid-2024) here: bluuebunny/arxiv_embeddings_Alibaba-NLP_gte-base-en-v1.5, created with the model Alibaba-NLP/gte-base-en-v1.5 (Max Tokens=8192, Embedding dimensions=768).
- Find the title, abstract, and full-text embedded dataset (covering papers up to 2023) here: bluuebunny/embedded_arxiv_dataset_by_year_mxbai-embed-large-v1, created with the model mixedbread-ai/mxbai-embed-large-v1 (Max Tokens=512, Embedding dimensions=1024).
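If you only need the precomputed embeddings, the Hugging Face dataset repositories above can be fetched with `huggingface_hub`. A minimal sketch, assuming the repositories contain Parquet files (the exact file layout inside each repo may differ):

```python
from pathlib import Path

import pandas as pd
from huggingface_hub import snapshot_download

# Download the abstract-embedding dataset repository locally.
local_dir = snapshot_download(
    repo_id="bluuebunny/arxiv_embeddings_Alibaba-NLP_gte-base-en-v1.5",
    repo_type="dataset",
)

# Load whichever Parquet files are present (likely split by year).
parquet_files = sorted(Path(local_dir).rglob("*.parquet"))
df = pd.concat(pd.read_parquet(p) for p in parquet_files)
print(df.shape)
```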