
Embed Arxiv

Overview

Embed Arxiv is a tool that suggests relevant scientific articles based on a user's interests. The project downloads metadata from ArXiv, generates vector embeddings for the articles with an embedding model, and uses cosine similarity to recommend similar articles.

Features

  • Metadata Download: Collect metadata for scientific articles from ArXiv.
  • Embedding Generation: Use a pre-trained embedding model to generate vector representations of the articles.
  • Cosine Similarity Calculation: Compute cosine similarity between article vectors to find and recommend relevant articles, using Milvus as the vector database (see the sketch after this list).
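
The recommendation idea reduces to a nearest-neighbor search in embedding space. Here is a minimal sketch of that idea in plain NumPy, assuming a precomputed matrix of abstract embeddings; the file name and shapes are illustrative, not from this repo (the actual pipeline stores vectors in Milvus):

```python
# Minimal sketch of cosine-similarity recommendation over precomputed
# abstract embeddings. File name and indices are hypothetical.
import numpy as np

def cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

embeddings = np.load("abstract_embeddings.npy")  # hypothetical (n_papers, dim) file
query = embeddings[0]                            # embedding of a paper the user liked
scores = cosine_similarity(query, embeddings)
top_10 = np.argsort(scores)[::-1][1:11]          # skip index 0, the query itself
print(top_10)
```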

Installation

Disclaimer

Tested on Linux with CUDA 12; the requirements and environment files assume the same setup.

Prerequisites

Install the prerequisites with your preferred package manager: pip install -r requirements.txt or conda env create -f environment.yml

Usage

Run the files in the following order:

  1. download_arxiv_metadata.py to download the metadata snapshot from Kaggle.
  2. process_metadata.py to convert the JSON metadata to a Parquet file for efficient computation (see the first sketch after this list).
  3. split_metadata_by_year.py to split the metadata by year for batched computation and a multiprocessing speedup.
  4. embed_abstract.py to embed the abstracts of the papers using the Alibaba-NLP/gte-base-en-v1.5 model (see the second sketch).
  5. combine_abstract_embeddings.py to join all the split files into one.
  6. transform_abstract_embeddings_milvus.py to generate files in a format Milvus understands (see the third sketch).
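
For orientation, here is roughly what step 2 does, assuming the Kaggle snapshot file arxiv-metadata-oai-snapshot.json (one JSON object per line); the column selection is an illustrative guess, not the script's exact output:

```python
# Hedged sketch of step 2: JSON Lines -> Parquet with pandas.
# Assumes the Kaggle arXiv snapshot; the kept columns are a guess.
import pandas as pd

df = pd.read_json("arxiv-metadata-oai-snapshot.json", lines=True)
df = df[["id", "title", "abstract", "update_date"]]  # keep what the pipeline needs
df.to_parquet("arxiv_metadata.parquet", index=False)
```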
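A minimal sketch of the embedding step (step 4) with sentence-transformers follows; the file names and batch size are assumptions. Note that gte-base-en-v1.5 requires trust_remote_code=True:

```python
# Hedged sketch of step 4: embed abstracts from one per-year split file.
# Input/output paths are hypothetical.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

df = pd.read_parquet("metadata_2023.parquet")  # hypothetical split file from step 3
vectors = model.encode(df["abstract"].tolist(),
                       batch_size=64, show_progress_bar=True)
df["vector"] = list(vectors)                   # one 768-dim vector per abstract
df.to_parquet("embeddings_2023.parquet", index=False)
```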
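Finally, a sketch of how the transformed embeddings might be loaded and queried in Milvus via pymilvus; the collection name, schema, and local Milvus Lite file are assumptions, not the repo's actual configuration:

```python
# Hedged sketch: insert embeddings into Milvus and run a similarity search.
# Collection name and schema are illustrative assumptions.
import pandas as pd
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")  # Milvus Lite, backed by a local file
client.create_collection("arxiv_abstracts", dimension=768)  # quick setup

df = pd.read_parquet("embeddings_2023.parquet")  # hypothetical output of step 4
rows = [
    {"id": i, "vector": [float(x) for x in vec], "title": title}
    for i, (vec, title) in enumerate(zip(df["vector"], df["title"]))
]
client.insert(collection_name="arxiv_abstracts", data=rows)

# Recommend articles similar to the first one.
hits = client.search("arxiv_abstracts", data=[rows[0]["vector"]], limit=10)
print(hits)
```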

Extras

  1. embed_all_mxbai_embed_large_v1.py to embed the title, abstract, and full text of each article using the mixedbread-ai/mxbai-embed-large-v1 model.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

TO-DO

Results

  1. Abstract embeddings for papers through mid-2024: bluuebunny/arxiv_embeddings_Alibaba-NLP_gte-base-en-v1.5, generated with Alibaba-NLP/gte-base-en-v1.5 (max tokens = 8192, embedding dimension = 768).
  2. Title, abstract, and full-text embeddings for papers through 2023: bluuebunny/embedded_arxiv_dataset_by_year_mxbai-embed-large-v1, generated with mixedbread-ai/mxbai-embed-large-v1 (max tokens = 512, embedding dimension = 1024).