Create a Python virtual environment with conda by issuing the command:
conda env create -f environment.yaml
To crawl the abstract texts from one or more arXiv pages:
- Populate the list start_urls in crawler/crawler/spiders/arxiv.py with the URLs of paper abstract pages (e.g., "https://arxiv.org/abs/2102.09105").
- Change the current working directory to crawler and issue the command:
scrapy crawl Arxiv -o papers.csv -t csv
- The output CSV file will be stored under the current working directory.
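If you have a list of arXiv identifiers rather than full URLs, a small helper can build the entries for start_urls. The following is only a sketch: abstract_urls and ARXIV_ID are hypothetical names that are not part of this repository, and the pattern assumes new-style identifiers such as 2102.09105 (optionally with a version suffix like v2).

```python
import re

# Hypothetical helper, not part of the repository: it turns new-style arXiv
# identifiers into abstract-page URLs that can be pasted into start_urls.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def abstract_urls(arxiv_ids):
    """Return abstract-page URLs for the given arXiv identifiers."""
    urls = []
    for arxiv_id in arxiv_ids:
        arxiv_id = arxiv_id.strip()
        # Reject anything that does not look like a new-style identifier,
        # so a typo fails here rather than during the crawl.
        if not ARXIV_ID.match(arxiv_id):
            raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
        urls.append(f"https://arxiv.org/abs/{arxiv_id}")
    return urls


start_urls = abstract_urls(["2102.09105"])
print(start_urls)  # ['https://arxiv.org/abs/2102.09105']
```

The resulting list can be copied into the spider's start_urls attribute by hand.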
Having created a CSV file holding the title and abstract of each paper crawled from arXiv, you can now use the summarizer to generate a summary of each paper.
Before doing so, visit the GitHub repository of Llama2 and follow the instructions there to set up Llama. You may need to install PyTorch in your virtual environment beforehand.
Then, issue the command:
torchrun --nproc_per_node 1 summarizer/summarize_abstract.py --in_csv {PATH_TO_CSV_FILE}
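Because the summarizer reads whatever the crawler wrote, it may be worth checking the CSV before starting a model run. The sketch below is not part of the repository: load_papers is a hypothetical helper, and the column names title and abstract are an assumption based on the description above, so adjust them to match the header of your papers.csv.

```python
import csv


def load_papers(csv_path, required=("title", "abstract")):
    """Read the crawler output and keep rows where every required field is non-empty.

    The column names in `required` are assumptions; change them if the
    header of your CSV file differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [name for name in required if name not in header]
        if missing:
            raise ValueError(f"missing columns in {csv_path}: {missing}")
        # DictReader fills short rows with None, hence the `or ""`.
        return [
            row for row in reader
            if all((row[name] or "").strip() for name in required)
        ]
```

For example, len(load_papers("papers.csv")) gives the number of papers that have both fields filled in, which should match the number of URLs you crawled.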