This repository is an extended version of Taylor Denouden's bb-ai-oceanography project. While the original repository focused on analyzing paper abstracts using Large Language Models (LLMs), this fork expands the scope to include full-text analysis and fully-cited report generation of scientific papers related to machine learning applications in oceanography. Report generation could absolutely benefit from the work previously done to identify topic areas, e.g. the report queries could focus on specific topic areas.
- Project Overview
- Key Features
- Getting Started
- Report Configuration
- Usage
- Project Structure
- Acknowledgments
The project aims to provide a comprehensive analysis of the current state of machine learning in oceanography by:
- Downloading full-text scientific papers from various sources
- Processing and extracting relevant information from the papers using paperetl
- Analyzing the content using natural language processing techniques with paperai
- Generating reports to identify trends and patterns in the field
- Multi-source paper retrieval (Elsevier, Springer, Wiley, arXiv, Unpaywall)
- Full-text extraction from PDFs and XMLs using GROBID and paperetl
- Robust error handling and logging
- Rate-limited API calls to respect usage policies
- Machine learning-based analysis and report generation using paperai
Changes made to paperetl
:
- Extracting paragraphs instead of sentences for improved summary quality
- Improved error handling
Changes made to paperai
:
- One of the major improvements to
paperai
's report generation made in this repository is the addition of a properly cited summary section. This is done by using an LLM (either through the API or locally) to generate the summary given a query. See below for an example of what the summary looks like given the query "emerging trends or future directions in machine learning for ocean sciences":
Example Summary
Machine learning (ML) is rapidly becoming a valuable tool in ocean sciences, offering solutions to complex problems that traditional methods struggle with. ML's ability to handle vast datasets, identify patterns, and make predictions is particularly suited for the oceanographic realm, where data is increasingly abundant and complex.
"Machine learning uses dynamic models to make data-driven choices, and ML approaches may be used to high-dimensional, complicated, non-linear, and big-data problems." (Sunkara et al., 2023) It is proving effective in addressing issues like ocean acidification, sea-level rise, and the impacts of climate change on marine ecosystems.
The Southern Ocean, known for its intricate circulation patterns, is a prime example of where ML is making a significant impact. "Recently, machine learning methods are being used to fuel progress within prediction, numerical modelling and beyond, speeding up and refining existing tasks (see review in refs.18,19)." (Sonnewald et al., 2023)
ML's strength lies in its ability to classify complex features in oceanic systems. "Machine learning is a tool that can be leveraged to classify dynamic and complex features in oceanic systems(Jones et al., 2014;Gangopadhyay et al., 2015)." (Phillips et al., 2020) This is particularly valuable for tasks like identifying ocean currents, predicting weather patterns, and understanding the distribution of marine species.
While ML shows immense promise, researchers acknowledge the need for further development and application. "The move from multi-spectral to hyperspectral remote sensing has resulted in the availability of a larger amount of information and more degrees of freedom for improving quantification of PIC. This increase in information can make it challenging for a human to find patterns and analyze the data, but opens the door to machine learning approaches." (Balch et al., 2023)
"In the future, with the explosive growth of marine big data, making efficient use of oceanic big data is still an important research direction. We will continue to explore faster, more accurate and more complex prediction models for ocean hydrological data with a combination of in-depth learning and integrated learning." (Yang et al., 2019) As technology advances and datasets grow, ML will undoubtedly play an increasingly crucial role in unraveling the mysteries of the ocean and ensuring its sustainable future.
- Python 3.7+
- Docker (recommended)
- NVIDIA GPU (optional, for local models)
- One of the following:
- OpenAI API key (recommended)
- Hugging Face account (for API or local models)
- Local GPU for running models (optional)
There are three main ways to use this project, listed in order of recommendation:
Option 1: Docker with OpenAI API (Recommended)
-
Clone the repository:
git clone https://github.com/Spiffical/bb-ai-oceanography.git cd bb-ai-oceanography
-
Create a
.env
file in the project root with your OpenAI API key:OPENAI_API_KEY=your_openai_api_key
-
Build the Docker image:
Ensure Docker is installed and running, then run:
docker build -f docker/Dockerfile.api -t paperai-api .
Option 2: Docker with Local Models
Choose this option if you want to run models locally without API costs.
-
Clone and enter the repository as shown above
-
Choose your preferred local model provider:
A. Using Ollama (Easier)
- Build the Docker image:
docker build -f docker/Dockerfile.gpu -t paperai-gpu .
B. Using Hugging Face (More flexible)
- Create a Hugging Face account and get your access token
- Add to your
.env
file:HUGGING_FACE_HUB_TOKEN=your_token
- Build the Docker image:
docker build -f docker/Dockerfile.gpu -t paperai-gpu .
- Build the Docker image:
Option 3: Local Installation
-
Clone the repository:
git clone https://github.com/Spiffical/bb-ai-oceanography.git cd bb-ai-oceanography
-
Install the required packages:
pip install -r requirements.txt
-
Install paperetl and paperai:
cd paperetl pip install . cd ../paperai pip install . cd ..
-
Set up environment variables if you are downloading papers from the Elsevier, Springer, and Wiley APIs: Create a
.env
file in the project root and add your API keys and email address for Unpaywall:ELSEVIER_API_KEY=your_elsevier_api_key SPRINGER_API_KEY=your_springer_api_key WILEY_API_KEY=your_wiley_api_key UNPAYWALL_EMAIL=your_email_address
Reports are configured through YAML files in the reports/
directory. Each configuration file defines how the report should be generated and what content to analyze.
Here's an example of a report configuration file:
name: Your_Report_Name
options:
# General settings
topn: 100
render: md
qa: "deepset/roberta-base-squad2"
generate_summary: true
# Choose your mode
llm_mode: "api" # "api" or "local"
# API Settings (if llm_mode is "api")
api:
provider: "openai" # "openai" or "huggingface"
model: "gpt-4o-mini" # or other API models
# Local Settings (if llm_mode is "local")
local:
provider: "ollama" # "ollama" or "huggingface"
model: "mistral:instruct" # See supported models below
gpu_strategy: "auto" # For HuggingFace models
sections:
Your_Section_Name:
query: your search query here
columns:
- name: Date
- name: Study
- {name: Custom_Column, query: specific search terms, question: what specific information to extract}
Key components:
name
: The name of the reportoptions
: General settings for the reporttopn
: Number of results (paragraphs from the database) to userender
: Output format (md for markdown)qa
: Model to use for question answeringgenerate_summary
: Whether to include an LLM-generated summaryllm_mode
: Choose between API or local models- Mode-specific settings for API or local model usage
sections
: Define what content to analyzequery
: Main search query for this sectioncolumns
: What information to extract and how to organize it- Simple columns just need a name
- Complex columns can include specific queries and questions
Supported Models:
- OpenAI API: gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, etc. (see OpenAI API Models)
- Ollama: mistral:instruct, gemma:7b, llama2:7b, etc. (see Ollama Models)
- Hugging Face: google/gemma-2-9b-it, mistralai/Mistral-7B-Instruct-v0.2, etc. (see Hugging Face Models)
Example sections from the default report include:
- ML Applications in ocean sciences
- Research gaps and challenges
- Emerging trends
See the reports/
directory for complete examples.
To generate a report using a pre-processed paperetl embeddings model:
Using OpenAI API
# On Linux/Mac
docker run --rm --env-file ".env" -v "$(pwd):/work" paperai-api -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
# On Windows PowerShell
docker run --rm --env-file ".env" -v "${PWD}:/work" paperai-api -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
Using Local Models with GPU
# On Linux/Mac
docker run --rm --env-file ".env" --gpus all -v "$(pwd):/work" paperai-gpu -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
# On Windows PowerShell
docker run --rm --env-file ".env" --gpus all -v "${PWD}:/work" paperai-gpu -m paperai.report /work/reports/report_file.yml /work/path/to/your/model
Using Local Installation
python -m paperai.report reports/report_file.yml path/to/your/model
Replace:
path/to/your/model
with the path to your embeddings model directoryreports/report_file.yml
with the path to your report configuration file
For example, if you want to use the OpenAI API, your embeddings model is in paperetl/models/pdf-oceanai
, your report configuration file is in reports/report_oceans_gaps.yml
, and you are currently in the bb-ai-oceanography
directory:
docker run --rm -v "$(pwd):/work" paperai-api python -m paperai.report /work/reports/report_oceans_gaps.yml /work/paperetl/models/pdf-oceanai
If you need to process new papers or build the database from scratch, follow these additional steps:
- Download papers:
python scripts/download_papers.py --csv path/to/your/doi_list.csv output_path
Or for a single DOI:
python scripts/download_papers.py --doi 10.1016/j.example.2023.123456 output_path
- Start GROBID with the custom configuration:
sudo docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 -v /path/to/bb-ai-oceanography/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0
Replace /path/to/bb-ai-oceanography
with the actual path to your project directory.
- In a separate terminal, run paperetl to extract content and create an SQLite database:
python -m paperetl.file /path/to/pdfs /path/to/output
Replace /path/to/pdfs
with the directory containing your downloaded PDFs, and /path/to/output
with the desired output directory for the SQLite database.
- Index the extracted content:
python -m paperai.index /path/to/output
Use the same /path/to/output
as in the paperetl step.
- Generate a report using a YAML configuration file (see Report Configuration for details):
python -m paperai.report /path/to/report_config.yml 100 md /path/to/output
Replace:
/path/to/report_config.yml
with the path to your report configuration file (examples can be found in thereports
directory)./path/to/output
with the same/path/to/output
as in the paperetl step.
scripts/
: Contains the main scripts for downloading papersutils/
: Utility functions for API calls, file handling, and PDF processingnotebooks/
: Jupyter notebooks for data analysis and visualizationconfig/
: Configuration files, including the custom GROBID configurationpaperetl/
: The modified paperetl package for extracting content from PDFspaperai/
: The modified paperai package for analyzing and generating reportsdocker/
: Docker configuration files for building and running the projectreports/
: Report configuration files for generating reports
- This work builds upon the initial analysis by Leland McInnes at the Tutte Institute, and Taylor Denouden's bb-ai-oceanography project
- Thanks to all the publishers and platforms providing access to scientific literature
- GROBID for PDF content extraction
- paperetl and paperai for content processing and analysis