A deep language model, GPT-2, is trained on scientific manuscripts from arXiv. This pilot study uses abstracts from ~2.1M articles as training data in order to explore correlations in scientific literature from a language modelling perspective. Language models are algorithms that generate sequences of numbers corresponding to tokens or words, which can be used to represent sentences. The text samples are fed into the GPT-2 117M and 774M models, which are fine-tuned for ~500,000 steps. After training, the language model is used to generate embeddings for each manuscript, which can be clustered for visualization applications and queried for entity searches.
Get started fast:
from transformers import pipeline
ai = pipeline('text-generation',model='pearsonkyle/gpt2-arxiv', tokenizer='gpt2', config={'max_length':1600})
machina = lambda text: ai(text)[0]['generated_text']
A few generated samples are below:
- We can remotely sense an atmosphere by observing its reflected, transmitted, or emitted light in varying geometries. This light will contain information on the planetary conditions including
temperature, pressure, composition, and cloud optical thickness. One such property that is important is...
- The reflectance of Earth's vegetation suggests
that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...
- Directly imaged exoplanets probe
key aspects of planet formation and evolution theory, as well as atmospheric and interior physics. These insights have led to numerous direct imaging instruments for exoplanets, many using polarimetry. However, current instruments take
For the large model (GPT-2 774M) use: pearsonkyle/gpt2-arxiv-large
(Coming soon...)
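The generated samples above often stop mid-sentence because generation ends at a fixed token budget. A minimal sketch of a post-processing step is shown here: it wraps any text-generation callable that returns the same `[{'generated_text': ...}]` structure as the quick-start pipeline and trims the output to its last complete sentence. The helper names `trim_to_sentence` and `complete` are hypothetical and not part of this repo, and the sentence detection is deliberately naive (abbreviations such as "e.g." will confuse it).

```python
def trim_to_sentence(text, min_length=0):
    """Cut text after its last '.', '!' or '?' if that mark lies at or beyond min_length."""
    cut = max(text.rfind(mark) for mark in ".!?")
    # Leave the text unchanged when no sentence end exists past min_length
    return text[: cut + 1] if cut >= min_length else text


def complete(prompt, generator, **generate_kwargs):
    """Run a text-generation callable and trim its output to a full sentence."""
    text = generator(prompt, **generate_kwargs)[0]["generated_text"]
    # min_length protects the prompt itself from being truncated
    return trim_to_sentence(text, min_length=len(prompt))


# Usage with the model from the quick start (downloads weights on first run):
# from transformers import pipeline
# generator = pipeline("text-generation", model="pearsonkyle/gpt2-arxiv")
# print(complete("Directly imaged exoplanets probe", generator,
#                max_length=60, do_sample=True, top_k=50))
```

Passing the generator in as an argument keeps the helper independent of the model, so the same code should work with the large checkpoint once it is available.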
Dependencies
- Create a new virtual environment, e.g.
conda create -n nlp python=3.9
- Activate the environment
conda activate nlp
- Install pytorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
- Install the remaining packages
pip install transformers sqlalchemy sqlalchemy_utils pyuser_agent tqdm ipython jupyter datasets ftfy clean-text unidecode
Make sure the installed sqlalchemy version is below 2.0 (for example, pip install "sqlalchemy<2").
Training data
- ~1.7 million abstracts from arXiv, which you can download on Kaggle.
- Additionally, we include a script to query the NASA Astrophysics Data System (ADS).
These data are converted into a SQL database, which can be sorted and queried quickly, allowing for the easy export of training datasets.
Pre-processing for text generation/embedding
After downloading the arXiv JSON file above, run the following to convert it into a SQLite database:
python database.py
to create the SQL database
python json2db.py
to populate the database with the JSON data from arXiv
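To illustrate what the conversion step involves, here is a minimal sketch using only the standard library rather than sqlalchemy. It assumes the Kaggle file holds one JSON object per line with `id`, `title`, `abstract` and `categories` keys; the `papers` table and the function names are hypothetical and may not match the schema that database.py and json2db.py actually create.

```python
import json
import sqlite3


def create_db(path=":memory:"):
    """Open a SQLite database and create a (hypothetical) papers table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "id TEXT PRIMARY KEY, title TEXT, abstract TEXT, categories TEXT)"
    )
    return conn


def load_json_lines(conn, lines):
    """Insert one record per JSON line; return how many new rows were added."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append((
            record["id"],
            record.get("title", ""),
            record.get("abstract", ""),
            record.get("categories", ""),
        ))
    before = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
    with conn:  # commit on success, roll back on error
        # OR IGNORE skips ids that are already in the table
        conn.executemany("INSERT OR IGNORE INTO papers VALUES (?, ?, ?, ?)", rows)
    after = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
    return after - before


sample = [
    '{"id": "0001", "title": "A", "abstract": "First abstract.", "categories": "astro-ph.EP"}',
    '{"id": "0002", "title": "B", "abstract": "Second abstract.", "categories": "cs.CL"}',
]
conn = create_db()
added = load_json_lines(conn, sample)
```

Because `load_json_lines` accepts any iterable of strings, an open file handle for the downloaded JSON file can be passed in directly and is read line by line. For a file with millions of records, inserting in batches instead of building one large list would likely be gentler on memory.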
Modify the training data
Use the NASA Astrophysics Data System (ADS) to add more abstracts to the database based on a keyword search. See query_ads.py -h for more details. You will need to sign up for an account on ADS and request an API key.
python query_ads.py -q "transiting exoplanets"
to add entries
python clean_db.py
to remove entries based on keywords in the abstract
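Keyword-based cleaning can be sketched as a single SQL DELETE per keyword. The example below is an illustration under assumptions, not the logic of clean_db.py: the `papers` table with an `abstract` column and the function name `remove_by_keywords` are hypothetical, and matching is a plain case-insensitive substring test.

```python
import sqlite3


def remove_by_keywords(conn, keywords):
    """Delete rows whose abstract contains any keyword; return the number removed."""
    removed = 0
    with conn:  # one transaction for all deletions
        for keyword in keywords:
            # instr() is a substring search; lower() makes it case-insensitive for ASCII
            cursor = conn.execute(
                "DELETE FROM papers WHERE instr(lower(abstract), lower(?)) > 0",
                (keyword,),
            )
            removed += cursor.rowcount
    return removed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id TEXT PRIMARY KEY, abstract TEXT)")
conn.executemany("INSERT INTO papers VALUES (?, ?)", [
    ("1", "We detect a transiting exoplanet."),
    ("2", "This paper has been WITHDRAWN by the authors."),
    ("3", "Erratum: corrected Table 2."),
])
removed = remove_by_keywords(conn, ["withdrawn", "erratum"])
remaining = [row[0] for row in conn.execute("SELECT id FROM papers")]
```

Substring matching is blunt: a keyword such as "ion" would also remove abstracts containing "emission", so short keywords deserve care.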
Train on a custom dataset
Train a language model using the commands below:
python db2txt.py
to create a text file with one abstract per line; this script also cleans up various characters in the abstracts
python train.py
to train a GPT-2 model
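The "one abstract per line" format matters because abstracts often contain internal line breaks that would otherwise split a sample in two. The sketch below shows one plausible cleaning pass using only the standard library; it is a stand-in for, not a copy of, what db2txt.py does (the dependency list suggests that script may rely on ftfy, clean-text and unidecode), and the function names are hypothetical.

```python
import re
import unicodedata


def clean_abstract(text):
    """Normalise Unicode and collapse all whitespace so the abstract fits on one line."""
    # NFKC folds ligatures and non-breaking spaces into plain characters
    text = unicodedata.normalize("NFKC", text)
    # Newlines, tabs and repeated spaces all become a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def abstracts_to_lines(abstracts):
    """Join cleaned abstracts with newlines, dropping any that end up empty."""
    cleaned = (clean_abstract(a) for a in abstracts)
    return "\n".join(line for line in cleaned if line)


text = abstracts_to_lines([
    "  We ﬁnd\n that  hot Jupiters\u00a0are inflated. ",
    "   ",
    "Language models\tgenerate text.",
])
```

Writing `text` to a file then yields a training corpus in which every line is exactly one sample.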
Interested in training this model in the cloud? Try this repo on Google Colab
Embeddings
A language model is used to generate encodings for each manuscript which can be clustered for visualization applications and queried for entity searches. The embeddings are generated from the SciBERT model. The embeddings are then clustered using an approximate nearest neighbor technique (ANNOY) and queried with FAISS to provide recommendations on similar articles to an input prompt.
python db2embedding.py
to create a vector for each abstract in the database using an embedding from the SciBERT model
python db2annoy.py
to create an approximate nearest neighbor tree
python eval_nearest.py
or
python text_to_vec.py
to create vectors based on TF-IDF and PCA
python eval_tfidf.py
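To make the TF-IDF route concrete, here is a small pure-Python sketch that turns abstracts into sparse TF-IDF vectors and ranks them against a prompt by cosine similarity. It is an illustration under assumptions: all function names are hypothetical, the PCA step is omitted, and the search is exact brute force rather than the ANNOY or FAISS indexes used in this repo, which is only practical for small collections.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse, L2-normalised TF-IDF vector as a dict; unseen tokens are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two normalised sparse vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())


def nearest(query, docs, idf, k=3):
    """Return up to k (document index, similarity) pairs, most similar first."""
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(doc, idf)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


docs = [
    "Transit photometry of hot Jupiters reveals atmospheric escape",
    "Language models generate text from scientific abstracts",
    "Direct imaging of exoplanets with polarimetry",
]
idf = fit_idf(docs)
results = nearest("polarimetry for imaging exoplanets", docs, idf, k=2)
```

An exact search like this can also serve as a reference when checking whether an approximate index returns sensible neighbors on a small subset.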
RESTful API
Create a web server that exposes the generative model for a predictive keyboard and finds similar abstracts in real time.
- check: api.py
uvicorn api:app --reload
Examples
Text generation and nearest neighbor recommendations in a single app:
python -m bokeh serve --show bokeh_example.py
Upload to iOS
python gpt2_to_coreml.py
References
- https://huggingface.co/roberta-base
- https://huggingface.co/docs
- https://huggingface.co/transformers/training.html
- https://huggingface.co/transformers/notebooks.html
- https://colab.research.google.com/drive/1vsCh85T_Od7RBwXfvh1iysV-vTxmWXQO#scrollTo=ljknzOlNoyrv
- http://jalammar.github.io/illustrated-gpt2/
- https://github.com/huggingface/swift-coreml-transformers.git