/Paper_Title_Recommender

Generates titles for papers based on your abstract and scientific field

Primary LanguagePython

Paper Title Recommender

This repo is a work in progress, and currently not yet usable as described below. This repo contains the code to train and apply a paper title recommender system. There are two title generation modes:

  • MLE title
    In this mode, the most probable (maximum likelihood estimator) title is selected via a (greedy) beamsearch algorithm.
  • Random title
    In this mode, the user can specify the Temperature and generate titles based on random sampling

How to use

Generating Titles

There are two main ways to run the code. The first mode is by specifying the required text fields via de CLI:
$python main --category="CATGEGORY" --abstract="ABSTRACT" --name="FILENAME"

A more convenient, which supports batch processing, is refering to a .csv file:
$python main --file="./inputs/MYFILES.csv" --name="FILENAME"

The results are saved at ./results/FILENAME.txt and ./results/FILENAME.json

An example of the generated output is shown here:
TODO output example
GPT2-medium succeeds at producing good sounding titles based on the given context. One should note that many of the titles produces clearly don't understand the science being discussed. So while the titles generated sound like plausible titles, and may in fact be very suitable, many of the generated titles don't make scientific sense. From manual inspection it appears that the quality of titles deteriorates with shorter abstracts. This makes sense as the algorithm has less context to work with. In these cases it is also the case that a non-expert could not have been generated the actual title given the context.

I recommend you use this algoritm as a way to boost your own creativity and hopefully be inspired by some of the suggestions. \

Training

If you want to do your own training, you can fetch the data using the python kaggle API. Download kaggle.json from the website, and put it in ~/.kaggle/. Next, run: $python fetch_dataset.py This fetches the arxiv metadata dataset and stores it at ./data. You need to unzip the file yourself to continue.

How it works

This project makes use of the Huggingface library to perform abstractive title generation.
This is achieved by finetuning GPT2 on the Arxiv Metdata dataset.
For my own experiments, I finetuned the network for roughly 20.000 timesteps with a batchsize of 8.
Different values might be more optimal, but the results generated with these settings were pleasing upon manual inspection of random samples.