Experiments on Summarization-Driven Book Generation

This repository contains the main components of the prototype implementation of Beta Writer, the algorithmic author of the first machine-generated research book published by Springer Nature, developed by Niko Schenk, Samuel Rönnqvist and other members of the Applied Computational Linguistics Lab.

Installation

Required dependencies:

brew install python3
pip3 install numpy
pip3 install scikit-learn
pip3 install scipy
pip3 install matplotlib
pip3 install gensim
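
To confirm that the Python dependencies above are importable before running the pipeline, a small check along the following lines may help. This is only a sketch: the helper name `missing_modules` is hypothetical and not part of this repository. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "sklearn", "scipy", "matplotlib", "gensim"]


def missing_modules(names):
    """Return the names from `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the same interpreter that pipeline.sh will use, since packages installed for one Python installation are not visible to another.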

Install Mate tools and place the libraries and models into the /mate directory. See mate/README.txt for more details.

Download Stanford CoreNLP and citeproc-java.

Ideally, open beta_writer as a NetBeans project, link the downloaded .jar files to the project, and build beta_writer.jar. The executable .jar should appear in beta_writer/dist/.

Quickstart

The script pipeline.sh contains all modules for end-to-end book generation.

Please point PYTHON to your local Python installation (change line 32 in pipeline.sh).

sh pipeline.sh CORPUS_DIR gen/

where CORPUS_DIR is the path to your A++ files and gen/ is the directory that will contain all generated files.

Inspect the generated book.html in the gen/ folder.
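
If you prefer to start the run from Python, the Quickstart command can be wrapped as sketched below. The function names `build_command` and `run_pipeline` are hypothetical and not part of this repository. The sketch assumes that it is executed from the directory containing pipeline.sh and that book.html is written directly into the output directory, as described above.

```python
import subprocess
from pathlib import Path


def build_command(corpus_dir, out_dir, script="pipeline.sh"):
    """Build the argument list for the Quickstart command: sh pipeline.sh CORPUS_DIR gen/"""
    return ["sh", script, str(corpus_dir), str(out_dir)]


def run_pipeline(corpus_dir, out_dir="gen/"):
    """Run the pipeline and return the path where book.html is expected."""
    if not Path(corpus_dir).is_dir():
        raise FileNotFoundError(f"Corpus directory not found: {corpus_dir}")
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(build_command(corpus_dir, out_dir), check=True)
    return Path(out_dir) / "book.html"
```

Passing the arguments as a list avoids shell quoting problems when the corpus path contains spaces.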

Description

Note that Beta Writer was originally tailored to consume and process Springer's custom document format (A++) and does not (yet) support generic PDF.

We currently provide scripts for the major text processing tasks, including:

Preprocessing (e.g., entity masking of chemical compounds with mask_entities.py)
Book structure generation (mkstructure_html.py) and visualization (plot.py)
Syntactic restructuring/paraphrasing (restructuring.py)
Synonym generation (synonyms.py)
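
To illustrate the idea behind entity masking in the preprocessing step, the sketch below replaces known entity strings with placeholder tokens and restores them afterwards, so that later steps such as paraphrasing or synonym replacement cannot alter them. It does not reproduce the logic of mask_entities.py: the function names, the placeholder format, and the assumption that entities arrive as a ready-made list of strings are all hypothetical.

```python
def mask_entities(text, entities):
    """Replace each entity string with a placeholder; return masked text and the mapping."""
    mapping = {}
    # Mask longer entities first so that a short entity never splits a longer one.
    ordered = sorted(set(entities), key=lambda e: (-len(e), e))
    for index, entity in enumerate(ordered):
        # Hypothetical placeholder format; assumed not to occur in the input text.
        placeholder = f"ENTITY{index}X"
        if entity in text:
            text = text.replace(entity, placeholder)
            mapping[placeholder] = entity
    return text, mapping


def unmask_entities(text, mapping):
    """Restore the original entity strings from their placeholders."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text


if __name__ == "__main__":
    sentence = "Sodium chloride and sodium hydroxide react."
    masked, mapping = mask_entities(sentence, ["Sodium chloride", "sodium hydroxide"])
    print(masked)
    print(unmask_entities(masked, mapping))
```

Keeping the mapping alongside the masked text makes the step reversible, which matters when the masked names must reappear verbatim in the generated book.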

The current release uses TextRank for extractive summarization.
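
For readers unfamiliar with TextRank, the following minimal sketch shows the general technique: sentences become nodes in a graph, edges are weighted by word overlap, and a PageRank-style iteration ranks the sentences. It is not the code used in this release. The function names, the simple regex tokenizer, and the parameter defaults are assumptions made for illustration.

```python
import re

import numpy as np


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity_matrix(sentences):
    """Word-overlap similarity, normalized by the log lengths of both sentences."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Skip self-links and sentences too short for a positive log length.
            if i == j or len(tokens[i]) < 2 or len(tokens[j]) < 2:
                continue
            overlap = len(tokens[i] & tokens[j])
            sim[i, j] = overlap / (np.log(len(tokens[i])) + np.log(len(tokens[j])))
    return sim


def textrank(sentences, k=2, damping=0.85, iterations=50):
    """Return the k highest-ranked sentences, kept in document order."""
    n = len(sentences)
    if n == 0:
        return []
    sim = similarity_matrix(sentences)
    # Normalize each row so outgoing edge weights sum to 1; isolated sentences keep a zero row.
    row_sums = sim.sum(axis=1, keepdims=True)
    transition = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores
    top = sorted(np.argsort(-scores)[:k])
    return [sentences[i] for i in top]
```

Calling `textrank(list_of_sentences, k=3)` returns three sentences in their original order. Because the summary is extractive, every returned sentence is copied verbatim from the input.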

For more implementation details, please refer to our system pipeline description in Section 2.3.

License

This project is open-source software released under the MIT license.

chiarcos/book-gen