ChemDataWriter is a transformer-based library for automatically generating research books in the chemistry area.
reader.py
processes the raw XML/HTML paper files and converts them into a Reader object.finder.py
suggests the key topics of the paper corpusretrieve.py
retrieves the papers from the corpus based on the key topicssummariser.py
summarises the papers based on the input textparaphraser.py
paraphrases the summarised texttitle_generator.py
generates the short title in terms of the original long titlelatex_generator.py
reorganises the summarised text and converts it into a LaTeX file
Please install Python version 3.8.10 due to the compatibility of the dependencies.
Create the virtual environment and install the dependencies:
python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
You could also use pip
to install the toolkit once the paper is published:
pip install chemdatawriter
Docker image is also available:
docker pull sh2009/dockerhub:chemdatawriter.v.0.0.1
docker run -it sh2009/dockerhub:chemdatawriter.v.0.0.1
To create a clean paper file with the title, abstract, introduction, and conclusion from the HTML/XML files, and extract the reference information automatically, run:
python run_corpus.py \
--input_path <path to the input directory> \
--output_path <path to the output JSON file>
To auto-generate the research book using the output JSON file above, run:
python run_cdw.py \
--input_path <path to the input JSON file> \
--output_path <path to the output JSON file> \
--cache_path <path to the cache directory> \
--keywords <keywords to screen the papers> \
--topic_words <topic words of each chapter> \
--chapter_size <number of papers in each chapter>
Developers are welcome to contribute to the project by either raising an issue or submitting a pull request. Please make sure the unit tests are passed before submitting a pull request.
To run the unit tests, run:
python -m unittest
We also provide the evaluation scripts for the topic modelling and paraphrasing models.
Please refer to the demo.ipynb
folder for more details.
Note that you need to install additional dependencies to run the evaluation scripts according to the instructions in the notebook.
This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10) and Christ's College, Cambridge. The Argonne Leadership Computing Facility, which is a DOE Office of Science Facility, is also acknowledged for use of its research resources, under contract No. DEAC02-06CH11357.
S.Huang, J. M. Cole, ChemDataWriter: A Transformer-based Toolkit for Auto-generating Books That Summarise Research (2023) DOI: 10.1039/d3dd00159h