BatteryDataExtractor is battery-aware text-mining software with embedded BERT models for automatically extracting chemical information from the scientific literature. Full details are available in the Documentation.
- Open-source battery-specific literature-mining toolkit
- Double-turn question-answering model for the data extraction of materials and properties
- BERT-based token-classification models: abbreviation detection, part-of-speech tagging, chemical-named-entity recognition
- State-of-the-art performance on downstream evaluation data sets
- Updated NLP plugins: new web scrapers, document readers, and tokenizers
- New options: database auto-saving, original text-saving, and device-selection
pip install batterydataextractor
Note: Only Python versions <= 3.9.13 are supported, due to a spaCy dependency conflict.
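The version constraint in the note can be checked before installing. The following is a minimal sketch, assuming only the 3.9.13 upper bound stated above; `MAX_SUPPORTED` and `is_supported_python` are hypothetical names used for illustration and are not part of BatteryDataExtractor.

```python
import sys

# Upper bound taken from the installation note; this constant is an
# illustrative assumption, not something exported by the package.
MAX_SUPPORTED = (3, 9, 13)


def is_supported_python(version=None):
    """Return True if a (major, minor, micro) version is <= 3.9.13.

    When no version is given, the running interpreter is checked.
    """
    if version is None:
        version = sys.version_info[:3]
    # Tuples compare element by element, so (3, 9, 14) > (3, 9, 13).
    return tuple(version) <= MAX_SUPPORTED
```

Calling `is_supported_python()` with no argument checks the interpreter that would run `pip install`.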
>>> from batterydataextractor.doc import Document
>>> doc = Document("The theoretical capacity of graphite is 372 mAh/g... In the case of LiFePO4 chemistry, the absolute maximum voltage is 4.2V per cell.")
>>> doc.add_models_by_names(["capacity", "voltage"])
>>> records = doc.records
>>> for r in records:
...     print(r.serialize())
{'PropertyData': {'value': [372.0], 'units': 'mAh / g', 'raw_value': '372 mAh / g', 'specifier': 'capacity', 'material': 'graphite', 'confidence_score': 0.6248}}
{'PropertyData': {'value': [4.2], 'units': 'V', 'raw_value': '4.2 V', 'specifier': 'voltage', 'material': 'LiFePO4', 'confidence_score': 0.6432}}
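Each serialized record carries a `confidence_score`, so low-confidence extractions can be filtered out after the fact. The sketch below works on plain dictionaries shaped like the output above and does not call the library; `filter_by_confidence` is a hypothetical helper, and the 0.63 threshold is an arbitrary value chosen for illustration.

```python
def filter_by_confidence(serialized_records, threshold=0.5):
    """Keep serialized records whose confidence_score is >= threshold."""
    kept = []
    for record in serialized_records:
        # Each serialized record is a one-key dict, e.g. {'PropertyData': {...}}.
        for fields in record.values():
            if fields.get("confidence_score", 0.0) >= threshold:
                kept.append(record)
                break
    return kept


# Sample data copied from the output shown above.
sample = [
    {'PropertyData': {'value': [372.0], 'units': 'mAh / g', 'raw_value': '372 mAh / g',
                      'specifier': 'capacity', 'material': 'graphite',
                      'confidence_score': 0.6248}},
    {'PropertyData': {'value': [4.2], 'units': 'V', 'raw_value': '4.2 V',
                      'specifier': 'voltage', 'material': 'LiFePO4',
                      'confidence_score': 0.6432}},
]

# Arbitrary threshold for illustration; only the voltage record passes.
high_confidence = filter_by_confidence(sample, threshold=0.63)
```

In practice the input would be `[r.serialize() for r in doc.records]`.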
Extract general information by providing its name:
>>> from batterydataextractor.doc.text import Paragraph
>>> text = '1H NMR spectra were recorded on a Varian MR-400 MHz instrument.'
>>> doc = Paragraph(text)
>>> doc.add_general_models(["apparatus"], confidence_threshold=0.1, original_text=True)
>>> for record in doc.records:
...     print(record.serialize())
{'GeneralInfo': {'answer': 'Varian MR - 400 MHz instrument', 'specifier': 'apparatus', 'confidence_score': 0.5065, 'original_text': '1H NMR spectra were recorded on a Varian MR - 400 MHz instrument .'}}
Ask self-defined questions:
>>> from batterydataextractor.doc.text import Paragraph
>>> text = 'For current LIBs based on OLE system, the employed cathodes could be mainly divided into two categories: LCO is still very popular in the consumer electronics market and Ni-rich compounds have already taken a place in the electric vehicles where the Tesla LiNi0.8Co0.15Al0.05O2 (NCA) cathode is a good example.'
>>> doc = Paragraph(text)
>>> doc.add_general_models(["Which cathode is commonly used in electric vehicles?"], confidence_threshold=0.1, self_defined=True)
>>> for record in doc.records:
...     print(record.serialize())
{'GeneralInfo': {'answer': 'Ni - rich compounds', 'specifier': 'Which cathode is commonly used in electric vehicles?', 'confidence_score': 0.1489}}
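With a low `confidence_threshold` such as 0.1, it can be useful to keep only the highest-scoring answer for each specifier or question. The following is a minimal sketch operating on serialized `GeneralInfo` dictionaries like those shown above; `best_by_specifier` is a hypothetical helper, and whether the library returns more than one record per specifier is an assumption here.

```python
def best_by_specifier(serialized_records):
    """Map each specifier to its highest-confidence 'GeneralInfo' fields."""
    best = {}
    for record in serialized_records:
        fields = record.get("GeneralInfo")
        if fields is None:
            continue  # skip other record types such as 'PropertyData'
        key = fields["specifier"]
        current = best.get(key)
        if current is None or fields["confidence_score"] > current["confidence_score"]:
            best[key] = fields
    return best


# Sample data copied from the outputs shown above (original_text omitted).
sample = [
    {'GeneralInfo': {'answer': 'Varian MR - 400 MHz instrument',
                     'specifier': 'apparatus',
                     'confidence_score': 0.5065}},
    {'GeneralInfo': {'answer': 'Ni - rich compounds',
                     'specifier': 'Which cathode is commonly used in electric vehicles?',
                     'confidence_score': 0.1489}},
]

answers = best_by_specifier(sample)
print(answers['apparatus']['answer'])
```

Records of other types are ignored, so the helper can be applied to a mixed list of serialized records.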
Usage of the new NLP toolkit is described in the Documentation. BERT-based functionalities include part-of-speech (POS) tagging, abbreviation detection, and chemical named entity recognition.
This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10), and Christ's College, Cambridge. The Argonne Leadership Computing Facility, which is a DOE Office of Science Facility, is also acknowledged for use of its research resources, under Contract No. DE-AC02-06CH11357.
S. Huang, J. M. Cole, BatteryDataExtractor: battery-aware text-mining software embedded with BERT models, Chem. Sci., 2022, 13, 11487-11495.
@article{huang2022batterydataextractor,
title={BatteryDataExtractor: battery-aware text-mining software embedded with BERT models},
author={Huang, Shu and Cole, Jacqueline M},
journal={Chemical Science},
volume={13},
number={39},
pages={11487--11495},
year={2022},
publisher={Royal Society of Chemistry}
}