/matsci-nlp-cleaner

Preprocessing for MatSciNLP project

Primary LanguagePython

Set up

  1. Make sure you have python3.6 and the pip module installed. We recommend using conda environments.
  2. Navigate to the root folder of this repository (the same folder that contains this README file) and run pip install -r requirements.txt. Note: If you are using a conda env and any packages fail to compile during this step, you may need to first install those packages separately with conda install package_name.
  3. Wait for all the requirements to be downloaded and installed.
  4. Run python setup.py install to install this module. This will also download the Word2vec model files. If the download fails, manually download the model, word embeddings and output embeddings and put them in mat2vec/training/models.
  5. Finalize your chemdataextractor installation by executing cde data download (You may need to restart your virtual environment for the cde command line interface to be found).
  6. You are ready to go!

Processing

Example python usage:

from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
text_processor.process("LiCoO2 is a battery cathode material.")

(['CoLiO2', 'is', 'a', 'battery', 'cathode', 'material', '.'], [('LiCoO2', 'CoLiO2')])

For the various methods and options see the docstrings in the code.