CEDe: A Python repository from rshormazabal

CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition

About CEDe

Optical Chemical Structure Recognition (OCSR) deals with the translation of chemical images into molecular structures; such images are the main way chemical compounds are depicted in scientific documents. Traditional rule-based methods follow a framework based on the detection of atoms and bonds, followed by the reconstruction of the compound structure. Recently, neural architectures analogous to image captioning have been explored to solve this task, yet they remain data-inefficient, requiring millions of examples just to reach performance comparable to traditional methods. To motivate and benchmark new approaches based on atom-level entity detection and graph reconstruction, we present CEDe, a unique collection of chemical entity bounding boxes manually curated by experts for scientific literature datasets. Together, these annotations comprise more than 700,000 chemical entity bounding boxes with the information necessary for structure reconstruction. In addition, a large synthetic dataset containing 1 million molecular images and annotations is released to explore transfer-learning techniques that could help these architectures perform better in low-data regimes. Benchmarks show that detection-reconstruction models can achieve performance on par with or better than image-captioning-like models, even with 100x fewer training examples.

This repository currently contains the code for sampling, synthetic data generation, and visualization of the CEDe dataset. Models and implementations for benchmarks will be released soon.

Download CEDe

We provide several options for downloading the CEDe dataset. Image data and annotations can be downloaded separately or as one compressed file. The synthetic data is also provided in several sizes (each smaller dataset is fully contained in the larger ones).

CEDe real data

Full (135.7MB) | Annotations (194MB) | Train split annotations (38.5MB) | Test split annotations (156MB) | Images (53.6MB)

Synthetic data

10K images: Full (334MB) | Annotations (177MB) | Images (320MB)

50K images: Full (1.6GB) | Annotations (887MB) | Images (1.6GB)

100K images: Full (3.3GB) | Annotations (1.7GB) | Images (3.1GB)

1M images: Full (32.5GB) | Annotations (17.3GB) | Images (31.2GB)
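
Once downloaded, the annotations can be inspected with a few lines of Python. The sketch below assumes the annotation file follows the common COCO layout (top-level `images`, `categories` and `annotations` lists, with `bbox` given as `[x, y, width, height]`). The `coco_dataset_metadata` config section hints at this, but the README does not spell out the schema, so check the keys against the actual files. The helper names `load_annotations` and `boxes_per_image`, the file name, and the sample values are illustrative only.

```python
import json
from collections import defaultdict


def load_annotations(path):
    """Read an annotation JSON file from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def boxes_per_image(coco):
    """Group (label, bbox) pairs by image file name.

    Assumes COCO-style keys; adjust if the real schema differs.
    """
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    id_to_label = {cat["id"]: cat["name"] for cat in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        file_name = id_to_file[ann["image_id"]]
        grouped[file_name].append((id_to_label[ann["category_id"]], ann["bbox"]))
    return dict(grouped)


# Tiny hand-written sample (hypothetical values, not real CEDe data).
# With a downloaded file you would call load_annotations("<path to json>").
sample = {
    "images": [{"id": 1, "file_name": "mol_0001.png"}],
    "categories": [{"id": 1, "name": "C"}, {"id": 2, "name": "O"}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [10, 12, 8, 8]},
        {"image_id": 1, "category_id": 2, "bbox": [30, 12, 8, 8]},
    ],
}

grouped = boxes_per_image(sample)
for file_name, boxes in grouped.items():
    print(file_name, boxes)
```

Grouping by image first keeps later steps, such as drawing boxes or rebuilding a molecular graph per image, independent of the order in which annotations are stored.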

Getting Started

Prerequisites

First, install RDKit by following the RDKit installation instructions.

PIP

git clone https://github.com/rshormazabal/CEDe.git
cd CEDe
pip install -r requirements.txt

Conda

git clone https://github.com/rshormazabal/CEDe.git
cd CEDe
conda env create -f environment.yml -n cede_generation

Config file documentation

A detailed example for the config file can be found in ./conf/config_example.yaml.

root_folder: Project root folder. [str] 
annotations_json_filename: Synthetic CEDe annotations JSON filename. [str] 
sampling:
  data_path: Data folder path. [str]
  dataset_name: Whether to use 'SMILES' or 'InChI' as dataset. [str](SMILES, InChI)
  download: Redownload PUBCHEM dataset. [bool]
  load_pickle: Load previously generated metadata pickle file. [bool]
  pickle_filename: Metadata pickle filename. [str]

  n_jobs: Number of jobs for metadata generation. [int]

  pubchem_nrows: Number of rows to read out of pubchem file before filtering. [int]
  max_number_data: Maximum number of data to generate after filtering. [int]

  # parameters to filter cases
  sequence_filters:
    # SMILES compounds containing these characters are kept. However, other tokens are also 
    # included in the final annotations, since they can be contained within these structures.
    chars_to_keep: List of non-atom characters to filter dataset. [list of str]
    atoms_to_keep: List of atoms to filter dataset. [list of str]
    # SMILES compounds containing these characters will be removed.
    chars_to_drop: List of characters to remove structures from dataset. [list of str]
    max_len: Maximum length of SMILES string. [int]
    min_len: Minimum length of SMILES string. [int]

  metadata_filters:
    max_atoms: Maximum number of atoms in a molecule (not characters). [int]

generation:
  generated_data_folder: Path to store generated data. [str] 
  pseudo_type: Sets the pseudoatom generation from ['R', 'random', 'given']. [str]
  # 'R' generates only '[R{number}]' style pseudoatoms. 
  # 'random' generates a random string to attach as pseudoatom.
  # 'given' chooses from the file specified in 'pseudoatoms_lib_path'. 
  pseudo_prob: Probability to replace an atom with a pseudoatom. [float]
  img_size: Size of generated images. [int]
  n_jobs: Number of jobs for the data generation process. [int]
  pseudoatoms_lib_path: Path to pseudoatoms library csv file. [str] 
  bbox_margin: Range for letter instance margins. (int, int)         
  non_letter_carbon_margin: Range for non-letter carbon instance margins. (int, int) 
  aug_sample_params:
    linewidth_range: Range for linewidth augmentation. (int, int)
    font_size_range: Range for font size augmentation. (int, int)
    rotation_angle_range: Range for rotation angle augmentation. (int, int)
    xy_sheer_range: Range for xy shear augmentation. (int, int)
    fonts_path: Path to library of font files. [str]
coco_dataset_metadata:
  description: Description of the dataset instance. [str] 
  url: URL to download dataset. [str] 
  version: Identifier version. [str]
  year: Year of dataset creation. [int]
  contributor: Contributor name. [str]
  creation_date: Date of dataset creation. [str]
  license_url: License URL. [str] 
  license_name: License name. [str]
global_seed: Global seed for the project (sets numpy, pandas, random, torch, etc). [int]
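
To make the `sequence_filters` options more concrete, here is a minimal sketch of how a length and forbidden-character filter over SMILES strings could work. It is not the repository's implementation. It assumes that `min_len` and `max_len` are inclusive bounds and that a structure is removed if it contains any entry of `chars_to_drop`. `chars_to_keep` and `atoms_to_keep` are left out because their exact semantics are not specified here. The function name `passes_sequence_filters` and the example values are hypothetical.

```python
def passes_sequence_filters(smiles, chars_to_drop, min_len, max_len):
    """Return True if a SMILES string survives the (assumed) filters."""
    # Assumption: both length bounds are inclusive.
    if not (min_len <= len(smiles) <= max_len):
        return False
    # Assumption: any single forbidden character removes the structure.
    return not any(ch in smiles for ch in chars_to_drop)


# Hypothetical filter values, not the defaults shipped with the repository.
candidates = ["CCO", "C", "CC.O", "c1ccccc1O"]
kept = [
    s for s in candidates
    if passes_sequence_filters(s, chars_to_drop=["."], min_len=2, max_len=8)
]
print(kept)
```

Here `"C"` is too short, `"CC.O"` contains a dropped character, and `"c1ccccc1O"` exceeds `max_len`, so only `"CCO"` remains.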

How to run

After installing dependencies, you can generate data by specifying a Hydra config filename in main_generation.py. To override specific parameters directly on the CLI, refer to the Hydra documentation.

python main_generation.py <HYDRA OPTIONS>

Example

python main_generation.py sampling.pubchem_nrows=1000000 sampling.max_number_data=5000000 generation.pseudo_prob=0.8
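
The dotted keys in the command above address nested entries of the config file. The sketch below illustrates that mapping in plain Python. It is a simplified stand-in and not Hydra's actual override parser, which supports a much richer grammar. The helper names `parse_scalar` and `apply_overrides` and the starting values in `config` are hypothetical.

```python
def parse_scalar(raw):
    """Convert a raw CLI value to int or float when possible, else keep it as str."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Apply 'a.b=value' style overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_scalar(raw)
    return config


# Hypothetical starting values, not the repository defaults.
config = {
    "sampling": {"pubchem_nrows": 1000, "max_number_data": 100},
    "generation": {"pseudo_prob": 0.1},
}

apply_overrides(config, [
    "sampling.pubchem_nrows=1000000",
    "sampling.max_number_data=5000000",
    "generation.pseudo_prob=0.8",
])
print(config)
```

Keys that are not overridden keep the values from the config file, so only the parameters that differ from a run's defaults need to appear on the command line.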

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/legalcode