RCSB Python I/O Utility Classes

Introduction

This module contains a collection of utility classes for performing I/O operations on common file formats encountered in the PDB data repository.

Installation

Download the library source software from the project repository:

git clone --recurse-submodules https://github.com/rcsb/py-rcsb_utils_io.git

Optionally, run test suite (Python versions 2.7, and 3.9) using setuptools or tox:

python setup.py test

or simply run

tox

Installation is via the program pip.

pip install rcsb.utils.io

or from the local repository:

pip install .

Usage

The MarshalUtil offers an easy way for reading in and writing out files in various formats, including CSV, JSON, pickle, mmCIF, bcif (BinaryCIF), fasta , and "list" files (plain text file in which each row is a list item).

Reading files

Let's say you have a JSON file, "data.json". You can read this in by:

from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil(workDir=".")

dataD = mU.doImport("data.json", fmt="json")

The same method works even if the file is compressed (e.g., "data.json.gz"):

dataD = mU.doImport("data.json.gz", fmt="json")

Note that this automatic handling of compressed gzip files applies to any type of input format.

You can also import remote files directly from the command line, e.g.:

dataD = mU.doImport("https://files.rcsb.org/pub/pdb/holdings/current_file_holdings.json.gz", fmt="json")

To read in a pickle file, "data.pic":

from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()

dataD = mU.doImport("data.pic", fmt="pickle")

To read in and parse an mmCIF file, "4hhb.cif.gz":

from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()

# Read all data containers from the mmCIF file into `dataContainerList`
dataContainerList = mU.doImport("https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/hh/4hhb.cif.gz", fmt="mmcif")

# Get the first dataContainer (in most cases, there will only be one container in the file)
dataContainer = dataContainerList[0]

# Print the name of the container
eName = dataContainer.getName()
print(eName)

# Get the list of categories
catNameList = dataContainer.getObjNameList()
print(catNameList)

# Iterate over all the categories and attributes and store them in a new dictionary 
cifDataD = {}
for catName in catNameList:
    if not dataContainer.exists(catName):
        continue
    dObj = dataContainer.getObj(catName)
    for ii in range(dObj.getRowCount()):
        dD = dObj.getRowAttributeDict(ii)
        cifDataD.setdefault(eName, {}).setdefault(catName, []).append(dD)

For more examples, see testMarshallUtil.py.

Writing files

You can use the MarshalUtil to write out the following data structures into the corresponding file formats:

 Object            |  Output `fmt`
-------------------------------------
 list              |  list
 dict              |  json or pickle
 DataContainerList |  mmcif or bcif

For example, if you have a dictionary, dataD, you can export it via:

from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()

dataD = {"name": "John Doe", "age": "33"}

mU.doExport("data.json", dataD, fmt="json", indent=2)

# Or, to export and compress as gzip:
mU.doExport("data.json.gz", dataD, fmt="json", indent=2)