This module contains a collection of utility classes for performing I/O operations on common file formats encountered in the PDB data repository.
Download the library source software from the project repository:
git clone --recurse-submodules https://github.com/rcsb/py-rcsb_utils_io.git
Optionally, run the test suite (Python versions 2.7 and 3.9) using setuptools or tox:
python setup.py test
or simply run
tox
Install the package with pip:
pip install rcsb.utils.io
or from the local repository:
pip install .
The MarshalUtil class offers an easy way to read and write files in various formats, including CSV, JSON, pickle, mmCIF, bcif (BinaryCIF), fasta, and "list" files (plain text files in which each row is a list item).
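To make the "list" format concrete, here is a minimal standard-library sketch of a one-item-per-line round trip. The helpers `write_list_file` and `read_list_file` are hypothetical names used only for illustration; they are not part of rcsb.utils.io, and MarshalUtil's own handling may differ in details such as whitespace or encoding. The identifiers written to the file are placeholders.

```python
import os
import tempfile


def write_list_file(path, items):
    """Write one item per line (hypothetical helper, not the library's API)."""
    with open(path, "w", encoding="utf-8") as fh:
        for item in items:
            fh.write(str(item) + "\n")


def read_list_file(path):
    """Read the file back into a list, one item per non-empty line."""
    with open(path, "r", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


# Placeholder identifiers, written to a temporary directory
tmp_dir = tempfile.mkdtemp()
list_path = os.path.join(tmp_dir, "ids.list")
write_list_file(list_path, ["4HHB", "1CBS", "2HHB"])

id_list = read_list_file(list_path)
print(id_list)
```

With MarshalUtil, the equivalent read would presumably use doImport with fmt="list", following the pattern of the examples below.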
Suppose you have a JSON file, "data.json". You can read it in with:
from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil(workDir=".")
dataD = mU.doImport("data.json", fmt="json")
The same method works even if the file is compressed (e.g., "data.json.gz"):
dataD = mU.doImport("data.json.gz", fmt="json")
Note that this automatic handling of gzip-compressed files applies to any input format.
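For readers curious what transparent gzip handling can look like, the sketch below uses only the standard library. It assumes compression is inferred from a ".gz" suffix, which is an assumption of this sketch and not a statement about how MarshalUtil detects compression. The helper name `load_json_maybe_gz` is hypothetical.

```python
import gzip
import json
import os
import tempfile


def load_json_maybe_gz(path):
    """Load JSON from a plain or gzip-compressed file (hypothetical helper).

    Assumption for this sketch: a ".gz" suffix means the file is gzipped.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Write the same payload twice: once plain, once gzip-compressed
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "data.json")
gz_path = plain_path + ".gz"
payload = {"name": "John Doe", "age": "33"}

with open(plain_path, "w", encoding="utf-8") as fh:
    json.dump(payload, fh)
with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    json.dump(payload, fh)

# The caller uses one function for both files
from_plain = load_json_maybe_gz(plain_path)
from_gz = load_json_maybe_gz(gz_path)
```

The point of the design is that the calling code stays identical for compressed and uncompressed inputs; only the file opener changes.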
You can also import remote files directly by passing a URL in place of a local path, e.g.:
dataD = mU.doImport("https://files.rcsb.org/pub/pdb/holdings/current_file_holdings.json.gz", fmt="json")
To read in a pickle file, "data.pic":
from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()
dataD = mU.doImport("data.pic", fmt="pickle")
To read in and parse an mmCIF file, "4hhb.cif.gz":
from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()
# Read all data containers from the mmCIF file into `dataContainerList`
dataContainerList = mU.doImport("https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/hh/4hhb.cif.gz", fmt="mmcif")
# Get the first dataContainer (in most cases, there will only be one container in the file)
dataContainer = dataContainerList[0]
# Print the name of the container
eName = dataContainer.getName()
print(eName)
# Get the list of categories
catNameList = dataContainer.getObjNameList()
print(catNameList)
# Iterate over all the categories and attributes and store them in a new dictionary
cifDataD = {}
for catName in catNameList:
    if not dataContainer.exists(catName):
        continue
    dObj = dataContainer.getObj(catName)
    for ii in range(dObj.getRowCount()):
        dD = dObj.getRowAttributeDict(ii)
        cifDataD.setdefault(eName, {}).setdefault(catName, []).append(dD)
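The loop above builds a nested dictionary of the form container name → category name → list of row dictionaries. The sketch below reproduces that shape with plain Python objects so the structure can be inspected without downloading a file. The container name, category names, and row values are made-up placeholders, not data from 4hhb.

```python
# Placeholder rows standing in for what getRowAttributeDict() returns;
# category names and values are invented for illustration only.
rows_by_category = {
    "category_a": [{"id": "1", "value": "x"}, {"id": "2", "value": "y"}],
    "category_b": [{"id": "1", "label": "z"}],
}

container_name = "EXAMPLE"
nested = {}
for cat_name, rows in rows_by_category.items():
    for row in rows:
        # setdefault creates the inner dict or list on first use, then reuses it
        nested.setdefault(container_name, {}).setdefault(cat_name, []).append(row)

# Access path: container -> category -> row index -> attribute
first_value = nested["EXAMPLE"]["category_a"][0]["value"]
print(first_value)

# Number of rows collected per category
row_counts = {cat: len(rows) for cat, rows in nested["EXAMPLE"].items()}
print(row_counts)
```

Chaining setdefault avoids explicit "if key not in dict" checks at each level, at the cost of constructing a throwaway default object on every call.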
For more examples, see testMarshalUtil.py.
You can use MarshalUtil to write the following data structures to the corresponding file formats:
Object | Output `fmt`
-------------------------------------
list | list
dict | json or pickle
DataContainerList | mmcif or bcif
For example, if you have a dictionary, dataD, you can export it via:
from rcsb.utils.io.MarshalUtil import MarshalUtil
mU = MarshalUtil()
dataD = {"name": "John Doe", "age": "33"}
mU.doExport("data.json", dataD, fmt="json", indent=2)
# Or, to export and compress as gzip:
mU.doExport("data.json.gz", dataD, fmt="json", indent=2)
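After an export like the one above, it can be useful to confirm that the output really is gzip-compressed and round-trips cleanly. The following standard-library sketch writes an equivalent file itself and does not call MarshalUtil, so it illustrates the check and not the library's implementation. The temporary output path is an assumption of the sketch.

```python
import gzip
import json
import os
import tempfile

data = {"name": "John Doe", "age": "33"}
out_path = os.path.join(tempfile.mkdtemp(), "data.json.gz")

# Write indented JSON through a gzip text stream
with gzip.open(out_path, "wt", encoding="utf-8") as fh:
    json.dump(data, fh, indent=2)

# Gzip streams begin with the two magic bytes 0x1f 0x8b
with open(out_path, "rb") as fh:
    magic = fh.read(2)

# Reading the file back should reproduce the original dictionary
with gzip.open(out_path, "rt", encoding="utf-8") as fh:
    round_trip = json.load(fh)

print(magic == b"\x1f\x8b", round_trip == data)
```

The same two checks can be pointed at a file written by doExport by replacing `out_path` with that file's path.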