Chroma is a generative model for designing proteins programmatically.
Protein space is complex and hard to navigate. With Chroma, protein design problems are represented in terms of composable building blocks from which diverse, all-atom protein structures can be automatically generated. As a joint model of structure and sequence, Chroma can also be used for common protein modeling tasks such as generating sequences given backbones, packing side-chains, and scoring designs.
We provide protein conditioners for a variety of constraints, including substructure, symmetry, shape, and neural-network predictions of some protein classes and annotations. We also provide an API for creating your own conditioners in a few lines of code.
Internally, Chroma uses diffusion modeling, equivariant graph neural networks, and conditional random fields to efficiently sample all-atom structures with a complexity that is sub-quadratic in the number of residues. It can generate large complexes in a few minutes on a commodity GPU. You can read more about Chroma, including biophysical and crystallographic validation of some early designs, in our paper, "Illuminating protein space with a programmable generative model" (Nature, 2023).
Note: An API key is required to download and use the pretrained model weights. It can be obtained here.
Colab Notebooks. The quickest way to get started with Chroma is our Colab notebooks, which provide starting points for a variety of use cases in a preconfigured, in-browser environment.
- Chroma Quickstart: GUI notebook demonstrating unconditional and conditional generation of proteins with Chroma.
- Chroma API Tutorial: Code notebook demonstrating protein I/O, sampling, and design configuration directly in Python.
- Chroma Conditioner API Tutorial: A deeper dive under the hood for implementing new Chroma Conditioners.
PyPI package. You can install the latest release of Chroma with:
pip install generate-chroma
Unconditional monomer. We provide a unified entry point to both unconditional and conditional protein design with the Chroma.sample() method. When no conditioners are specified, we can sample a simple 200-amino acid monomeric protein with:
from chroma import Chroma
chroma = Chroma()
protein = chroma.sample(chain_lengths=[200])
protein.to("sample.cif")
display(protein)
Generally, Chroma.sample() takes as input design hyperparameters and Conditioners and outputs Protein objects representing the all-atom structures of protein systems, which can be saved to and loaded from disk in PDB or mmCIF formats.
Unconditional complex. To sample a complex instead of a monomer, we can simply do
from chroma import Chroma
chroma = Chroma()
protein = chroma.sample(chain_lengths=[100, 200])
protein.to("sample-complex.cif")
Conditional complex. We can further customize sampling towards design objectives via Conditioners and sampling hyperparameters. For example, to sample a C3-symmetric homo-trimer with 100 residues per monomer, we can do
from chroma import Chroma, conditioners
chroma = Chroma()
conditioner = conditioners.SymmetryConditioner(G="C_3", num_chain_neighbors=2)
protein = chroma.sample(
    chain_lengths=[100],
    conditioner=conditioner,
    langevin_factor=8,
    inverse_temperature=8,
    sde_func="langevin",
    potts_symmetry_order=conditioner.potts_symmetry_order,
)
protein.to("sample-C3.cif")
Because compositions of conditioners are conditioners, even relatively complex design problems can follow this basic usage pattern. See the demo notebooks and docstrings for more information on hyperparameters, conditioners, and starting points.
Robust design. Chroma is a joint model of sequence and structure that uses a common graph neural network base architecture to parameterize both backbone generation and conditional sequence and sidechain generation. These sequence and sidechain decoders are diffusion-aware in the sense that they have been trained to predict sequence and side chain not just for natural structures at diffusion time
While all results presented in the Chroma publication were done with exact design at
The value of diffusion time conditioning can be set with the design_t parameter in Chroma.sample and Chroma.design. We find that for generated structures,
Design a la carte. Chroma's design network can be accessed separately to design, redesign, and pack arbitrary protein systems. Here we load a protein from the PDB and redesign it:
# Redesign a Protein
from chroma import Protein, Chroma
chroma = Chroma()
protein = Protein('1GFP')
protein = chroma.design(protein)
protein.to("1GFP-redesign.cif")
Clamped sub-sequence redesign is also available and compatible with a built-in selection algebra, along with position- and mutation-specific mask constraints:
# Redesign a selected region of a Protein
from chroma import Protein, Chroma
chroma = Chroma()
protein = Protein('my_favorite_protein.cif') # PDB is fine too
protein = chroma.design(protein, design_selection="resid 20-50 around 5.0") # 5 angstrom bubble around indices 20-50
protein.to("my_favorite_protein_redesign.cif")
We provide more examples of design in the demo notebooks.
Protein design with Chroma is programmable. Our Conditioner framework allows for automatic conditional sampling under arbitrary compositions of protein specifications, which can come in the form of restraints (biasing the distribution of states) or constraints (directly restricting the domain of the underlying sampling process); see Supplementary Appendix M in our paper. We have pre-defined multiple conditioners, including for controlling substructure, symmetry, shape, semantics, and natural-language prompts (see chroma.layers.structure.conditioners), which can be used in arbitrary combinations.
Conditioner | Class(es) in chroma.conditioners | Example applications
---|---|---
Symmetry constraint | SymmetryConditioner, ScrewConditioner | Large symmetric assemblies
Substructure constraint | SubstructureConditioner | Substructure grafting, scaffold enforcement
Shape restraint | ShapeConditioner | Molecular shape control
Secondary structure | ProClassConditioner | Secondary-structure specification
Domain classification | ProClassConditioner | Specification of class, such as Pfam, CATH, or Taxonomy
Text caption | ProCapConditioner | Natural language prompting
Sequence | SubsequenceConditioner | Subsequence constraints
How it works. The central idea of Conditioners is composable state transformations, where each Conditioner is a function that modifies the state and/or energy of a protein system in a differentiable way (Supplementary Appendix M). For example, to encode symmetry as a constraint, we can take as input the asymmetric unit and tessellate it according to the desired symmetry group to output a protein system that is symmetric by construction. To encode something like a neural network restraint, we can adjust the total system energy by the negative log probability of the target condition. For both of these, we add the diffusion energy to the output of the Conditioner(s) and then backpropagate the total energy through all intermediate transformations to compute the unconstrained forces that are compatible with generic sampling SDEs such as annealed Langevin dynamics.
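In symbols, the two mechanisms described above can be sketched as follows. The notation is ours and chosen only for illustration: f stands for a constraint map such as tessellation, U_diff for the diffusion energy, and p(y | x̃, t) for a predictor of the target condition y. See the paper for the authors' own formulation.

```latex
\begin{aligned}
\tilde{x} &= f(x)
  && \text{(constraint: transform the state)} \\
U_{\text{total}}(x, t) &= U_{\text{diff}}(\tilde{x}, t) - \log p(y \mid \tilde{x}, t)
  && \text{(restraint: add a negative log-probability)} \\
-\nabla_{x} U_{\text{total}}(x, t) &=
  -\left(\frac{\partial f}{\partial x}\right)^{\!\top}
  \nabla_{\tilde{x}} \left[ U_{\text{diff}}(\tilde{x}, t) - \log p(y \mid \tilde{x}, t) \right]
  && \text{(force on the unconstrained state)}
\end{aligned}
```

The last line is just the chain rule: backpropagating through f turns gradients on the constrained system into forces on the unconstrained coordinates that the sampler actually updates.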
We schematize this overall Conditioners framework below.
It is simple to develop new conditioners. A Conditioner is a PyTorch nn.Module which takes in the system state (i.e. the structure, energy, and diffusion time) and outputs potentially updated structures and energies as
from typing import Union

import torch


class Conditioner(torch.nn.Module):
    """A composable function for parameterizing protein design problems."""

    def __init__(self, *args, **kwargs):
        super().__init__()
        # Setup your conditioner's hyperparameters

    def forward(
        self,
        X: torch.Tensor,  # Input coordinates
        C: torch.LongTensor,  # Input chain map (for complexes)
        O: torch.Tensor,  # Input sequence (one-hot, not used)
        U: torch.Tensor,  # Input energy
        t: Union[torch.Tensor, float],  # Diffusion time
    ):
        # update_state and update_energy are placeholders for your own logic
        # Update the state, e.g. map from an unconstrained to constrained manifold
        X_update, C_update = update_state(X, C, t)
        # Update the energy, e.g. add a restraint potential
        U_update = U + update_energy(X, C, t)
        return X_update, C_update, O, U_update, t
Roughly speaking, Conditioners are composable by construction because their input and output type signatures are matched (i.e. each one is an endomorphism). So we can also simply build conditioners from conditioners by "stacking" them, much as we would in traditional neural network layer development. With the final Conditioner as an input, Chroma.sample() will then leverage PyTorch's automatic differentiation facilities to automatically furnish a diffusion-annealed MCMC sampling algorithm to sample with this conditioner. (We note this isn't magic, and taking care to scale and parameterize appropriately is important.)
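The stacking idea can be sketched without PyTorch or Chroma. The example below is a minimal illustration, not Chroma's implementation: shift_constraint, origin_restraint, and compose are hypothetical names, and plain Python lists and floats stand in for tensors. It only shows that functions mapping (X, C, O, U, t) to (X, C, O, U, t) chain freely; it does not compute gradients.

```python
from functools import reduce
from typing import Callable, List, Tuple

# State mirrors the (X, C, O, U, t) tuple described above, with plain Python
# types standing in for tensors: X is a list of 3D points, U a scalar energy.
Point = Tuple[float, float, float]
State = Tuple[List[Point], List[int], None, float, float]
ConditionerFn = Callable[[State], State]


def shift_constraint(offset: Point) -> ConditionerFn:
    """Hypothetical constraint: transforms coordinates, leaves the energy alone."""
    def apply(state: State) -> State:
        X, C, O, U, t = state
        X_new = [(x + offset[0], y + offset[1], z + offset[2]) for x, y, z in X]
        return X_new, C, O, U, t
    return apply


def origin_restraint(weight: float) -> ConditionerFn:
    """Hypothetical restraint: adds a harmonic penalty on distance from the origin."""
    def apply(state: State) -> State:
        X, C, O, U, t = state
        penalty = weight * sum(x * x + y * y + z * z for x, y, z in X)
        return X, C, O, U + penalty, t
    return apply


def compose(*fns: ConditionerFn) -> ConditionerFn:
    """Stack conditioners left to right; the result has the same signature."""
    return lambda state: reduce(lambda s, f: f(s), fns, state)


stacked = compose(shift_constraint((1.0, 0.0, 0.0)), origin_restraint(0.5))
state = ([(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)], [1, 1], None, 0.0, 0.5)
X_out, C_out, O_out, U_out, t_out = stacked(state)

# Because the signature is preserved, a stacked conditioner can itself be stacked.
twice = compose(stacked, stacked)
```

Order matters in this sketch: the restraint sees the coordinates after the constraint has been applied, which matches the idea of evaluating energies on the transformed system.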
The code snippet below shows how in a few lines of code we can add a conditioner that stipulates the generation of a 2D crystal-like object, where generated proteins are arrayed in an M x N rectangular lattice.
import torch
from chroma.models import Chroma
from chroma.layers.structure import conditioners


class Lattice2DConditioner(conditioners.Conditioner):
    def __init__(self, M, N, cell):
        super().__init__()
        # Setup the coordinates of a 2D lattice
        self.order = M * N
        x = torch.arange(M) * cell[0]
        y = torch.arange(N) * cell[1]
        xx, yy = torch.meshgrid(x, y, indexing="ij")
        dX = torch.stack([xx.flatten(), yy.flatten(), torch.zeros(M * N)], dim=1)
        self.register_buffer("dX", dX)

    def forward(self, X, C, O, U, t):
        # Tessellate the unit cell on the lattice
        X = (X[:, None, ...] + self.dX[None, :, None, None]).reshape(1, -1, 4, 3)
        C = torch.cat([C + C.unique().max() * i for i in range(self.dX.shape[0])], dim=1)
        # Average the gradient across the group (simplifies force scaling)
        X.register_hook(lambda gradX: gradX / self.order)
        return X, C, O, U, t


chroma = Chroma().cuda()
conditioner = Lattice2DConditioner(M=3, N=4, cell=[20., 15.]).cuda()
protein = chroma.sample(
    chain_lengths=[70],
    conditioner=conditioner,
    sde_func='langevin',
    potts_symmetry_order=conditioner.order,
)
protein.to_CIF("lattice_protein.cif")
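To see what the tessellation step produces, the sketch below reproduces the lattice offsets and the chain-map tiling with plain Python lists instead of tensors. The helper names lattice_offsets and tile_chain_map are hypothetical and not part of Chroma. The loop order is an assumption chosen to match flattening a meshgrid built with indexing="ij" (first axis varies slowest).

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def lattice_offsets(M: int, N: int, cell: Tuple[float, float]) -> List[Vec3]:
    """Translation vectors for an M x N lattice in the z = 0 plane.

    The outer loop runs over the first axis, matching the row-major
    flattening of an "ij"-indexed meshgrid.
    """
    return [(i * cell[0], j * cell[1], 0.0) for i in range(M) for j in range(N)]


def tile_chain_map(C: List[int], num_copies: int) -> List[int]:
    """Repeat a per-residue chain map once per copy, offsetting chain IDs
    by the largest ID so that every copy gets distinct chains."""
    top = max(C)
    return [c + top * k for k in range(num_copies) for c in C]


offsets = lattice_offsets(M=3, N=4, cell=(20.0, 15.0))
chains = tile_chain_map([1, 1, 1], num_copies=len(offsets))
unique_chains = sorted(set(chains))
```

With M=3 and N=4 this gives twelve copies of the unit cell, which is why the conditioner above sets its order to M * N and divides the gradient by that value.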
An attractive aspect of this conditioner framework is that it is very general, enabling both constraints (which involve operations on
If you use Chroma in your research, please cite:
J. B. Ingraham, M. Baranov, Z. Costello, K. W. Barber, W. Wang, A. Ismail, V. Frappier, D. M. Lord, C. Ng-Thow-Hing, E. R. Van Vlack, S. Tie, V. Xue, S. C. Cowles, A. Leung, J. V. Rodrigues, C. L. Morales-Perez, A. M. Ayoub, R. Green, K. Puentes, F. Oplinger, N. V. Panwar, F. Obermeyer, A. R. Root, A. L. Beam, F. J. Poelwijk, and G. Grigoryan, "Illuminating protein space with a programmable generative model", Nature, 2023 (10.1038/s41586-023-06728-8).
@Article{Chroma2023,
author = {Ingraham, John B. and Baranov, Max and Costello, Zak and Barber, Karl W. and Wang, Wujie and Ismail, Ahmed and Frappier, Vincent and Lord, Dana M. and Ng-Thow-Hing, Christopher and Van Vlack, Erik R. and Tie, Shan and Xue, Vincent and Cowles, Sarah C. and Leung, Alan and Rodrigues, Jo\~{a}o V. and Morales-Perez, Claudio L. and Ayoub, Alex M. and Green, Robin and Puentes, Katherine and Oplinger, Frank and Panwar, Nishant V. and Obermeyer, Fritz and Root, Adam R. and Beam, Andrew L. and Poelwijk, Frank J. and Grigoryan, Gevorg},
journal = {Nature},
title = {Illuminating protein space with a programmable generative model},
year = {2023},
volume = {},
number = {},
pages = {},
doi = {10.1038/s41586-023-06728-8}
}
The Chroma codebase is the work of many contributors at Generate Biomedicines. We would like to acknowledge: Ahmed Ismail, Alan Witmer, Alex Ramos, Alexander Bock, Ameya Harmalkar, Brinda Monian, Craig Mackenzie, Dan Luu, David Moore, Frank Oplinger, Fritz Obermeyer, George Kent-Scheller, Gevorg Grigoryan, Jacob Feala, James Lucas, Jenhan Tao, John Ingraham, Martin Jankowiak, Max Baranov, Meghan Franklin, Mick Ward, Rudraksh Tuwani, Ryan Nelson, Shan Tie, Vincent Frappier, Vincent Xue, William Wolfe-McGuire, Wujie Wang, Zak Costello, Zander Harteveld.
Copyright Generate Biomedicines, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this code except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. See the License for the specific language governing permissions and limitations under the License.
Chroma weights are freely available to academic researchers and non-profit entities who accept and agree to be bound under the terms of the Chroma Parameters License. Please visit the weights download page for more information. If you are not eligible to use the Chroma Parameters under the terms of the provided License or if you would like to share the Chroma Parameters and/or otherwise use the Chroma Parameters beyond the scope of the rights granted in the License (including for commercial purposes), you may contact the Licensor at: licensing@generatebiomedicines.com.