Author: Vaidyanathan M. Sethuraman.
ver_1.2: Nov-05-2021
If you find this software useful, please cite it in your work as Sethuraman et al., Atomistic Simulations of Polydisperse Lignin Melts Using Simple Polydisperse Residue Input Generator, Biomacromolecules (2023) Citation.
A Python/FORTRAN-90 based code to generate random initial structures for polydisperse monomers/residues according to Schulz-Zimm distribution or from an experimental molecular weight distribution.
In conjunction with LigninBuilder, users can generate
generate starting .psf
, .pdb
and .top
files for
running with NAMD or
GROMACS software.
Although this code was primarily designed to generate input files for various types of biomass (Lignin/Carbohydrates), it can be used with any protein or polymer complex with the forcefield files.
This code works on a combination of Python
and FORTRAN90
platforms. Python3.0+
and ifort
/gfortran
compilers are required.
Dowloading and unzipping the directory can be done using:
git clone https://github.com/vaidyanathanms/SPRInG_PolydispersePolymerBuilder.git
The above commands should generate a folder of the name:
- generic_builder
Navigate to generic_builder
for generating a structure from scratch
using
cd generic_builder
. Inside generic_builder
folder, you should see the following python
and FORTRAN files:
- make_genpsf.py
- genconf.py
- pdi_gen.f90
- pdi_dist_params.f90
If the files are present, you are set to generate a new structure.
There are two ways to generate initial structure. You can either copy
the four files above to a new folder or run from the directory
generic_builder
.
inputsforpsfgen.inp
is a sample input file containing all the input
keywords to SPRIG. We will look at the keywords in detail in the SPRIG
Keywords section. For running SPRIG, use:
python genconf.py <filename>
where <filename>
is the name of the input file to genconf.py
.
If this generates a folder with casenum_ID
and sub-folder
all_tclfiles
(within casenum_ID
) which contain a number
of tcl
files, you are all set. Here, ID refers to an integer value
given as input (see SPRIG Keywords).
Several things can go wrong including compiler compatibilities and incompatible input constraints (see SPRIG Keywords). If you find an error, please report to Vaidyanathan M. Sethuraman.
Outputs from SPRIG can be directly fed into LigninBuilder to
generate the input structure for GROMACS
using the following
three steps. Make sure to follow the order.
-
Step 1: If SPRIG ran correctly, users should see a folder
casenum_ID
, whereID
is an integer value given as input to SPRIG (see SPRIG Keywords). Navigate to this directory usingcd <casenum_ID>
Inside the folder users should see several files viz.,
- step1.tcl
- step2.tcl
- patchlist_ID.tcl, reslist_ID.tcl
- log_ID.txt
- step3.tcl (Optional)
where
ID
corresponds to thecasenum_ID
. The directory will also contain the user specified input files for the residue/patch probabilities. If all of these files are present, Step 1 is complete. -
Step 2: Execute the following from command terminal:
vmd -dispdev text -e step1.tcl
Make sure the path to
vmd
is added to$BIN
or is given correctly.Alternatively, users can open
VMD
and openTk Console
and issuesource step1.tcl
If either of the commands run smoothly, this should generate
psf
files for each chain structure. This should also generatepdb
files corresponding to thepsf
files within the directory. This requiresLigninBuilder
to be added in~\.vmdrc
(seeLigninBuilder
on how to do this).Following the generation of
psf
`pdb` files, from command terminal, issuevmd -dispdev text -e step2.tcl
OR
from
Tk Console
inVMD
issuesource step2.tcl
This should provide output files of the form
inpfile_nch_N.psf
,inpfile_nch_N.pdb
where inpfile is the name of the input system (see SPRIG Keywords) and N corresponds to the number of chains in the system. If LigninBuilder flag is ON, it should also generate an outputtop
file of the forminpfile_nch_N.top
. Make sure theprm
file is in the folder (casenum_ID) for generatingtop
files.If the
pdb
file(s) is (are) not generated, please see the input constraints. Most likely a particular residue (or patch) is incompatible. Please make sure thatLigninBuilder
is added to~\.vmdrc
before executing this command. -
Step 3: If the intent of the user is to generate an initial structure which do not have overlapping atoms, then the
minimizestructure
module inLigninBuilder
is necessary. To this end, issue from command line,vmd -dispdev text -e step3.tcl
OR
from
Tk Console
inVMD
issuesource step3.tcl
For some cases, the codes would require to use
findmissingterms.py
in LigninBuilder (seeLigninBuilder
module for more details). If the run is succesful, this will generate apdb
andpsf
file which are compatible withGROMACS
/NAMD
NOTE: It is the user's responsibility to check whether the
parameters match the atomnames (atomtypes) in the psf
/pdb
files.
In this section, we look at the different keywords that are needed to generate a polydisperse input structure.
- Some keywords are optional and are prefixed with (Optional) while introducing the keyword.
- A space/tab should be present between the keywords and arguments or between arguments.
- A new line can start with an optional
#
. These lines will be ignored. However,#
cannot be used in the middle of a line. - The input file name (and file names used as arguments) cannot be
specified as
None
ornone
. These are reserved keywords within the program.
-
case_num (Optional)
All input files can start with an optional
case_num
keyword. If this is used as a keyword, it should be the first keyword in the input file. Usage:case_num caseID
caseID
should be a positive integer. This will create a folder of the namecasenum_caseID
where all the output files will be present. Default value forcaseID
is 1.Example:
case_num 1
-
biomass_type
This is a mandatory keyword and corresponds to the prefix for output file. Usage:
biomass_type argname
The final tcl files generated will be of the form argname_1_nch.tcl
, where
nch` corresponds to number of chains in the system.Example:
biomass_type switchgrass
-
num_resids
This is a mandatory keyword and corresponds to the average number of residues per chain (segment). Usage:
num_resids nres
where
nres
corresponds to the average number of residues per chain (segments or monomers per chain). Should be an integer value.Example:
num_resids 20
-
num_chains
Mandatory keyword corresponding to the number of chains in the system. Usage:
num_chains nch
where
nch
corresponds to the number of chains in the system (integer value).Example:
num_chains 10
-
disperse (Optional)
This keyword dictates the polydispersity of the system. If this option is not provided, chains are assumed as monodisperse by default. If the option is provided, chains will be drawn from a Schulz-Zimm distribution. There are two options (and suboptions) for this case. Usage:
disperse maketype optkeywords optargs
maketype
can be eitherSZTHEORY
,EXPTDATA
orREADDATA
.SZTHEORY
generates a set of polydisperse chains using theoretical Schulz-Zimm distribution (see below for options).EXPTDATA
generates a set of polydisperse chains according to the experimental data (curves) for the molecular weight distribution of chains (see below).READDATA
reads a file containing the molecular weights (degree of polymerization) of all the chains from a file (see below for format). Arguments for each option are elaborated below.Examples:
disperse SZTHEORY 1.50 polydisp.inp 10000 disperse sztheory 1.50 polydisp.inp 1000 pditol 8.0 mwrange 20 disperse readdata molwtdata.dat disperse exptdata exptdata.dat mwmonomer 180 ntrials 20 disperse EXPTDATA WTdata.dat mwmonomer 200 ntrials 1000 pditol 3
-
SZTHEORY
For this case a new file will be generated according to the polydispersity value and the number of chains/number of residues per chain using a Schulz-Zimm distribution. Usage for this option is as follows:disperse sztheory PDIval Outputfile ntrials pditol tolerance distrange rangeval
PDIval
andOutputfile
corresponds to the target polydispersity value and the output file containing the molecular weights (degree of polymerization) of each chain.PDIval
is the target dispersity index and is defined as the ratio between the weight average molecular weight and number average molecular weight. This number *should be greater than 1.0.Outputfile
is the name of the file that is generated where the degree of polymerization of each chain is written out.ntrials
correspond to the number of trials the program attempts to generate the polymer chains within the PDI tolerance limit. Values between 1000-10000 should be enough for most cases.With the pditol keyword, user can also specify an optional tolerance value (0-100). This corresponds to the maximum relative error (in %) between the target PDI value and the simulated PDI. Different combinations will be tried to obtain either the target PDI value of the system. Default value is 5. For all practical purposes values between 5 and 15 yield good output distribution if the number of chains in the system is less than 20. This keyword and the corresponding argument is optional.
With the distrange keyword, user can specify the maximum range of molecular weights that will be used to create the Schulz-Zimmm distribution. Plot the theoretical curve to see where the distribution tapers to zero. Most likely the default value of 5 would suffice. This keyword and the corresponding argument is optional.
If this option is used, after running the program, a file with the name 'geninp_pdistruct.txt' will be generated and it will contain the details of the inputs.
-
EXPTDATA
With this option, users can input the molecular weight distribution obtained from experiments. The program then uses this molecular weight distribution data to generate chains for simulations. Usage:disperse EXPTDATA inpfilename mwmonomer 200 ntrials 1000 pditol 3
where
inpfilename
corresponds to the molecular weight distribution data file. Theinpfilename
should contain only two columns of data. Further, one of the columns should have a headermolwt
which corresponds to the molecular weights of the sample distribution.For the second column, the code accepts one of the following three options:
wlogmw
,wmw
, orpmw
.wlogmw
corresponds to the distribution data (w(logM) = dm/dlog(M)); where m is the total mass and M is the molecular weight of the chains.wmw
corresponds to the distribution data (w(M) = dm/dM = dm/(M*dlog(M)) = w(logM)/M).wmw
is the weight averaged probability distribution.pmw
corresponds to the number averaged probability distribution and is related to the other quantities through, p(M) = w(M)/M = w(logM)/$M^2$Users should input only one of the above three options for the distribution. Most likely, experimental data are reported in w(logM), wheras other options are more common in theory/computation literature.
Keywords
mwmonomer
,ntrials
andpditol
are optional.mwmonomer
corresponds to the average molecular weight of one monomer in g/mol. Default value is 200 g/mol.ntrials
corresponds to the number of attempts, random samples are drawn from the experimental distribution before both the average number molecular mass and the PDI converges to a tolerance ofpditol
of the experimental distribution.pditol
is the relative tolerance in % (between 0 and 100%). Default values forntrials
andpditol
are 100000 and 5%, respectively. -
READDATA
Users can also specify a file where the degree of polymerization of each chain is specified. In this case, the program will directly read this file and create the segments. Usage:disperse READDATA inpfilename
where
inpfilename
is the name of the file. Theinpfilename
should have the following structure. First line *should have the following structure:num_chains nchains
where nchains correspond to the number of chains in the system. This should be consistent with the
num_chains
in the input file used to rungenconf.py
. The nextn
lines should correspond to the degree of polymerization of then
different chains.
-
-
top_ipfile
Mandatory keyword and the argument corresponds to the path to the topology file. It is the user's responsibility to check whether the residues generated have their monomer structure in the topology file. User should also provide the full path to the topology file. Default assumption is that the file is present in the path from which
genconf.py
is called. Usage:top_ipfile filename
Example:
top_ipfile top_lignin.top
-
resid_inp
Mandatory keyword and the argument corresponds to the average probability of each residue in the system. It should be provided in a file with each line corresponding to the residue name and the average probability. Users should make sure that the residue name matches with the residue name in the topology file. Usage:
resid_inp filename
Example for formatting filename:
SYR 0.4 TRCN 0.05 GUAI 0.3 PCA 0.15 FERUT 0.1
NOTE: The sum of the probabilities need not be one. Code internally makes the sum to be one. However, a warning will be issued if the sum is not one. The inputs should contain the details for the branch (graft) monomers or else the code will not recongnize any branch monomer.
-
patch_inp
Mandatory keyword and the argument corresponds to the average probability of each patch in the system. It should be provided in a file with each line corresponding to the residue name and the average probability. Users should make sure that the patch name matches with the patch name in the topology file. Usage:
resid_inp filename
Example for formatting filename:
BO4 0.8 B5 0.1 BB 0.05 AO4 0.05
NOTES:
-
The sum of the probabilities need not be one. Code internally makes the sum to be one. However, a warning will be issued if the sum is not one.
-
The inputs should NOT contain the details for the branch (graft) patches. DO NOT provide patch details for branch monomers here. This is different from inputting residues where the name of the branched residue should be present.
-
If you are using SPRIG with LigninBuilder, please be aware that residues for which there exists equal probability for the tacticities (e.g. BO4R and BO4L for BO4), DO NOT give separate probabilities for each stereoisomer. Provide the overall probability. LigninBuilder will make sure that all the stereoisomers have equal probability.
-
-
seg_name
Mandatory keyword which corresponds to the name of the segment. Except for one case (see NOTE below), this will serve as the prefix for segment names for different chains in the
psf
file output. Usage:seg_name argname
The output
psf
name will have segment names of the formargname_chainID
wherechainID
is an integer varying from 1 to number of chains in the system.NOTE: In case, an input PDB file is given to generate the PDB file using
genconf.py
and NOT LigninBuilder, users must make sure that the segment name matches the segment name in the input PDB file that is used to generate the initial guesses for the initial coordinates (ICs). -
op_style (Optional)
Keyword dictating the output style. There are two argument options --
single
andmulti
. Usage:op_style single op_style multi 4
For the argument
single
, a single ouputtcl
file will be generated per chain. On running this withVMD
orLigninBuilder
, you can produce the final structure. However, the final structure may have unphysical bonds. It is the user's responsibility to check this. If the user is combining this code withLigninBuilder
please usesingle
option sinceLigninBuilder
has capabilities to untangle unphysical bonds.For the argument
multi
, the program breaks down thetcl
files into smallertcl
files. This requires an extra integer argument. Let us say that we are creating a polymer with degree of polymerization 20 and the extra argument is 4, thetcl
file corresponding to this chain will have 5 different builds. First, the first four segments of the chain are built. Then NAMD is called to minimize the structure. The minimized structure is then used as an input to generate the next 4 residues -- so on and so forth. This requiresNAMD
path to be added correctly or else runningtcl
file may encounter errors.NOTE: Use the option
single
withLigninBuilder
. Default issingle
. -
branching (Optional)
To define branching of main chain. Branches are single monomer long in the current mode. The program can manage multiple types of branches. Usage:
branching 1 branch1 patch1 branch2 patch2 ...
The keyword
branching
should be followed by an integer 1 or 0. 1 corresponds to turning on the branch and 0 corresponds to no branch. This gives the user to toggle between branched and non-branched system easily.The branch1/patch1 pair corresponds to the name of the branch residue and the patch connecting the branch with the backbone. The inputs should always be given as pairs. Users must make sure that the residue names are already present in the input list for residues.
Example:
branching 1 PCA GOG FERUT GOG
NOTE: Since, by construction, the number of patches equal to the number of residues in the system, the final probabilities for the patch values may not reflect the input values.
-
nattempts (Optional)
Number of attempts to achieve a random configuration that corresponds to the input probabilities for residues and patches. Each time a better target configuration (smaller residual error) is found, the program saves that configuration. In case, the target residual error is not met within
nattempts
, the best configuration along with the residual error will be generated as output. Usage:nattempts intvalue
where intvalue is the number of attempts. A value between 50 and 200 for an average degree of polymerization of 30 works generates a target onfiguration in a few minutes.
Example:
nattempts 60
Default value for
nattempts
is 50. -
tol (Optional)
Relative tolerance between the input probabilities for residues/patches and averaged output values for residues/patches. L2norm is used to calculate the relative error. Program runs until the error is less than tolerance value or the number of attempts exceeds
nattempts
. Usage:tol tolval
where
tolval
is a number between 0 and 1. Nominal values are between 0.05 and 0.15. Default value is 0.1. -
terminator (Optional)
Use this option to make sure that certain fraction of the chains end with this type of residue. This residue should be present in the input residue list. The final fraction of this residue will have an average probability closer to the value input to the program. Usage:
terminator resname
where
resname
is the name of the residue.Example
terminator TRCN
-
LigninBuilder (Optional)
Use this option to generate output files that can be used in conjunction with
LigninBuilder
software to generate*.top
files forGROMACS
software. Usage:LigninBuilder filename
where
filename
corresponds to the file that contains the details of the potentials. Usually this will have a.prm
extension. Please make sure to provide the full path to the fil unless it is in same directory asgenconf.py
.Example:
LigninBuilder par_lignin.prm
-
clean_directories (Optional)
Use this option to clean the existing output directory (
casenum_ID
) and replace with new files. Usage:clean_directories Y
Arguments can be
Y
(yes) orN
(no). Default isN
.WARNING: All files will be deleted before the new output files are written into the directory.
-
patch_patch_constraint (Optional)
Optional argument to specify the constraints between adjacent patches. This is useful to let the program know that patches (linkers) cannot be next to each other. For instance, a patch (linker) of type
$\beta$ -5 cannot be followed by a 55 patch (linker) since the 5th position is already occupied. The constraints need to be specified in a separate file. Usage:patch_patch_constraint <filename>
where
<filename>
contains the patch and the constraints. A sample example of this file will be as follows:55 55 5B BB BB GOG B5 BO4 B5 55 GOG BB BO4 O4B 4O5
The first entry of each row should be the first of the two consecutive patches. Next 'n' entries of the row should contain all the patches that are incompatible with the first entry.
Rules of making a new row is as follows: first element in every row corresponds to the patch between residues
i
andi+1
whereas the other elements in the rows correspond to the EXCLUDED patch values between residuei+1
and residuei+2
. For instance, let us say that aBO4
patch between residues 1 and 2, this DOES NOT preclude having the patch BO4 between 2 and 3. Therefore, for the row ofBO4
, there should not be BO4 in the EXCLUDED values. In other words, the number of occurences ofBO4
for the row ofBO4
should be exactly 1. Now, for a different case, let us say that a patchBB
is present between residues 1 and 2. Therefore, the$\beta$ position of residue 2 is filled and we CANNOT have a patch which starts with "\beta" such as "BB" or "B5" or "BO4". In this case BB should be repeated twice (including the first column of that row).Finally, if there are branches (grafts) present, one needs to take extra care to include the exclusions.
NOTES
-
Each row is independent. In other words, if
pat_2
is incompatible withpat_1
does not mean thatpat_1
is incompatible withpat_2
. This incompatibility (if necessary) should be specified in separate rows. -
The patch names should exactly match the names of the patches in the input patch list given using
patch_inp
. If the patch name in the input does not match what is given in the list, it will be ignored. However, it is OK for this file to have the name of the patches that is not present in the input given usingpatch_inp
. -
Although this is an optional argument, almost always there will be restrictions for real sytems. It is the user's responsiblity to make sure that all the constraints are given to the system.
-
-
patch_res_constraint (Optional)
Optional argument to specify the constraints between a patch (linker) and a residue. This is useful to let the program know that certain patches cannot succeed (or precede) a certain residue. For instance, a
55
patch can neither precede nor succeed a syringyl residue. However,405
can precede a syringyl residue whereas it cannot succeed a syringyl residue. Similar to thepatch_patch_constraint
input, the incompatibility data should be input as a file. Usagepatch_res_constraint <filename>
where
<filename>
corresponds to the name of the file comprising the incompatibilities. There are certain keywords that should be present in this file. Keywords are case-sensitive.Example:
patch restrict_before restrict_after 55 SYR SYR B5 None SYR 55 TRCN TRCN B5 TRCN TRCN
NOTES
-
The first line of this file should have the following keywords:
patch
,restrict_before
andrestrict_after
. -
Each row should contain only 3 entries. The first entry is the name of the patch. The entry under
restrict_before
should correspond to the restriction that the patch cannot precede a residue and the entry belowrestrict_after
should correspond to the restriction that the patch cannot succeed a residue. In the above example55
can neither precede nor succeed aSYR
residue. SimilarlyB5
can neither precede nor succeed aTRCN
residue. However,B5
can come beforeSYR
where as it cannot come afterSYR
. -
Use
None
keyword (case-sensitive), if a patch can precede (but not succeed) or viceversa a given residue. By default, there are no restrictions. In other words, for all the patches mentioned in the input will haveNone
as default argument for incompatibility. -
If a patch cannot come before (or after) two residues, it should be specified in two separate lines. For instance, in the example above, patch
55
should be repeated twice for the program to know that it has incompatibility with bothSYR
andTRCN
. -
Although this is an optional argument, almost always there will be restrictions for real sytems. It is the user's responsiblity to make sure that all the constraints are given to the system.
-
-
gen_packmol (to be added)
-
namd_inp (to be added)
-
pdb_ipfile (to be added)