A tool for comparing apples to oranges.
- Python 3.7 or higher (>=3.10 recommended)
- NumPy
- MDAnalysis
- Biopython (already a requirement for MDAnalysis)
Either install them manually, or follow the instructions below to let pip take care of that for you.
Just run
pip install .
in the project root. This will install all dependencies and install the package. The command line tool should now be runnable as
apples2apples [command] [options]
Alternatively, you can make manually sure the requirements are installed, and then just run the tool without instalilng as
/path/to/project/apples2apples/apples2apples [command] [options]
To start off, you'll need (at least) two structure files for the sequences to compare. If you want to make trajectories of the matching atomsgouprs, fitted for optimal tranlation and rotation, you should of course also have the original trajectories.
Next, you'll need to write an input file, which tells the program what to do. To get an example of the input syntax, you can run
apples2apples input_syntax
which will print a basic example to stdout.
Let's take a look at a simple two-chain comparison, of the systems included in input_files
. Here is the contents of the example input included in the project root:
input_pdb | input_xtc | selection | output_ndx | output_traj
input_files/ampa.pdb | input_files/ampa.xtc | chainID A | output_files/ampa.ndx | output_files/ampa_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID B | output_files/nmda.ndx | output_files/nmda_fitted.xtc
All lines starting with a #
are considered comments and those line, as well as empty lines are ignored. After these, the first line with content is considered to be the column titles. Column order is not important, as each column is identified by its name. Columns are separated by the pip sign. For the sequence alignment, the input_pdb
and output_ndx
are the only required fields. If the selection is not give, it is assumed to be "all"
. The output_traj
is not used for the first step, only for the spatial fitting. Another optional field that could be added is the output_pdb
, where the resulting structure of the selection after alignments would be written. The input_xtc
is not used for this command yet.
The tool can be run as
apples2apples align -i infile.txt -t Temp
where the compulsory command line option -i
specifies the input file and the optional option -t
the "temporary" directory, where the fasta files wil be written. If the temporary directory is not given, they will be written in the current directory.
For each line after the column titles, the program reads in the structure from input_pdb
and makes a selection using the selection string in the selection
column. These should be valid MDAnalysis selection strings. The selections are converted to aminoacid sequences and written to unaligned.fasta
. Alignment is run on the fasta file, and the result is stored in aligned.fasta
. After this the program reads the aligned sequences and constructs selections, where matching residues are chosen as a whole, and nonmatching ones either just the alpha carbons or the backbone. The selections are written to output_ndx
and optionally to output_pdb
.
The next logical step is to extract the selections form the original tarjectories and fit them for optimal RMSD. The same input file as before can be reused, where also the input_xtc
column is now compulsory. The selection
and output_pdb
are ingored. Note, that here the output_ndx
is actually used as input, but is named as such, to allow reusing the input file from the previous command.
The next command can be run as
apples2apples fit -i infile.txt
The program reads in the input_pdb
and selections in output_ndx
and, using the first frame of the first structure as the reference, translates and rotates the structures to minimise RMSD. The trajectories in input_xtc
are then read and rewritten to output_traj
. Also a pdb and index file is made for each new trajectory, with the suffix changed to .ndx
or .pdb
(the reason for this is outlined below in the advanced usage part).
Reading in one chain per system is as simple as seen above, even for more than two systems. A bit more copmlex situation arises when we want to extract multiple chains from single trajectories. For example we can the the AMPA and NMDA receptors, which both have four chains. We want to make the sequence alignment separately for each chain, but wan the final fitting and the structures as single files. This is why the fit
command considers any chains coming from the same pdb file as being from the same system. The selection that is fitted and written to output_traj
is made by concatenating the selection from the different chains in the order they are in the input file.
For example let's consider this input file
input_pdb | input_xtc | selection | output_ndx | output_traj
input_files/ampa.pdb | input_files/ampa.xtc | chainID A | output_files/ampa_A.ndx | output_files/ampa_fitted.xtc
input_files/ampa.pdb | input_files/ampa.xtc | chainID B | output_files/ampa_B.ndx | output_files/ampa_fitted.xtc
input_files/ampa.pdb | input_files/ampa.xtc | chainID C | output_files/ampa_C.ndx | output_files/ampa_fitted.xtc
input_files/ampa.pdb | input_files/ampa.xtc | chainID D | output_files/ampa_D.ndx | output_files/ampa_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID B | output_files/nmda_B.ndx | output_files/nmda_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID C | output_files/nmda_C.ndx | output_files/nmda_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID D | output_files/nmda_D.ndx | output_files/nmda_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID A | output_files/nmda_A.ndx | output_files/nmda_fitted.xtc
The alignment will be made separately on each chain, so that all chains will share the same number of atoms. However in the fitting command all four AMPA receptor chains are considered as a part of a single system (meaning that only the last input_xtc
and output_traj
are actually used) and the output is made by concateneating the chains in ABCD order. Similarily for NMDA they are considered to be the same as the input_pdb
are the same string, but they are read in as BCDA order.
To do the alignment separately different chains, we sould need separate input files for the laignment and fitting. For example in the above case, the AMPAR has four identical chains, while NMDAR has A and C chains matching, as well as B and D. As such we would normally want to do alignment of the two different NMDAR chains separately as we are only interested in their similarities to AMPAR, not between themselves. For the alignment we would have two input files, the first one as follows:
input_pdb | input_xtc | selection | output_ndx | output_traj
input_files/ampa.pdb | input_files/ampa.xtc | chainID A | output_files/ampa_A.ndx | output_files/ampa_fitted.xtc
input_files/ampa.pdb | input_files/ampa.xtc | chainID C | output_files/ampa_C.ndx | output_files/ampa_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID B | output_files/nmda_B.ndx | output_files/nmda_fitted.xtc
input_files/nmda.pdb | input_files/nmda.xtc | chainID D | output_files/nmda_D.ndx | output_files/nmda_fitted.xtc
And similarily for the two other pairs of chains. For the fitting we could use the same input as before.