/beast2-xml

Python code to generate BEAST2 XML files

Primary LanguagePythonMIT LicenseMIT

BEAST2 XML

This is a first, and very simplistic, cut at generating BEAST2 XML from Python (2.7, and 3.5 to 3.10 are all known to work).

BEAST2 is a complex program and so is its input XML. People normally generate the input XML using a GUI tool, BEAUti. BEAUti is also a complex tool, and the XML it generates can vary widely. Because BEAUti is a GUI tool it's not possible to use it to programmatically generate XML.

I wrote beast2-xml because I wanted a way to quickly and easily generate BEAST2 XML files, from the command line and also from my Python code.

There are a lot of things that could be added to this code! Contributions welcome.

The package provides

  • A command-line script (bin/beast2-xml.py) to generate BEAST2 XML files.
  • A simplistic Python class (in beast2xml/beast2.py) that may be helpful if you are writing Python that needs to generate BEAST2 XML files. (This Python class is of course used by the command line script.)

Installation

$ pip install beast2-xml

You can also get the source from PyPI or on Github.

Generate XML from the command line

You can use bin/beast2-xml.py to quickly generate BEAST2 XML. You must provide the sequences for the analysis (as FASTA or FASTQ), either on standard input or using the --fastaFile option.

Run beast2-xml.py --help to see currently supported options:

$ beast2-xml.py --help
usage: beast2-xml.py [-h] [--clockModel MODEL | --templateFile FILENAME]
                     [--chainLength LENGTH] [--age ID=N [ID=N ...]]
                     [--defaultAge N] [--dateUnit UNIT]
                     [--dateDirection DIRECTION]
                     [--logFileBasename BASE-FILENAME] [--traceLogEvery N]
                     [--treeLogEvery N] [--screenLogEvery N] [--mimicBEAUti]
                     [--sequenceIdDateRegex REGEX]
                     [--sequenceIdAgeRegex REGEX]
                     [--sequenceIdRegexMayNotMatch] [--fastaFile FILENAME]
                     [--readClass CLASSNAME] [--fasta | --fastq | --fasta-ss]

Given FASTA on stdin (or in a file via the --fastaFile option), write an XML
BEAST2 input file on stdout.

optional arguments:
  -h, --help            show this help message and exit
  --clockModel MODEL    Specify the clock model. Possible values are 'random-
                        local', 'relaxed-exponential', 'relaxed-lognormal', or
                        'strict' (default: strict)
  --templateFile FILENAME
                        The XML template file to use. (default: None)
  --chainLength LENGTH  The MCMC chain length. (default: None)
  --age ID=N [ID=N ...]
                        The age of a sequence. The format is a sequence id, an
                        equals sign, then the age. For convenience, just the
                        first part of a full sequence id (i.e., up to the
                        first space) may be given. May be specified multiple
                        times. (default: None)
  --defaultAge N        The age to use for sequences that are not explicitly
                        given an age via --age. (default: 0.0)
  --dateUnit UNIT       Specify the date unit. Possible values are 'day',
                        'month', or 'year'. (default: year)
  --dateDirection DIRECTION
                        Specify whether dates are back in time from the
                        present or forward in time from some point in the
                        past. Possible values are 'forward' or 'backward'.
                        (default: backward)
  --logFileBasename BASE-FILENAME
                        The base filename to write logs to. A ".log" or
                        ".trees" suffix will be appended to this to make
                        complete log file names. (default: beast-output)
  --traceLogEvery N     How often to write to the trace log file. (default:
                        2000)
  --treeLogEvery N      How often to write to the tree log file. (default:
                        2000)
  --screenLogEvery N    How often to write logging to the screen (i.e.,
                        terminal). (default: 2000)
  --mimicBEAUti         If specified, add attributes to the <beast> tag that
                        mimic what BEAUti uses so that BEAUti will be able to
                        load the XML. (default: False)


  --sequenceIdDateRegex REGEX
                        A regular expression that will be used to capture sequence
                        dates from their ids. The regular expression must have three
                        named capture regions ("year", "month", and "day"). Regular
                        expression matching is anchored to the start of the id string
                        (i.e., Python's re.match function is used, not the re.search
                        function), so you must explicitly match the id from its beginning.
                        For example, you might use
                        --sequenceIdDateRegex '^.*_(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)'.
                        (default: None)
  --sequenceIdAgeRegex REGEX
                        A regular expression that will be used to capture sequence ages
                        from their ids. The regular expression must have a single capture
                        region. Regular expression matching is anchored to the start of
                        the id string (i.e., Python's re.match function is used, not the
                        re.search function), so you must explicitly match the id from its
                        beginning. For example, you might use --sequenceIdAgeRegex '^.*_(\d+)$'
                        to capture an age preceded by an underscore at the very end of the
                        sequence id. If --sequenceIdDateRegex is also given, it
                        takes precedence when matching sequence ids. (default: None)
  --sequenceIdRegexMayNotMatch
                        If specified (and --sequenceIdDateRegex or --sequenceIdAgeRegex is given)
                        it will not be considered an error if a sequence id does not
                        match the given regular expression. In that case, sequences will be assigned
                        an age of zero unless one is given via --age. (default: False)
  --fastaFile FILENAME  The name of the FASTA input file. Standard input will
                        be read if no file name is given.
  --readClass CLASSNAME
                        If specified, give the type of the reads in the input.
                        Possible choices: SSAARead, DNARead, TranslatedRead,
                        RNARead, SSAAReadWithX, AAReadORF, Read, AARead,
                        AAReadWithX. (default: DNARead)
  --fasta               If specified, input will be treated as FASTA. This is
                        the default. (default: False)
  --fastq               If specified, input will be treated as FASTQ.
                        (default: False)
  --fasta-ss            If specified, input will be treated as PDB FASTA
                        (i.e., regular FASTA with each sequence followed by
                        its structure). (default: False)

As mentioned, this is extremely simplistic. If you need to generate more complex XML, you can pass in a template file using --templateFile. Your template will need to have a high-level structure that's similar to those produced by BEAUti, otherwise the various command-line options for manipulating the template wont find what they need (you'll see an error message in this case).

If you don't pass a template file name, a default will be chosen based on the clock model (strict by default). The default templates all come from BEAUti, so if you generate a template yourself using BEAUti, you can almost certainly pass it to beast2-xml.py to use as a basis to create variants from.

Note that the generated XML contains just the first part of sequence ids in the given FASTA input. I'm not sure if this is a requirement, but it's what BEAUti does and so I have done the same.

Generate BEAST2 XML in Python

If you want to create BEAST2 XML from your own Python, you can use the BEAST2XML class defined in beast2xml/beast2.py.

The simplest possible usage is

from beast2xml import BEAST2XML

print(BEAST2XML().toString())

There are several options you can pass to the BEAST2XML constructor:

class BEAST2XML(object):
    """
    Create BEAST2 XML.

    @param template: A C{str} filename or an open file pointer to read the
        XML template from. If C{None}, a template based on C{clockModel}
        will be used.
    @param clockModel: A C{str} specifying the clock model. Possible values
        are 'random-local', 'relaxed-exponential', 'relaxed-lognormal',
        and 'strict.
    @param sequenceIdDateRegex: If not C{None}, gives a C{str} regular
        expression that will be used to capture sequence dates from their ids.
        See the explanation in ../bin/beast2-xml.py
    @param sequenceIdAgeRegex: If not C{None}, gives a C{str} regular
        expression that will be used to capture sequence ages from their ids.
        See the explanation in ../bin/beast2-xml.py
    @param sequenceIdRegexMustMatch: If C{True} it will be considered an error
        if a sequence id does not match the regular expression given by
        C{sequenceIdDateRegex} or C{sequenceIdAgeRegex}.
    """

and options you can pass to its toString method:

def toString(self, chainLength=None, defaultAge=0.0, dateUnit='year',
             dateDirection='backward', logFileBasename=None,
             traceLogEvery=None, treeLogEvery=None, screenLogEvery=None,
             transformFunc=None, mimicBEAUti=False):
    """
    @param chainLength: The C{int} length of the MCMC chain. If C{None},
        the value in the template will be retained.
    @param defaultAge: The C{float} age to use for sequences that are not
        explicitly given an age via C{addAge}.
    @param dateDirection: A C{str}, either 'backward' or 'forward'
        indicating whether dates are back in time from the present or
        forward in time from some point in the past.
    @param logFileBasename: The C{str} The base filename to write logs to.
        A .log or .trees suffix will be appended to this to make the
        actual log file names.  If C{None}, the log file names in the
        template will be retained.
    @param traceLogEvery: An C{int} specifying how often to write to the
        trace log file. If C{None}, the value in the template will be
        retained.
    @param treeLogEvery: An C{int} specifying how often to write to the
        tree log file. If C{None}, the value in the template will be
        retained.
    @param screenLogEvery: An C{int} specifying how often to write to the
        terminal (screen) log. If C{None}, the value in the template will
        be retained.
    @param transformFunc: If not C{None} A callable that will be passed
        the C{ElementTree} instance and which must return an C{ElementTree}
        instance.
    @param mimicBEAUti: If C{True}, add attributes to the <beast> tag
        in the way that BEAUti does, to allow BEAUti to load the XML we
        produce.
    @raise ValueError: If any required tree elements cannot be found
        (raised by our call to self.findElements).
    @return: C{str} XML.
    """

An example of using the Python class can be found in the beast2-xml.py script. Small examples showing all functionality can be found in the tests in test/testBeast2.py.

Development

To run the tests:

$ make check

or if you have Twisted installed, you can use its trial test runner, via

$ make tcheck

You can also use

$ tox

to run tests for various versions of Python.