PhyloAcc test data

This repository contains a small test dataset to run with PhyloAcc after installing.

The test dataset is based on the ratite dataset from Hu et al. (2019) and consists of 500 simulated genomic elements, each 200 bp long, with varying rates.
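As a quick sanity check of that description, the sketch below counts the elements in a BED file and tallies their lengths. It builds a tiny stand-in file so it can be run anywhere; the toy file and its fourth (ID) column are hypothetical, and the only assumption about layout is the standard BED convention that columns 2 and 3 hold start and end coordinates. If simu_500_200_diffr_2-1.bed follows that convention, pointing the same awk command at it should report 500 elements of length 200.

```shell
# Create a toy BED file (hypothetical stand-in): chrom, start, end, element ID
tmp_bed=$(mktemp)
printf 'chr1\t0\t200\tid-0\nchr1\t200\t400\tid-1\nchr1\t400\t600\tid-2\n' > "$tmp_bed"

# Count elements and tally how many elements have each length (end - start)
summary=$(awk -F'\t' '{ n++; len[$3 - $2]++ }
    END { for (l in len) printf "%d elements, %d of length %d bp\n", n, len[l], l }' "$tmp_bed")

echo "$summary"
rm -f "$tmp_bed"
```

If the elements were not all the same length, the command would print one line per distinct length.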

To use this test repository, click the green Code button above and copy your preferred link. Then clone the repository from your shell:

git clone [copied URL from green Code button]

Click here to see a breakdown of the above command

Command line parameter	Description
git	This calls the main git program from the command line
clone	The sub-program of git used to copy repositories from the web to your local machine
[copied URL from green Code button]	The URL of the repository you wish to copy

Then enter that directory:

cd PhyloAcc-test-data

Click here to see a breakdown of the above command

Command line parameter	Description
cd	The Linux change directory command
PhyloAcc-test-data	The path to the directory you want to enter

Test commands

There are two ways to run PhyloAcc: through the interface with phyloacc.py [options], or directly with PhyloAcc-ST [config file]. We strongly recommend using the interface, which handles batching with snakemake and greatly increases efficiency on large datasets.

To test batching with the interface

First, run the interface to generate batches and snakemake files:

phyloacc.py -a simu_500_200_diffr_2-1.fa -b simu_500_200_diffr_2-1.bed -i id-subset.txt -m ratite.mod -o phyloacc-test -t "strCam;rhePen;rheAme;casCas;droNov;aptRow;aptHaa;aptOwe;anoDid" -g "allMis;allSin;croPor;gavGan;chrPic;cheMyd;anoCar" -n 4 -batch 5 -j 2 -part "[a valid partition on your cluster]"

Click here to see a breakdown of the above command

Command line parameter	Description
phyloacc.py	This tells the shell to run the phyloacc Python interface -- use `phyloacc.py -h` for a full list of options
-a simu_500_200_diffr_2-1.fa	This option specifies the path to a concatenated alignment file in FASTA format
-b simu_500_200_diffr_2-1.bed	This option specifies the path to a bed file that contains partitions for each element in the alignment file
-i id-subset.txt	This option specifies the path to a text file containing one element ID per line to subset the dataset -- PhyloAcc will only be run on these elements
-m ratite.mod	This option specifies the path to the MOD file from PHAST, which contains pre-estimated neutral substitution rates and a species tree
-o phyloacc-test	This option specifies the path to the desired output directory, which will be created if it does not exist
-t "strCam;rhePen;rheAme;casCas;droNov;aptRow;aptHaa;aptOwe;anoDid"	A semicolon-separated list of target species
-g "allMis;allSin;croPor;gavGan;chrPic;cheMyd;anoCar"	A semicolon-separated list of outgroup species
-n 4	This is the number of processes that `phyloacc.py` will use to generate the batches
-batch 5	This is the number of loci per batch
-j 2	This is the number of batches to submit to your cluster simultaneously
-part "[a valid partition on your cluster]"	This should be a partition on your cluster to which you wish to submit jobs
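To get a feel for how -batch and -j interact, the arithmetic below estimates how many batches are created and roughly how many rounds of submission they need. The locus count of 12 is a made-up number for illustration, not the size of id-subset.txt. The sketch also assumes that leftover loci go into a final, smaller batch and that batches take similar time to run, so treat the result as a rough picture rather than a description of what phyloacc.py or snakemake does internally.

```shell
n_loci=12       # hypothetical number of element IDs to run
batch_size=5    # value passed to -batch
max_jobs=2      # value passed to -j

# Ceiling division: a final, partially filled batch still counts as a batch
n_batches=$(( (n_loci + batch_size - 1) / batch_size ))
n_rounds=$(( (n_batches + max_jobs - 1) / max_jobs ))

echo "$n_loci loci -> $n_batches batches, about $n_rounds rounds with $max_jobs concurrent jobs"
```

With these numbers, 12 loci give 3 batches, which take about 2 rounds when 2 jobs run at a time.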

Next, run the resulting snakemake command that is printed to the screen. This will be different for every user, but should look something like this:

snakemake -p -s [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/run_phyloacc.smk --configfile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/phyloacc-config.yaml --profile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/profiles/slurm_profile --dryrun

Click here to see a breakdown of the above command

Command line parameter	Description
snakemake	A program that handles workflows and cluster job submission; this should be installed automatically when you install PhyloAcc from bioconda
-p	This option tells snakemake to print the shell commands that are executed
-s [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/run_phyloacc.smk	This option specifies the path to the snakemake file; this is written automatically by `phyloacc.py`
--configfile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/phyloacc-config.yaml	The path to the config file for a given workflow; this is written automatically by `phyloacc.py`
--profile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/profiles/slurm_profile	The path to a directory containing a cluster profile so snakemake knows how to submit jobs to your cluster; this is written automatically by `phyloacc.py`
--dryrun	The --dryrun option tells snakemake to report the commands it will execute without actually executing them

Once the dry run reports no errors and you are satisfied that it will run the batches you want, remove the --dryrun option to run the batches with snakemake. Because this is a very small dataset, the run should take only a couple of minutes; for large datasets, run the command in the background or submit the snakemake command as its own job.
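One common way to run a long command in the background is nohup with its output redirected to a log file. The sketch below uses a placeholder command (a short sh -c call, purely hypothetical) in place of the real snakemake command printed by phyloacc.py, so it can be run anywhere. On a cluster you would substitute your own snakemake command, and you may prefer to wrap it in a job script for your scheduler instead.

```shell
log_file=$(mktemp)

# Placeholder for the real snakemake command (hypothetical stand-in)
nohup sh -c 'echo "workflow finished"' > "$log_file" 2>&1 &
bg_pid=$!

# The shell is free for other work here; wait blocks until the job exits
wait "$bg_pid"
status=$?

log_contents=$(cat "$log_file")
echo "exit status: $status"
echo "$log_contents"
rm -f "$log_file"
```

Redirecting both stdout and stderr to the log keeps any error messages from the workflow in one place for later inspection.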

Finally, after the batches have completed, you will want to gather the outputs with the post-processing script:

phyloacc_post.py -i phyloacc-test

Click here to see a breakdown of the above command

Command line parameter	Description
phyloacc_post.py	The post-processing script, run after all batches have completed
-i phyloacc-test	The path to the directory that contains the PhyloAcc job files, which is the output directory (`-o`) specified when running `phyloacc.py`

To test PhyloAcc directly

If you wish to test the PhyloAcc-ST binary directly, run the following command:

PhyloAcc-ST phyloacc-test-config.txt

Click here to see a breakdown of the above command

Command line parameter	Description
PhyloAcc-ST	The path to the PhyloAcc binary/executable
phyloacc-test-config.txt	The path to the PhyloAcc config file

This will take a little longer since it doesn't do batching, but should still finish within a few minutes.

Files in this repository

File	Description
id-subset.txt	A subset of the 500 simulated elements to run, one element ID per line
phyloacc-test-config.txt	A PhyloAcc config file as input to the PhyloAcc-ST binary
ratite.mod	The neutral rate and tree file from PHAST for the ratite data
README.md	This file!
simu_500_200_diffr_2-1.bed	A bed file that contains the partitions of the 500 simulated elements in the alignment file
simu_500_200_diffr_2-1.fa	Concatenated alignments for all 500 simulated elements in FASTA format
