This repository contains a small test dataset to run with PhyloAcc after installing.
The test dataset is based on the ratite dataset from Hu et al. 2019 and is composed of 500 simulated genomic elements, each 200 bp long, with varying rates.
To use this test repo, first click the green Code button above and copy your preferred link. Then, in your shell, clone the repo:
git clone [copied URL from green Code button]
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
git | This calls the main git program from the command line |
clone | The sub-program of git used to copy repositories from the web to your local machine |
[copied URL from green Code button] | The URL of the repository you wish to copy |
Then enter that directory:
cd PhyloAcc-test-data
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
cd | The Linux change directory command |
PhyloAcc-test-data | The path to the directory you want to enter |
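
Before running anything, it can help to confirm that the clone contains the input files listed in the file table of this README. The following is a minimal sketch and not part of PhyloAcc; the function name `check_test_files` is hypothetical.

```shell
# Hypothetical helper (not part of PhyloAcc): report any expected input
# file that is missing from the given directory (default: current directory).
check_test_files() {
    dir="${1:-.}"
    missing=0
    for f in id-subset.txt phyloacc-test-config.txt ratite.mod \
             simu_500_200_diffr_2-1.bed simu_500_200_diffr_2-1.fa; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status 0 when every file is present, 1 otherwise.
    return "$missing"
}
```

Running `check_test_files .` from inside the cloned directory prints nothing and returns 0 when all five files are present.
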
There are two ways to run PhyloAcc: through the interface with `phyloacc.py [options]`, or directly with `PhyloAcc-ST [config file]`. We strongly recommend using the interface, which takes care of batching with snakemake and greatly increases efficiency on large datasets.
First, run the interface to generate batches and snakemake files:
phyloacc.py -a simu_500_200_diffr_2-1.fa -b simu_500_200_diffr_2-1.bed -i id-subset.txt -m ratite.mod -o phyloacc-test -t "strCam;rhePen;rheAme;casCas;droNov;aptRow;aptHaa;aptOwe;anoDid" -g "allMis;allSin;croPor;gavGan;chrPic;cheMyd;anoCar" -n 4 -batch 5 -j 2 -part "[a valid partition on your cluster]"
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
phyloacc.py | This tells the shell to run the phyloacc Python interface -- use phyloacc.py -h for a full list of options |
-a simu_500_200_diffr_2-1.fa | This option specifies the path to a concatenated alignment file in FASTA format |
-b simu_500_200_diffr_2-1.bed | This option specifies the path to a bed file that contains partitions for each element in the alignment file |
-i id-subset.txt | This option specifies the path to a text file containing one element ID per line to subset the dataset -- PhyloAcc will only be run on these elements |
-m ratite.mod | This option specifies the path to the MOD file from PHAST, which contains pre-estimated neutral substitution rates and a species tree |
-o phyloacc-test | This option specifies the path to the desired output directory, which will be created if it does not exist |
-t "strCam;rhePen;rheAme;casCas;droNov;aptRow;aptHaa;aptOwe;anoDid" | A semicolon-separated list of target species |
-g "allMis;allSin;croPor;gavGan;chrPic;cheMyd;anoCar" | A semicolon-separated list of outgroup species |
-n 4 | This is the number of processes that phyloacc.py will use to generate the batches |
-batch 5 | This is the number of loci per batch |
-j 2 | This is the number of batches to submit to your cluster simultaneously |
-part "[a valid partition on your cluster]" | This should be a partition on your cluster to which you wish to submit jobs |
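
As a rough check on what `-batch 5` implies, the number of batches can be estimated from the number of IDs in `id-subset.txt`. The sketch below assumes that the count is simply the ceiling of (number of IDs) / (loci per batch); that is an assumption about how batching works, and `estimate_batches` is a hypothetical helper, not a PhyloAcc command.

```shell
# Hypothetical helper: estimate how many batches result from an ID file
# (one element ID per line) and a loci-per-batch value such as -batch 5.
# Assumes batches = ceil(number of IDs / loci per batch).
estimate_batches() {
    id_file="$1"
    per_batch="$2"
    # Count non-empty lines so a trailing blank line is not counted as an ID.
    n_ids=$(grep -c . "$id_file" || true)
    # Integer ceiling division.
    echo $(( (n_ids + per_batch - 1) / per_batch ))
}
```

For example, `estimate_batches id-subset.txt 5` prints the estimated batch count for the command above, which can be compared with what `phyloacc.py` reports.
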
Next, run the resulting snakemake command that is printed to the screen. This will be different for every user, but should look something like this:
snakemake -p -s [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/run_phyloacc.smk --configfile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/phyloacc-config.yaml --profile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/profiles/slurm_profile --dryrun
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
snakemake | A program that handles workflows and cluster job submission; this should be installed automatically when you install PhyloAcc from bioconda |
-p | This option tells snakemake to print the shell commands that are executed |
-s [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/run_phyloacc.smk | This option specifies the path to the snakemake file; this is written automatically by phyloacc.py |
--configfile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/phyloacc-config.yaml | The path to the config file for a given workflow; this is written automatically by phyloacc.py |
--profile [/your/path/to]/PhyloAcc-test-data/phyloacc-test/phyloacc-job-files/snakemake/profiles/slurm_profile | The path to a directory containing a cluster profile so snakemake knows how to submit jobs to your cluster; this is written automatically by phyloacc.py |
--dryrun | The --dryrun option tells snakemake to report the commands it will execute without actually executing them |
Once the dry run reports no errors and you are satisfied that it will run the batches you want, remove the `--dryrun` option to run the batches with snakemake. Since this is a very small dataset, it should take only a couple of minutes, but for large datasets you would want to run this in the background or submit the snakemake command as its own job.
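
For larger datasets, one common way to keep a long command running after you log out is `nohup` with output redirected to a log file. The sketch below is a generic shell pattern rather than anything PhyloAcc provides; `run_in_background`, `BG_PID`, and the log file name are hypothetical.

```shell
# Hypothetical helper: run any command in the background with nohup,
# sending stdout and stderr to a log file. The PID is stored in BG_PID.
run_in_background() {
    log_file="$1"
    shift
    nohup "$@" > "$log_file" 2>&1 &
    BG_PID=$!
}
```

It would be called as `run_in_background snakemake.log` followed by the snakemake command printed by `phyloacc.py` (without `--dryrun`); progress can then be followed with `tail -f snakemake.log`.
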
Finally, after the batches have completed, you will want to gather the outputs with the post-processing script:
phyloacc_post.py -i phyloacc-test
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
phyloacc_post.py | The post-processing script, run after all batches have completed |
-i phyloacc-test | The path to the directory that contains the PhyloAcc job files, which is the output directory (-o) specified when running phyloacc.py |
If you wish to test the PhyloAcc-ST binary directly, run the following command:
PhyloAcc-ST phyloacc-test-config.txt
Click here to see a breakdown of the above command
Command line parameter | Description |
---|---|
PhyloAcc-ST | The path to the PhyloAcc binary/executable |
phyloacc-test-config.txt | The path to the PhyloAcc config file |
This will take a little longer since it doesn't do batching, but should still finish within a few minutes.
File | Description |
---|---|
id-subset.txt | A subset of the 500 simulated elements to run, one element ID per line |
phyloacc-test-config.txt | A PhyloAcc config file as input to the PhyloAcc-ST binary |
ratite.mod | The neutral rate and tree file from PHAST for the ratite data |
README.md | This file! |
simu_500_200_diffr_2-1.bed | A bed file that contains the partitions of the 500 simulated elements in the alignment file |
simu_500_200_diffr_2-1.fa | Concatenated alignments for all 500 simulated elements in FASTA format |
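
To sanity-check the inputs against the dataset description (500 elements, each 200 bp), the element count and total length can be read from the bed file. This sketch assumes that the start coordinate is in column 2, the end coordinate is in column 3, the element length is end minus start, and there are no header lines; `bed_summary` is a hypothetical helper.

```shell
# Hypothetical helper: print "<element count> <total bp>" for a bed file.
# Assumes column 2 = start, column 3 = end, length = end - start,
# and no header lines.
bed_summary() {
    awk 'NF >= 3 { n++; total += $3 - $2 }
         END { printf "%d %d\n", n, total }' "$1"
}
```

Running `bed_summary simu_500_200_diffr_2-1.bed` and comparing the two numbers with the description is a quick way to spot a truncated or mismatched file.
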