If you are adapting this code for your own use, there are several things you will need to change to make it work on your system. These are in no particular order:
- The TE types and families created in whitelist_type and whitelist_fam are unique to my analyses. You will have to change the dictionary there to sort TE classifications a different way.
- I split up my genes into several files grouped by chromosome. To avoid working with a difficult gene file format, I use an mRNA bedfile to make retrieving the gene information easier. Both the bedfile and the gtf/gff file for the genes need to be split by chromosome.
- There can be cases where a TE completely encompasses a gene from left to right. This is most likely an artifact of the TE annotation software. To deal with it, I do not add the TE's values to the current density; instead, I write the TE to an overlap file so it can be inspected later. Unfortunately, I have hardcoded the naming scheme for the overlap file. If you want to edit it, you need to make sure the list matches the gtf_inputfile argument in get_densities.
- I use the multiprocessing library in Python to run each chromosome on its own processor to speed things up. The pre-algorithm preparation doesn't take very long by itself (10-20 seconds), but I run it before starting the main density algorithm for each chromosome, just for simplicity's sake.
- It may be useful in the future to replace the hard-coded class attribute assignments toward the end of the density algorithm, to make the code easier to adapt. When I have the free time, I will update it to use the setattr() function.
- 10/31/2018: I archived the H4 project. I did this because I have consolidated the code so that it can run on both Camarosa and H4 without maintaining separate versions. Right now it runs on Camarosa; to configure it for H4, you just need to make the selection at the bottom of Density_Expression. You still need the appropriate mRNA files, gtf files, and the gff of transposons.
- To get up and running, unzip TE_Bank, mRNA_Bank, and Gene_Bank to get all the requisite starting files. To run everything, run Density_Expression.py.
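
The handling of TEs that completely encompass a gene, described in the list above, could be sketched roughly as follows. This is a minimal illustration and not the code in this repository: the function names and the (name, start, stop) tuple layout are hypothetical, coordinates are assumed to be inclusive, and a TE with exactly the same coordinates as the gene is assumed to count as encompassing it.

```python
def te_encompasses_gene(te_start, te_stop, gene_start, gene_stop):
    """Return True if the TE spans the whole gene from left to right."""
    return te_start <= gene_start and te_stop >= gene_stop


def split_encompassing_tes(tes, gene_start, gene_stop):
    """Separate TEs that fully cover the gene from those that do not.

    `tes` is a list of (name, start, stop) tuples (hypothetical layout).
    Returns (usable, encompassing): the first list would feed the density
    calculation, the second would be written to the overlap file.
    """
    usable, encompassing = [], []
    for te in tes:
        _, start, stop = te
        if te_encompasses_gene(start, stop, gene_start, gene_stop):
            encompassing.append(te)
        else:
            usable.append(te)
    return usable, encompassing
```

Keeping the check in its own small function makes it easy to change the definition of "encompasses" (for example, to require strict inequality) without touching the density calculation.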
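
The setattr() change mentioned in the list above might look something like the sketch below. The class name, attribute names, and values are hypothetical placeholders and do not reflect the attributes actually used in the density algorithm. The point is only that a loop over a dictionary of computed values can replace a series of hard-coded assignments.

```python
class GeneDensity:
    """Hypothetical stand-in for the class that stores per-gene densities."""

    def __init__(self, name):
        self.name = name


def assign_densities(gene, densities):
    """Set one attribute on `gene` for each entry in `densities`.

    Equivalent to writing gene.<key> = <value> by hand for every key.
    """
    for attr_name, value in densities.items():
        setattr(gene, attr_name, value)
    return gene
```

With this approach, adding a new TE type or family would only require adding a key to the dictionary, rather than adding another assignment line.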