MiraldiLab/maxATAC

Finalize Code Base

Closed this issue · 14 comments

  • Delete Refactor and Develop branches
  • Discuss hg38 meta file management
  • Make PyPi package (MUST INCLUDE 'maxatac prepare')
  • Make Docker containers
  • Modify install instructions accordingly
  • Setup docs_rendering (with https://readthedocs.org/ - need to be public repo for that)
  • Test install
  • maxATAC_data needs to be properly linked to maxATAC
  • Repository is too big. We will need to remove all files that were added here by mistake.
  • Add LICENSE
  • Add CODE_OF_CONDUCT.md
  • Add CONTRIBUTING.md
  • #104
  • Make sure we are not bound to Python 3.6 as it will be deprecated soon
  • Set author and author_email in setup.py
  • Make sure we don't have anything identical to the code we used as an example!

hg38.2bit file is too big. So I think it would be best to create a shell script which will wget the 2bit file from the ucsc genome browser. We might as well do the same with the chrom.sizes file as well.

@emiraldi I have a few questions about how we want user to interact with maxatac predict. We currently output a bigwig by default with values from 0-1 for the whole genome (or a subset if the user wants specific chromosomes). The minimum command to run this looks like:

maxatac predict --sequence hg38.2bit \
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 \
--signal GM12878.bigwig

I think more users will be interested in how to get peaks from those predictions, not just a bigwig. If you provide a threshold stats file that is specific to the TF model, maxATAC will call "peaks" or bins that are above a specific threshold. The thresholds can be defined by different validation metrics like average precision, recall, and F1 score. With the current version of maxATAC, the user needs to specifically provide that stats file for the specific TF model.

Example:

maxatac predict --sequence hg38.2bit \
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 \
--cutoff_type Precision \
--cutoff_value .9 \
--cutoff_file maxatac/data/models/CTCF/CTCF_validationPerformance_vs_thresholdCalibration.tsv \
--signal GM12878.bigwig

Is this format alright for publication, or should we have some code that will find the best mode/cutoff file based on a flag for the TF name?

Example:

maxatac easy_predict -TF CTCF \
--signal GM12878.bigwig 

If users want specific type of peaks:

maxatac easy_predict -TF CTCF \
--signal GM12878.bigwig \
--cutoff_type Precision \
--cutoff_value .9 

I think the latter option just takes out the guesswork of trying to lineup model + threshold file (even though it should be easy because they are grouped in a specific TF directory already).

@tacazares Great ideas! Yes, I like the example commands -- very wise not to burden the user with pointing to the particular model. Some other considerations:

  • I think .bed file should be the default output -- user gets that no matter what. If they don't specify cutoff, they get max f1-score.
  • I'm on the fence want to provide signal tracks as the default output. They're great to look at but also a bit heavy, so I'm torn. I'm kind of tempted to make both .bed and signal tracks outputs, because signal tracks look cool.
  • How are we going to deal with multiple TF predictions? Is there a way to provide a .txt file listing multiple TF names or should we have users simply write a for-loop (and parallelize themselves!)

Thank you!
Emily

These are some nice suggestions! I'll hold off on making the pypi package again and the remaining install progress until this has been incorporated. @tacazares, do you need some help with making these changes?

  • I think .bed file should be the default output -- user gets that no matter what. If they don't specify cutoff, they get max f1-score.
  • I'm on the fence want to provide signal tracks as the default output. They're great to look at but also a bit heavy, so I'm torn. I'm kind of tempted to make both .bed and signal tracks outputs, because signal tracks look cool.

I'd like having both bed and bw files being output.

  • How are we going to deal with multiple TF predictions? Is there a way to provide a .txt file listing multiple TF names or should we have users simply write a for-loop (and parallelize themselves!)

If the user wants to do multiple TFs he/she should be able to write a for loop imho.

We might consider having some advanced options where the user can limit to "bed-only", if storage space is an issue. Thoughts? Worth doing?

PS @FaizRizvi I agree about for loop for multiple TFs!

@emiraldi @FaizRizvi I am thinking about MACS2 and their approach. They will produce peaks which are regions of the genome that are enriched above background by default. They only output the BED file. The user can then specify whether they want to output the raw bedgraph file that was used to call peaks. MACS can do this because it only takes minutes to run and call peaks. Our methods takes hours and a lot of resources. So it might be worth having the signal tracks output by default until we can nail down our version of peak calling or speed up the method.

Hey @emiraldi I spoke with @michael-kotliar and he suggested the following for our install directions:

  1. git clone https://github.com/MiraldiLab/maxATAC_data.git

  2. wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit

  3. Copy maxATAC_data to local location:

    mkdir -p /opt/maxatac/data/
    cp -r ./maxATAC_data /opt/maxatac/

  4. Install maxATAC

    make env
    pip install maxatac==1.0.1

    or

    Docker

If we do it this way then maxatac predict should be kept the same way as is.

maxatac predict 
--sequence hg38.2bit 
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 
--cutoff_type Precision 
--cutoff_value .9 
--cutoff_file maxatac/data/models/CTCF/CTCF_validationPerformance_vs_thresholdCalibration.tsv 
--signal GM12878.bigwig

This method ensures the user is responsible for copying the paths correctly as well as using them. If we automate this process this will introduce numerous possible errors.

As far as storage space is concerned, I dont think this is that big of an issue. I took a look at our outputs for and bw files are about 50-60 mb predicted on chr1 and 600 mb for prediction on hg38. For peak calling the out put bed file is 50 mb for chr1.

We can let the user know, make sure you have at least 1 gb of space to store your output.

@emiraldi: CODE_OF_CONDUCT.md
I put my email in there temporarily under enforcement. This can be changed as you think is best.
@emiraldi: CONTRIBUTING.md
How should we write this section of the repo? Here is an example that I saw: https://github.com/torus-online/torus/blob/master/CONTRIBUTING.MD

Do we want maxATAC to be improved upon by others in the community and merged back into maxATAC?

We need to update the docs for prediction, specifically variant based predictions. We also need to clarify that the -m flag will use the model where the -tf is used to find the correct model for a TF.