Finalize Code Base

Question

Finalize Code Base

Closed this issue 2 years ago · 14 comments

Answer 1 · 2022-02-01T18:13:30.000Z

The hg38.2bit file is too big to keep in the repository, so I think it would be best to create a shell script that uses wget to download the 2bit file from the UCSC Genome Browser. We might as well do the same with the chrom.sizes file.
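A minimal sketch of what such a download script could look like, assuming the chrom.sizes file sits next to the 2bit file in the UCSC bigZips directory for the build. The function names and the destination argument are illustrative and not part of maxATAC.

```shell
#!/bin/sh
# Sketch of a reference-data downloader. Function names are hypothetical.

# Print the base URL of the UCSC download directory for a genome build (e.g. hg38).
ucsc_base_url() {
    printf 'https://hgdownload.soe.ucsc.edu/goldenPath/%s/bigZips' "$1"
}

# Download <build>.2bit and <build>.chrom.sizes into a destination directory.
# wget -c resumes a partial download; -P sets the directory to save into.
download_reference() {
    build="$1"
    dest="${2:-.}"
    mkdir -p "$dest" || return 1
    for file in "$build.2bit" "$build.chrom.sizes"; do
        wget -c -P "$dest" "$(ucsc_base_url "$build")/$file" || return 1
    done
}
```

After sourcing the script, `download_reference hg38 ./reference` would fetch both files into `./reference`.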

Answer 2 · 2022-02-01T18:40:14.000Z

Sounds good to me!

Answer 3 · 2022-02-02T23:11:25.000Z

@emiraldi I have a few questions about how we want users to interact with maxatac predict. We currently output a bigwig by default with values from 0-1 for the whole genome (or a subset if the user wants specific chromosomes). The minimum command to run this looks like:

maxatac predict --sequence hg38.2bit \
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 \
--signal GM12878.bigwig

I think more users will be interested in how to get peaks from those predictions, not just a bigwig. If you provide a threshold stats file that is specific to the TF model, maxATAC will call "peaks" or bins that are above a specific threshold. The thresholds can be defined by different validation metrics like average precision, recall, and F1 score. With the current version of maxATAC, the user needs to specifically provide that stats file for the specific TF model.

Example:

maxatac predict --sequence hg38.2bit \
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 \
--cutoff_type Precision \
--cutoff_value .9 \
--cutoff_file maxatac/data/models/CTCF/CTCF_validationPerformance_vs_thresholdCalibration.tsv \
--signal GM12878.bigwig
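
To illustrate how a cutoff type and value could map to a score threshold, here is a rough sketch. The column layout of the stats file is an assumption (a header row with a `Threshold` column plus one column per metric), and the function is hypothetical rather than maxATAC's actual logic. It picks the lowest threshold whose metric is at least the requested value.

```shell
# pick_threshold FILE METRIC VALUE
# Prints the lowest threshold whose METRIC column is >= VALUE.
# Assumes a tab-separated file with a header row that names a "Threshold"
# column and a column called METRIC (hypothetical layout).
pick_threshold() {
    awk -F '\t' -v metric="$2" -v want="$3" '
        NR == 1 {
            # Locate the two columns of interest by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "Threshold") tcol = i
                if ($i == metric) mcol = i
            }
            if (!tcol || !mcol) { print "missing column" > "/dev/stderr"; bad = 1; exit }
            next
        }
        # Keep the smallest threshold among rows that meet the requested value.
        ($mcol + 0) >= (want + 0) && (!found || ($tcol + 0) < best) {
            best = $tcol + 0; found = 1
        }
        END { if (bad || !found) exit 1; print best }
    ' "$1"
}
```

With the file from the example above, the call would be `pick_threshold maxatac/data/models/CTCF/CTCF_validationPerformance_vs_thresholdCalibration.tsv Precision .9`, provided the file follows the assumed layout. The function returns a non-zero status when no row reaches the requested value.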

Is this format alright for publication, or should we have some code that will find the best model/cutoff file based on a flag for the TF name?

Example:

maxatac easy_predict -TF CTCF \
--signal GM12878.bigwig

If users want a specific type of peaks:

maxatac easy_predict -TF CTCF \
--signal GM12878.bigwig \
--cutoff_type Precision \
--cutoff_value .9

I think the latter option just takes out the guesswork of trying to line up the model and threshold file (even though it should be easy because they are already grouped in a TF-specific directory).
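
A rough sketch of the lookup that a TF flag would need to do, assuming every TF directory holds exactly one `.h5` model and a threshold file named `<TF>_validationPerformance_vs_thresholdCalibration.tsv`, as the CTCF directory does. The function name and the `MODEL_DIR` variable are hypothetical. It only prints the resulting `maxatac predict` command and does not run it.

```shell
# Directory that holds one sub-directory per TF, as in the commands above.
MODEL_DIR="${MODEL_DIR:-maxatac/data/models}"

# build_predict_cmd TF SIGNAL CUTOFF_TYPE CUTOFF_VALUE
# Prints the full `maxatac predict` command for one TF, or fails if the
# model or threshold file cannot be found unambiguously.
build_predict_cmd() {
    tf="$1"
    signal="$2"
    cutoff_file="$MODEL_DIR/$tf/${tf}_validationPerformance_vs_thresholdCalibration.tsv"
    model=""
    for candidate in "$MODEL_DIR/$tf"/*.h5; do
        # An unmatched glob stays literal, so check that the file exists.
        [ -e "$candidate" ] || continue
        [ -z "$model" ] || { echo "more than one model for $tf" >&2; return 1; }
        model="$candidate"
    done
    [ -n "$model" ] || { echo "no model found for $tf" >&2; return 1; }
    [ -e "$cutoff_file" ] || { echo "no threshold file for $tf" >&2; return 1; }
    printf 'maxatac predict --sequence hg38.2bit --models %s --cutoff_type %s --cutoff_value %s --cutoff_file %s --signal %s\n' \
        "$model" "$3" "$4" "$cutoff_file" "$signal"
}
```

For example, `build_predict_cmd CTCF GM12878.bigwig Precision .9` would print the long-form command shown earlier. Failing loudly when a directory holds zero or several models keeps the automation from silently picking the wrong file.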

Answer 4 · 2022-02-02T23:29:26.000Z

@tacazares Great ideas! Yes, I like the example commands -- very wise not to burden the user with pointing to the particular model. Some other considerations:

I think a .bed file should be the default output -- the user gets that no matter what. If they don't specify a cutoff, they get max F1 score.
I'm on the fence about whether to provide signal tracks as a default output. They're great to look at but also a bit heavy, so I'm torn. I'm kind of tempted to make both .bed and signal tracks outputs, because signal tracks look cool.
How are we going to deal with multiple TF predictions? Is there a way to provide a .txt file listing multiple TF names, or should we have users simply write a for-loop (and parallelize themselves!)?

Thank you!
Emily

Answer 5 · 2022-02-03T01:11:20.000Z

These are some nice suggestions! I'll hold off on making the pypi package again and the remaining install progress until this has been incorporated. @tacazares, do you need some help with making these changes?

Answer 6 · 2022-02-03T01:14:25.000Z

> I think a .bed file should be the default output -- the user gets that no matter what. If they don't specify a cutoff, they get max F1 score.
>
> I'm on the fence about whether to provide signal tracks as a default output. They're great to look at but also a bit heavy, so I'm torn. I'm kind of tempted to make both .bed and signal tracks outputs, because signal tracks look cool.

I'd like both bed and bw files to be output.

> How are we going to deal with multiple TF predictions? Is there a way to provide a .txt file listing multiple TF names, or should we have users simply write a for-loop (and parallelize themselves!)?

If the user wants to run multiple TFs, they should be able to write a for loop, imho.
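
Such a loop might look like the sketch below, which reads TF names from a plain-text file, one per line. The helper name, the `tf_list.txt` file, and the `MODEL_DIR` variable are hypothetical, and it assumes each TF directory contains its model as an `.h5` file. It uses only flags that appear in this thread and prints the commands instead of running them.

```shell
# Directory that holds one sub-directory per TF.
MODEL_DIR="${MODEL_DIR:-maxatac/data/models}"

# print_predict_cmds TF_LIST SIGNAL
# Prints one `maxatac predict` command per TF named in TF_LIST.
print_predict_cmds() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r tf || [ -n "$tf" ]; do
        # Skip blank lines and comment lines.
        case "$tf" in ''|'#'*) continue ;; esac
        for model in "$MODEL_DIR/$tf"/*.h5; do
            if [ -e "$model" ]; then
                printf 'maxatac predict --sequence hg38.2bit --models %s --signal %s\n' "$model" "$2"
            else
                echo "no model found for $tf, skipping" >&2
            fi
        done
    done < "$1"
}
```

Piping the output to `sh` would run the predictions one after another, and something like `print_predict_cmds tf_list.txt GM12878.bigwig | xargs -P 4 -I {} sh -c '{}'` would run four at a time where `xargs -P` is available. Each run would also need its own output name or directory so results are not overwritten; that part is left out here.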

Answer 7 · 2022-02-03T01:46:12.000Z

We might consider having some advanced options where the user can limit to "bed-only", if storage space is an issue. Thoughts? Worth doing?

Answer 8 · 2022-02-03T01:56:42.000Z

PS @FaizRizvi I agree about for loop for multiple TFs!

Answer 9 · 2022-02-03T04:32:04.000Z

@emiraldi @FaizRizvi I am thinking about MACS2 and its approach. By default, it produces peaks, which are regions of the genome that are enriched above background, and it only outputs the BED file. The user can then specify whether they want to output the raw bedgraph file that was used to call peaks. MACS can do this because it only takes minutes to run and call peaks. Our method takes hours and a lot of resources. So it might be worth having the signal tracks output by default until we can nail down our version of peak calling or speed up the method.

Answer 10 · 2022-02-04T01:00:09.000Z

Hey @emiraldi I spoke with @michael-kotliar and he suggested the following for our install directions:

1. git clone https://github.com/MiraldiLab/maxATAC_data.git
2. wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
3. Copy maxATAC_data to local location:

mkdir -p /opt/maxatac/data/
cp -r ./maxATAC_data /opt/maxatac/

4. Install maxATAC:

make env
pip install maxatac==1.0.1

or

Docker
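
One detail in the copy step is worth checking: because /opt/maxatac/ already exists after the mkdir, `cp -r ./maxATAC_data /opt/maxatac/` places the files in /opt/maxatac/maxATAC_data/ and leaves /opt/maxatac/data/ empty. If the intent is to populate data/, a sketch like the following would do it. The function name is hypothetical, and the destination is a parameter so the step can be tried without write access to /opt.

```shell
# install_data SRC DEST
# Copies the contents of SRC into DEST, creating DEST if needed.
# "SRC/." copies everything inside SRC, including hidden entries such as .git.
install_data() {
    src="$1"
    dest="$2"
    [ -d "$src" ] || { echo "source directory not found: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp -r "$src"/. "$dest"/
}
```

The call matching the directions above would be `install_data ./maxATAC_data /opt/maxatac/data`.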

If we do it this way, then maxatac predict should be kept as is.

maxatac predict --sequence hg38.2bit \
--models maxatac/data/models/CTCF/CTCF_binary_revcomp99_fullModel_RR0_95.h5 \
--cutoff_type Precision \
--cutoff_value .9 \
--cutoff_file maxatac/data/models/CTCF/CTCF_validationPerformance_vs_thresholdCalibration.tsv \
--signal GM12878.bigwig

This method makes the user responsible for copying the paths correctly as well as using them. If we automate this process, it will introduce numerous possible errors.

As far as storage space is concerned, I don't think this is that big of an issue. I took a look at our outputs, and the bw files are about 50-60 MB when predicted on chr1 and 600 MB for prediction on hg38. For peak calling, the output bed file is 50 MB for chr1.

We can tell users to make sure they have at least 1 GB of space to store their output.

Answer 11 · 2022-02-04T03:37:39.000Z

If that's Michael's best advice, then let's follow it. There's something to be said for clarity of paths. Also, if someone wanted to run on hg19 or mm10, that would presumably work with their hg19 or mm10 signal track and corresponding 2bit files. Actually, it wouldn't, because we only have an hg38 blacklist. We will have to make that clear and warn them in the docs: hg38 only, for the moment.

Re: trouble with space: 100 TFs x 660 MB = 66 GB, so we'll need to make this clear in the documentation and explain how to turn off the signal track output as a non-default option. If people have multiple ATAC experiments...

For running prediction using the default cutoff (F1 score), how will that command look?

Let me know when you're ready for codebase review or if you want to meet tomorrow to discuss any issues.

Thank you!
Emily
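
The storage arithmetic can be sketched as a small helper for the docs. The function name is hypothetical; the per-TF size is the 660 MB figure used above, and the experiment count is an optional multiplier for users with several ATAC experiments.

```shell
# estimate_storage_gb N_TFS MB_PER_TF [N_EXPERIMENTS]
# Prints the estimated total in GB (decimal, 1 GB = 1000 MB), rounded up.
estimate_storage_gb() {
    n_tfs="$1"
    mb_per_tf="$2"
    n_experiments="${3:-1}"
    total_mb=$(( n_tfs * mb_per_tf * n_experiments ))
    # Adding 999 before the integer division rounds up to the next whole GB.
    echo $(( (total_mb + 999) / 1000 ))
}
```

`estimate_storage_gb 100 660` prints 66, matching the estimate above, and `estimate_storage_gb 100 660 5` prints 330 for five experiments.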

Answer 12 · 2022-02-07T01:06:17.000Z

@emiraldi: CODE_OF_CONDUCT.md
I put my email in there temporarily under enforcement. This can be changed as you think is best.
@emiraldi: CONTRIBUTING.md
How should we write this section of the repo? Here is an example that I saw: https://github.com/torus-online/torus/blob/master/CONTRIBUTING.MD

Do we want maxATAC to be improved upon by others in the community and merged back into maxATAC?

Answer 13 · 2022-02-07T14:56:55.000Z

Hi Faiz,

Thanks for the link to Torus. I think we will want users to contribute by opening issues, and I like Torus's guidelines:

- I have a question
- reporting bugs
- suggesting enhancements

Contributions beyond that scope are out of the picture for now, but we could imagine having contributors from other groups at some point.

Question: Is it possible to track the number of times someone clones the repository? This will be important to getting software support eventually. I did some poking around but have to take a break: https://stackoverflow.com/questions/10056638/how-to-get-github-clone-stats/25270050#25270050

Also, how soon before we can have internal people (Joe/Anthony, Weirauch lab friends) try out the codebase? I could review later today or tomorrow, when ready.

Thank you!
Emily
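
On the clone-tracking question, GitHub's REST API exposes clone counts through the repository traffic endpoint. It only covers the last 14 days and requires push access to the repository, so the numbers would need to be saved periodically to build a long-term record. A minimal sketch, where the function names are hypothetical and `GITHUB_TOKEN` is assumed to hold a suitable token:

```shell
# clones_url OWNER REPO
# Prints the traffic/clones endpoint URL for a repository.
clones_url() {
    printf 'https://api.github.com/repos/%s/%s/traffic/clones' "$1" "$2"
}

# fetch_clone_stats OWNER REPO
# Prints the JSON response, which includes total and unique clone counts.
# Requires GITHUB_TOKEN to hold a token with push access to the repository.
fetch_clone_stats() {
    [ -n "${GITHUB_TOKEN:-}" ] || { echo "GITHUB_TOKEN is not set" >&2; return 1; }
    curl -sS \
        -H "Accept: application/vnd.github+json" \
        -H "Authorization: Bearer $GITHUB_TOKEN" \
        "$(clones_url "$1" "$2")"
}
```

A call would take the owner and repository name, for example `fetch_clone_stats MiraldiLab maxATAC` if the code lives under the same organization as the data repository, and could be run from a scheduled job that appends the output to a log.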

Answer 14 · 2022-05-13T18:13:59.000Z

We need to update the docs for prediction, specifically variant-based predictions. We also need to clarify that the -m flag uses the model file the user provides, whereas the -tf flag is used to look up the correct model for a TF.