CBCR: A Curriculum Based Strategy For Chromosome Reconstruction

OluwadareLab, University of Colorado, Colorado Springs


Developers:
              Van Hovenga
              Department of Mathematics
              University of Colorado, Colorado Springs
              Email: vhovenga@uccs.edu

Contact:
              Oluwatosin Oluwadare, PhD
              Department of Computer Science
              University of Colorado, Colorado Springs
              Email: ooluwada@uccs.edu


Cite: Hovenga V, Oluwadare O. CBCR: A Curriculum Based Strategy For Chromosome Reconstruction. International Journal of Molecular Sciences. 2021; 22(8):4140. https://doi.org/10.3390/ijms22084140


1. Content of folders:

  • src: Matlab source code.
  • GM12878_Output: Output structures and log files generated from the GM12878 cell line using the MboI restriction enzyme at four resolutions (1Mb, 500Kb, 250Kb, 100Kb). These structures were generated using the primary-replicate combined mappings.
  • Validation: Output structures and log files generated from the GM12878 cell line used for validation at 1Mb resolution.
    • Mpol_Primary_Output: Outputs generated from the primary mappings using the MboI restriction enzyme.
    • Mbol_Replicate_Output: Outputs generated from the replicate mappings using the MboI restriction enzyme.
    • DpnII_Combined: Outputs generated from the primary-replicate combined mappings using the DpnII restriction enzyme.

2. Hi-C Data used in this study:

In our study, we used the synthetic dataset from Adhikari et al. The contact maps, the original models, and their reconstructed models used in this study were downloaded from here. We also used the synthetic dataset from Zou et al., which can be downloaded from here.

The GM12878 cell line Hi-C dataset (GEO accession number GSE63525) was downloaded from GSDB with GSDB ID OO7429SF.

3. Input file format:

CBCR allows two input formats:

  • Square matrix input format: An N by N intra-chromosomal contact matrix derived from Hi-C data, where N is the number of equal-sized regions of the chromosome.
  • Tuple input format: A Hi-C contact file in which each line describes one contact as 3 numbers separated by spaces: position_1 position_2 interaction_frequency.
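
The relationship between the two formats can be illustrated with a minimal MATLAB sketch that expands a few tuple-format contacts into a symmetric square matrix. The contact values are made up for illustration, and whether CBCR expects genomic coordinates or bin indices as positions is an assumption here, not something stated above.

```matlab
% Hypothetical tuple-format contacts: position_1 position_2 interaction_frequency.
% Assumption: positions are genomic bin starts at 1 Mb resolution.
tuples = [0       1000000 12.5;
          0       2000000  4.0;
          1000000 2000000  9.0];

% Map each distinct genomic position to a 1-based matrix index.
positions = unique([tuples(:, 1); tuples(:, 2)]);
[~, row] = ismember(tuples(:, 1), positions);
[~, col] = ismember(tuples(:, 2), positions);

% Fill a symmetric N-by-N contact matrix (N = number of distinct regions).
N = numel(positions);
contact = zeros(N);
contact(sub2ind([N N], row, col)) = tuples(:, 3);
contact(sub2ind([N N], col, row)) = tuples(:, 3);

disp(contact);
```

Either representation carries the same contacts; the tuple format is simply more compact when many region pairs have no recorded interaction.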

4. Usage:

To run CBCR, type the following in the MATLAB command window: CBCR(input, learning_rate, conversion, max_iter_0, max_iter_1, verbose)
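
As a concrete illustration, a call at 1Mb resolution might look like the sketch below. The input file name is hypothetical, and the snippet assumes the main function lives in a file named CBCR.m inside the src folder; the numeric values are the ones this README reports for the 1Mb and 500Kb GM12878 data.

```matlab
% Hypothetical input path; replace with your own contact file.
input_file = 'chr1_1mb_matrix.txt';

% Make the source folder visible to MATLAB if it is present.
if exist('src', 'dir') == 7
    addpath('src');
end

learning_rate = 0.2;                              % recommended value
conversion    = [0.1 0.3 0.5 0.7 0.9 1 1.3 1.5];  % vector search over conversion factors
max_iter_0    = 1000;                             % used for 1Mb and 500Kb in the study
max_iter_1    = 500;                              % used for 1Mb and 500Kb in the study
verbose       = 0;                                % show only the current curriculum

% Guard the call so the snippet fails gracefully when CBCR or the input is missing.
if exist('CBCR', 'file') == 2 && exist(input_file, 'file') == 2
    CBCR(input_file, learning_rate, conversion, max_iter_0, max_iter_1, verbose);
else
    fprintf('CBCR or the input file was not found; check the MATLAB path and the input path.\n');
end
```

Each argument is described in the list that follows.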

  • Arguments:
    • input: A string for the path of the input file
    • learning_rate: The learning rate of the algorithm [Recommended value: 0.2].
    • conversion: Vector or scalar. The factor(s) used to convert interaction frequency (IF) to distance, distance = 1/(IF^factor). When a vector is given, CBCR generates a structure for every conversion factor in the vector and selects the one that maximizes the distance Spearman correlation coefficient (dSCC) as the representative structure. For example, the input [0.1, 0.3, 0.5, 0.7, 0.9, 1, 1.3, 1.5] produces one candidate structure per factor. A vector input is recommended for a thorough search. When a scalar is given, only that single factor is used, for example 0.5.
    • max_iter_0: The maximum total number of iterations over all sub-curricula combined. This value should be smaller for smaller inputs and larger for larger inputs. In this study, a value of 1,000 was used for the 1Mb and 500Kb resolutions and 10,000 for the 250Kb and 100Kb resolutions of the GM12878 input data.
    • max_iter_1: The maximum total number of iterations over the final training of CBCR if early convergence is met. In this study, a value of 500 was used for the 1Mb and 500Kb resolutions and 1,000 for the 250Kb and 100Kb resolutions of the GM12878 input data.
    • verbose: Integer. Controls the console output of CBCR. A value of 0 displays only the current curriculum. A value of 1 displays the current curriculum and each iteration with its loss and the values of alpha and beta. A value of 2 displays everything shown for verbose = 1 plus a plot of the chromosome structure evolving as training progresses. Note that the plotting option slows CBCR down.
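
The conversion rule distance = 1/(IF^factor) can be tried out on a small matrix. The sketch below uses made-up interaction frequencies and leaves zero-IF pairs at zero, which is an assumption about how missing contacts are treated rather than a description of CBCR's internals; the dSCC-based selection among factors happens inside CBCR and is not reproduced here.

```matlab
% Hypothetical 3-by-3 interaction-frequency (IF) matrix.
IF = [0  16 4;
      16 0  9;
      4  9  0];

factors = [0.5 1];   % example conversion factors

for k = 1:numel(factors)
    factor = factors(k);
    dist = zeros(size(IF));
    nz = IF > 0;                          % assumption: zero-IF pairs get no target distance
    dist(nz) = 1 ./ (IF(nz) .^ factor);   % distance = 1/(IF^factor)
    fprintf('factor = %.1f\n', factor);
    disp(dist);
end
```

For instance, an IF of 16 maps to a distance of 1/4 with factor 0.5 and to 1/16 with factor 1, so larger factors pull frequently interacting regions closer together.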

5. Output:

CBCR outputs three files:

  1. .pdb: The protein data bank file of the representative structure.
  2. .log: A log file that records the input file, the optimal structure file name, the optimal conversion factor, and the corresponding dSCC, sPCC, and dRMSE.
  3. _coordinate_mapping.txt: Contains the mapping of genomic positions to indices in the model. Note that these indices start from 0, whereas PyMOL and Chimera number IDs from 1.
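
The index offset is easy to get wrong when locating a region in a viewer, so here is a minimal sketch of the shift. It assumes the mapping file has two numeric columns, genomic position then 0-based model index; that layout is a guess for illustration and should be checked against an actual output file.

```matlab
% Hypothetical contents of a _coordinate_mapping.txt file.
% Assumption: column 1 is the genomic position, column 2 is the 0-based model index.
mapping = [0       0;
           1000000 1;
           2000000 2];

% Shift to the 1-based IDs used by PyMOL and Chimera.
viewer_id = mapping(:, 2) + 1;

for k = 1:size(mapping, 1)
    fprintf('position %d -> model index %d -> viewer id %d\n', ...
            mapping(k, 1), mapping(k, 2), viewer_id(k));
end
```

If the real file is purely numeric, MATLAB's load function could read it into the same kind of matrix.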