/domainClassifyR

Classification of TAD structures

Primary LanguageR

domainClassifyR

The aim of this analysis was to identify distinct distributions of interacting contacts within genomic regions. A pair of thresholds is applied to the Z-scores: a minimum threshold to identify sectors that are enriched and a maximum threshold to identify sectors that are not enriched.

Installation

Please have 'misha' R package installed prior to running run_domainClassifyR.R, with a GENOME_DB configured (minimum example provided in example_data).

Install domainClassifyR using devtools::github('ChristopherBarrington/domainClassifyR'). Tested under Linux R 3.5.1 with:

 package         * version    date
 devtools        * 1.13.6     2018-06-27
 domainClassifyR * 0.0.0.9000 2018-11-02
 doMC              1.3.5      2017-12-12
 dplyr           * 0.7.7      2018-10-16
 ggplot2         * 3.1.0      2018-10-25
 magrittr          1.5        2014-11-22
 misha           * 4.0.4      2018-09-24
 plyr            * 1.8.4      2016-06-08
 purrr             0.2.5      2018-05-29
 scales            1.0.0      2018-08-09
 snow            * 0.4-2      2016-10-14
 tibble            1.4.2      2018-01-22
 tidyselect        0.2.5      2018-10-11

Steps

Getting contacts within 2D intervals

get_contacts() is used to query the track_nm score track to select all interacting fends within each 2D interval in domains. The data are saved into a list and cached in cache.path to quickly reload the same 2D interval.

  • domains A data.frame of 2D intervals
  • track_nm Name of the score track from which fend pairs are extracted
  • cache.path Absolute path to the directory that cached domains can be found or saved (will be created)

All subsequent functions use the list of contacts returned by get_contacts() as input, specified in each by the contacts argument.

See ?get_contacts

Filtering and profiling contacts

Once the 2D intervals are cached, they are loaded into a list (one element per 2D interval). Subsequent characterisations of 2D intervals are required to format the data for sampling.

add.domain.features() appends information to the DOMAIN element and to the all contacts data.frame: the sizes of the 2 intervals are recorded as well as the position in resolution bins of the scored fend pair relative to the start and end of the 2D interval. Interactors (high-score fend pairs) with a bin position of 0 would be considered as leading- or trailing-edge).

  • resolution The size in bp of the trailing- or leading- edge (and therefore the size of the interval 'corner')

get.high.score.fends() duplicates the ALL contacts data.frame, selecting only those fend pairs with score above min_score into the HIGH_SCORE element.

  • min_score The minimum score for which fend pairs are considered interactors

count.contacts() is used to count the number of contact pairs in the 2D interval (and takes no arguments beyond the contacts list).

get.domain.position.classification() calculates the number of contacts (and interactors) in each sector of each 2D interval using the binned positions of contacts from the add.domain.features function and the resolution parameter. The function also calculates the number of interactors in a sector as a proportion of all contacts in the 2D interval ($FEND.SCORE.DISTRIBUTION$SCORES_dHS_mALL) and of the total number of interactors ($FEND.SCORE.DISTRIBUTION$SCORES_dALL_mHS). This function takes no arguments beyond the contacts list.

See ?add.domain.features, ?get.high.score.fends, ?count.contacts, ?get.domain.position.classification

Randomly sampling high-score intra-interval contacts

The contacts within each 2D interval are then randomly sampled to generate a random distribution of contacts within each sector that can be compared to the observed number of interactors in each sector. get.fend.samples() is used to randomly select contacts (ignoring score) and count how many of the randomly selected contacts are positioned in each of the four sectors. The number of contacts sampled (using R sample()) is determined by the number of interactors. The distribution is generated by repeating the sampling process n times; a data.frame of n rows is produced with each column showing the number of randomly selected contacts in a sector ($CONTACTS$RANDOM_SAMPLE_TABLES). The process can be parallelised using the doMC package and the cluster_size parameter to determine the number of parallel threads. For reproducibility a seed can be set, which is determined by the 2D interval ID - expecting an ID of 12:34 the seed would be 1234.

  • random_seed Boolean whether to use the intervalID to set a seed
  • n number of times to repeat the sampling process
  • cluster_size number of parallel threads to run or 1 for no parallelisation

See ?get.fend.samples

Compute Z-scores for each sector of each 2D interval

Using the number of sampled contacts per sector, get_Z_statistics() calculates the mean and standard deviation of the sector distributions which are used with the number of observed interactors to calculate a Z-score for each sector (of each 2D interval). This is saved into $Z_STATISTICS. Z-scores are converted to p-values using pnorm (P). get_min_fend_pair_domains() is used to filter the 2D interval set, requiring that an interval pair has at least min interactors. Multiple testing is then corrected for using get_Q_values(). The 2D intervals set is collapsed to a data such that each row is a filtered 2D interval and each column is P for each sector in the 2D interval. Each column is corrected using the method specified by method which is passed to p.adjust().

The enrichment of each sector in 2D regions is calculated independently so each region may have multiple enriched sectors. The classification of the 2D region is the aggregate of the enriched domain sectors. Regions that have a sector classified as ambiguous are excluded from further analysis. Increasing the lower threshold or decreasing the upper threshold would decrease the number of ambiguous sectors identified but could decrease specificity (or homogeneity) of the classified groups. Therefore, a TAD presented with a ‘Corner’ classification (loop domain) shows specific (and exclusive) enrichment of high-scoring contacts in the 60kb interacting region between the TAD borders. TADs with multiple enriched sectors were not considered; in these TADs, the Leading+Corner or Trailing+Corner sectors were most common.

  • min Minimum number of contacts or interactors required in a domain
  • measure Should be 'HIGH_SCORE' or 'ALL' to determine which contact set is filtered by
  • method The correction method to apply, see ?p.adjust

See ?get_Z_statistics, ?get_min_fend_pair_domains, ?get_Q_values

MWE:

Using example data, script should complete in minutes.

library(misha)
library(domainClassifyR)

# set the misha GENOME_DB root (eg extract the example_data mm10 archive into /absolute/path/to/GENOME_DB)
gsetroot('/absolute/path/to/GENOME_DB/mm10/')

# load data / set parameters
intervals_2d <- gintervals.load("intervs.chr19_domains")
track_nm <- 'hic.example_project.example_dataset.score'
cache_path <- file.path(getwd(), 'domainclassifyr_cache')

# run the parts of the package
contacts <- get_contacts(domains=intervals_2d,
                         track_nm=track_nm,
                         cache.path=cache_path,
                         band=c(10e3,15e6),
                         get.fends=FALSE)
contacts <- add.domain.features(contacts, resolution=60e3)
contacts <- get.high.score.fends(contacts, min_score=40)
contacts <- count.contacts(contacts)
contacts <- get.domain.position.classifications(contacts)
contacts <- get.fend.samples(contacts, n=1000)
contacts <- get_Z_statistics(contacts)
contacts <- get_min_fend_pair_domains(contacts, min=100)
contacts <- get_Q_values(contacts)

# get a data.frame of qvalues for each sector of the domain
> ldply(contacts, function(d) d$Z_STATISTICS$OBSERVED$Q) %>% tibble::as.tibble()
# # A tibble: 77 x 5
#    .id         CORNER  FORWARD   REVERSE    OTHER
#    <chr>        <dbl>    <dbl>     <dbl>    <dbl>
#  1 2012:2012 1.00e+ 0 1.48e- 4 0.        1.00e+ 0
#  2 2013:2013 1.00e+ 0 1.00e+ 0 1.00e+  0 3.30e-21
#  3 2015:2015 1.00e+ 0 4.75e-65 1.00e+  0 1.00e+ 0
#  4 2017:2017 1.32e-52 1.00e+ 0 8.29e-  2 8.92e- 1
#  5 2019:2019 1.27e- 8 1.00e+ 0 9.79e-209 1.00e+ 0
#  6 2021:2021 4.54e-14 3.54e- 3 1.00e+  0 1.00e+ 0
#  7 2022:2022 1.00e+ 0 1.00e+ 0 1.00e+  0 6.35e-60
#  8 2023:2023 1.00e+ 0 1.00e+ 0 1.00e+  0 1.33e-11
#  9 2024:2024 4.34e-62 1.00e+ 0 1.15e-186 1.00e+ 0
# 10 2026:2026 5.88e-55 2.54e-25 1.00e+  0 1.00e+ 0
# # ... with 67 more rows

# plot the contacts in one domain
ggplot2::ggplot(contacts[['2106:2106']]$CONTACTS$HIGH_SCORE)+
  aes(x=FEND.X,y=FEND.Y,colour=SCORE)+
  geom_point(shape=20)+coord_fixed()+theme(aspect.ratio=1)

Example output

Running the MWE

This should produce not produce any output except for progress bars (unless there is an error).

> contacts <- get_contacts(domains=intervals_2d,
+                          track_nm=track_nm,
+                          cache.path=cache_path,
+                          band=c(10e3,15e6),
+                          get.fends=FALSE)
[get.contacts] Getting contact matrices for 85 domains
[get.contacts] Saving new domains to cache [n=85]
  |======================================================================| 100%
[get.contacts] Preallocating a 85 element list
[get.contacts] Loading matrices from cache (/home/local/domainclassifyr_cache)
  |======================================================================| 100%
> contacts <- add.domain.features(contacts, resolution=60e3)
[add.domain.features] Adding domain features
  |======================================================================| 100%
> contacts <- get.high.score.fends(contacts, min_score=40)
[get.high.score.fends] Identifying high-scoring fends [>=40]
  |======================================================================| 100%
> contacts <- count.contacts(contacts)
[count.contacts] Counting contacts
  |======================================================================| 100%
> contacts <- get.domain.position.classifications(contacts)
Getting distribution of fend positions
  |======================================================================| 100%
> contacts <- get.fend.samples(contacts, n=1000)
[get.fend.samples] Randomly selecting contacts from ALL [x 1,000]
> contacts <- get_Z_statistics(contacts)
[get_Z_statistics] Calculating Z-statistics and p-values
  |======================================================================| 100%
> contacts <- get_min_fend_pair_domains(contacts, min=100)
[get_min_fend_pair_domains] Selecting domains with at least 100 fend pairs in HIGH_SCORE
> contacts <- get_Q_values(contacts)
[get_Q_values] Correcting for multiple testing
>

Example structure the results

Shown is the structure for the first TAD in the contacts list of 2D intervals.

> str(contacts[[1]])
List of 6
 $ DOMAIN                 :'data.frame':	1 obs. of  13 variables:
  ..$ chrom1     : Factor w/ 1 level "chr19": 1
  ..$ start1     : num 3172500
  ..$ end1       : num 4974499
  ..$ chrom2     : Factor w/ 1 level "chr19": 1
  ..$ start2     : num 3172500
  ..$ end2       : num 4974499
  ..$ intervalID : Factor w/ 1 level "2012:2012": 1
  ..$ intervalID1: num 2012
  ..$ SIZE1      : num 1801999
  ..$ intervalID2: num 2012
  ..$ SIZE2      : num 1801999
  ..$ L1         : Factor w/ 1 level "SELF": 1
  ..$ AREA       : num 3.25e+12
 $ CONTACTS               :List of 3
  ..$ ALL                 :'data.frame':	88394 obs. of  7 variables:
  .. ..$ FEND.X       : num [1:88394] 3669732 3671461 3665714 3671461 3663092 ...
  .. ..$ FEND.Y       : num [1:88394] 4965109 4965109 4973544 4973544 4965622 ...
  .. ..$ SCORE        : num [1:88394] -13.2 -13.7 -16.8 -12.7 -14.3 ...
  .. ..$ GROUP.X      : num [1:88394] 8 8 8 8 8 8 8 8 8 8 ...
  .. ..$ GROUP.Y      : num [1:88394] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..$ FEND.DISTANCE: num [1:88394] 1295377 1293648 1307830 1302083 1302530 ...
  .. ..$ POSITION     : Factor w/ 4 levels "CORNER","FORWARD",..: 3 3 3 3 3 3 3 3 3 3 ...
  ..$ HIGH_SCORE          :'data.frame':	3113 obs. of  7 variables:
  .. ..$ FEND.X       : num [1:3113] 3796255 3734437 3734436 3734436 3734437 ...
  .. ..$ FEND.Y       : num [1:3113] 3980935 3919813 3919812 3904170 3904170 ...
  .. ..$ SCORE        : num [1:3113] 40.1 40.1 40.1 40.6 40.6 ...
  .. ..$ GROUP.X      : num [1:3113] 10 9 9 9 9 9 9 9 9 9 ...
  .. ..$ GROUP.Y      : num [1:3113] 16 17 17 17 17 17 17 17 17 17 ...
  .. ..$ FEND.DISTANCE: num [1:3113] 184680 185376 185376 169734 169733 ...
  .. ..$ POSITION     : Factor w/ 4 levels "CORNER","FORWARD",..: 4 4 4 4 4 4 4 4 4 4 ...
  ..$ RANDOM_SAMPLE_TABLES:List of 1
  .. ..$ OBSERVED: int [1:1000, 1:4] 0 0 0 0 0 1 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
 $ CACHE.FILE             : chr "/home/local/domainclassifyr_cache/1c7f7c30abbde91764ed70e8ddc14d62.RData"
 $ N_CONTACTS             :List of 2
  ..$ ALL       : int 88394
  ..$ HIGH_SCORE: int 3113
 $ FEND.SCORE.DISTRIBUTION:List of 7
  ..$ ALL            : 'table' num [1:4(1d)] 3 645 4697 83049
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ HIGH_SCORE     : 'table' num [1:4(1d)] 0 41 1196 1876
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ SCORES_dALL_mHS: 'table' num [1:4(1d)] 0 197.9 792.7 70.3
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ SCORES_dHS_mALL: 'table' num [1:4(1d)] 0 8.5 1804.6 50048.2
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ SCORES_dALL    : 'table' num [1:4(1d)] 0 0.0636 0.2546 0.0226
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ SCORES_dHS     : 'table' num [1:4(1d)] 0 0.0132 0.3842 0.6026
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  ..$ SCORES         : 'table' num [1:4(1d)] 0 0.0132 0.3842 0.6026
  .. ..- attr(*, "dimnames")=List of 1
  .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
 $ Z_STATISTICS           :List of 1
  ..$ OBSERVED:List of 7
  .. ..$ MEAN: Named num [1:4] 0.105 22.711 165.189 2924.995
  .. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ SDEV: Named num [1:4] 0.313 4.776 12.401 13.192
  .. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ OBS : 'table' num [1:4(1d)] 0 41 1196 1876
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ Z   : 'table' num [1:4(1d)] -0.335 3.829 83.125 -79.52
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ P   : 'table' num [1:4(1d)] 6.31e-01 6.43e-05 0.00 1.00
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ p   : 'table' num [1:4(1d)] 6.31e-01 6.43e-05 0.00 1.00
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
  .. ..$ Q   : Named num [1:4] 1 0.000148 0 1
  .. .. ..- attr(*, "names")= chr [1:4] "CORNER" "FORWARD" "REVERSE" "OTHER"
>```