sjessa/chromswitch

Question about usage with ChromHMM segmentation

josruirod opened this issue · 2 comments

Hello

Thank you for the package, I'm going to give it a try in my project.
However, I'm failing to find detail instructions on how to use it with previous ChromHMM segmentations or what would be the interpretation
For example, if I'm coming from a set of genomic regions of interest, and I have the chromHMM segmentation of 2 different biological conditions, could I get the regions with changing segments between conditions? And would I be able to identify the change that is happening? (the segment/state in one conditions and the one in the other). It confuses me that in the vignette only H3k4me3 peaks are being using, and I would have more than one state/segment in the segmentations.

Coming from the vignette, it would be enough if I load the segmentations in each slot of a genomic ranges list instead of the peaks?

Also, do you have any comment on the time each run of callSummary or callBinary would take?

Thank you for your time

Hi! Sorry for the delay in getting back to you - thanks for checking out the package.

Here are some options for running chromswitch using the output of ChromHMM, with different ways of dealing with the multiple states inferred by ChromHMM.

  1. You could split the BED files from ChromHMM and call switches separately for each state to identify regions of change between conditions.

  2. If there's a state you're particularly interested in, you can filter the ChromHMM BED files to regions with that state assignment, and treat them as peaks. In the chromswitch paper, we used the ChromHMM model from the Roadmap Epigenomics Project and filtered to regions with the state '1_TssA' (active transcription). My code is online here: (see the end of section 3 for importing ChromHMM tracks), and here is the relevant snippet:

# Set 'destpaths' to be a list of paths to BED files containing chromHMM state assignments 
destpaths <- ...

extra_cols <- c("State" = "character")

# Import BED files
chmm_states <- lapply(destpaths_st, rtracklayer::import, format = "bed",
                   extraCols = extra_cols) %>% 
    lapply(as.data.frame) %>% 
    # We will treat regions assigned the state associated with 'active
    # transcription' as the epigenomic features in our analysis
    lapply(filter, State == "1_TssA") %>% 
    lapply(GenomicRanges::makeGRangesFromDataFrame)

# The above command returns a list of GRanges ranges, one per sample
# Name the list according to sample
names(chmm_states) <- ...

# Now the 'chmm_states' object can be passed to the 'peaks' argument
# of chromswitch::callSummary() or chromswitch::callBinary()
  1. I haven't described this in the vignette, but another option is to run the filtering, normalization and feature matrix construction individually for each state (after splitting BED files, importing them, etc) so that you would have one matrix per ChromHMM state, and then concatenate the feature matrices all together before clustering. You could follow section 5 in the vignette for each state, bind the columns of all your feature matrices (they should all have rows in the same order), and finally run chromswitch::cluster(). I originally had this method in mind for combining different data types (e.g. H3K4me & DNase I), but I suppose it could work for different ChromHMM states as well.

Regarding the runtime, it depends on your computational resources, but if parallelization is used, it is possible to run over hundreds of query regions in less than an hour. If you are testing many regions, I recommend to set heatmap = FALSE in the callSummary and callBinary functions, and then generate those later for regions of interest.

Hope that helps!

I'll give it a try, thank you