nvictus/pybbi

pybii+dataframes

gfudenberg opened this issue · 3 comments

There are two patterns that come up a lot at the intersection of pybbi/bioframe:

  1. generate a matrix of [n_intervals x nbins] with binned ChIP signal across each of the intervals in an interval_df (then often visualized as a heatmap)
  2. generate a vector of [n_intervals x 1 bins] of total ChIP signal across each interval and append the result to the input interval_df

Is this something that would make sense in pybbi? in bioframe? elsewhere?

Example for (1) below:

def generate_signal_matrix(
    interval_df,
    chip_seq_file,
    columns=["chrom", "start", "end"],
    window_size=1000,
    window_type="extend",
    nbins=40,
):
    """
    Uses pybbi to measure signal over a set of input intervals.
    Returns a matrix [n_intervals x nbins] of the average of ChIP signal over
    each bin in the matrix.

    Parameters:
    -----------
    interval_df: pandas dataframe that has the list of intervals (usually,
                set of genes)
    chip_seq_file: filepath to ChIP-seq file of type bigWig or bigBed
    columns: the columns in interval_df that define the interval
    window_size: the distance of the
    window_type: 'extend': window_size defines padding length added to ends
                 'centered': window_size extends out from each side of middle
    nbins: number of bins the window is divided into.
    Returns:
    --------
    matrix: np.array of shape [n_intervals x nbins]
            rows correspond to intervals in interval_df and ChIP signal measured
            across column bins.
    """

    intervals = bioframe.from_any(interval_df[columns])
    intervals = intervals.rename(
        columns={columns[0]: "chrom", columns[1]: "start", columns[2]: "end"}
    )
    intervals = bioframe.sanitize_bedframe(intervals)

    if window_type == "extend":
        shapes = pd.Series(intervals["start"] - intervals["end"]).nunique()
        msg = """The size of intervals should be equal to perform stackup.
                 Try window_type: 'centered'"""
        assert shapes == 1, msg

        intervals = bioframe.expand(intervals, pad=window_size)

    if window_type == "centered":
        intervals = bioframe.expand(bioframe.expand(expanded, scale=0), pad=1000)

    with bbi.open(chip_seq_file) as f:
        matrix = f.stackup(
            intervals["chrom"], intervals["start"], intervals["end"], bins=nbins
        )

    return matrix

re: (1), i'd say it would fit in pybbi very well. One question is though - would we like pybbi to depend on bioframe or vice versa? I'd vote for the former, as bioframe is a pure python dependency, while pybbi requires compiling.
re: (2), agreed, but i'd suggest returning a column by default (e.g., have a kwarg append=False)

good suggestions! Re: (1), that was my thinking as well.

any thoughts @nvictus ?

Discussed this with @gfudenberg and @GarrettNg.

Conclusion was that this functionality would best start its life in the bioframe sandbox. It could mature into bioframe proper, or if we decide to keep bioframe lean, it may graduate to its own project.

For several reasons, I think pybbi should try to remain a low-level access API over what UCSC's source code provides to interface to numpy/pandas.

There are already a couple fileops implementations in bioframe that allow multiple bbi "engine" backends (pybbi and pybigwig) and I think this pattern is good because different engines have strengths and weaknesses (e.g. some UCSC ops just fail on ENCODE bigWig files). Neither of these libraries are hard dependencies in bioframe, and an error message indicates to the user that at least one of them needs to be installed -- this was modeled off the pyarrow/fastparquet soft dependence in pandas.