BiocPy/GenomicRanges

"All objects to combine must have the same number of columns" should not be forced for intersect.

Closed this issue · 9 comments

"All objects to combine must have the same number of columns" should not be enforced for `intersect`, as this makes it impossible to intersect e.g. two BED files with different kinds of additional columns (or a GTF and a BED file).

In [43]: g_src = GenomicRanges(
    ...:     seqnames = ["chr1", "chr2", "chr1", "chr3", "chr2"],
    ...:     ranges = IRanges(start =[101, 102, 103, 104, 109], width=[112, 103, 128, 134, 111]),
    ...:     strand = ["*", "-", "*", "+", "-"]
    ...: )
    ...: 
    ...: g_tgt = GenomicRanges(
    ...:     seqnames = ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"],
    ...:     ranges = IRanges(start =range(101, 111), width=range(121, 131)),
    ...:     strand = ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"],
    ...:     mcols=BiocFrame(
    ...:         {
    ...:             "score": range(0, 10),
    ...:             "GC": [random() for _ in range(10)],
    ...:         }
    ...:     ),
    ...: )
/software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/SeqInfo.py:348: UserWarning: 'seqnames' is deprecated, use 'get_seqnames' instead
  warn("'seqnames' is deprecated, use 'get_seqnames' instead", UserWarning)

In [44]: g_src.intersect(g_tgt)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[44], line 1
----> 1 g_src.intersect(g_tgt)

File /software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/GenomicRanges.py:1948, in GenomicRanges.intersect(self, other)
   1945 if not isinstance(other, GenomicRanges):
   1946     raise TypeError("'other' is not a `GenomicRanges` object.")
-> 1948 all_combined = _combine_GenomicRanges(self, other)
   1949 range_bounds = all_combined.range(ignore_strand=True)
   1950 rb_ends = {}

File /software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/GenomicRanges.py:2851, in _combine_GenomicRanges(*x)
   2843         else:
   2844             all_names += [""] * len(y)
   2846 return GenomicRanges(
   2847     ranges=ut.combine_sequences(*[y._ranges for y in x]),
   2848     seqnames=ut.combine_sequences(*[y._seqnames for y in x]),
   2849     strand=ut.combine_sequences(*[y._strand for y in x]),
   2850     names=all_names,
-> 2851     mcols=ut.combine_rows(*[y._mcols for y in x]),
   2852     seqinfo=merge_SeqInfo([y._seqinfo for y in x]),
   2853     metadata=x[0]._metadata,
   2854     validate=False,
   2855 )

File /software/anaconda3/envs/genomicranges/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py:1340, in _combine_rows_bframes(*x)
   1338         has_rownames = True
   1339     if df.shape[1] != first_nc:
-> 1340         raise ValueError(
   1341             "All objects to combine must have the same number of columns (expected "
   1342             + str(first_nc)
   1343             + ", got "
   1344             + str(df.shape[1])
   1345             + ")."
   1346         )
   1348 new_data = {}
   1349 for i, col in enumerate(x[0]._column_names):

ValueError: All objects to combine must have the same number of columns (expected 0, got 2).

Thank you for finding this issue. We did implement a relaxed combine operation for scenarios like this (https://github.com/BiocPy/BiocFrame/blob/master/src/biocframe/BiocFrame.py#L1487). It should be straightforward for me to switch to that.
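To illustrate the idea (a sketch only, not the actual BiocFrame code): a relaxed row-combine keeps the union of all column names and pads rows that are missing a column with `None`, instead of raising. The helper name and the `(n_rows, columns)` representation here are made up for the example.

```python
# Sketch of a "relaxed" row-combine for column-keyed tables: keep the union
# of all column names and pad missing columns with None instead of raising.
# Illustration only -- not the actual BiocFrame implementation.

def relaxed_combine_rows(*frames):
    """Each frame is a (n_rows, columns) pair; columns maps name -> list."""
    all_cols = []
    for _, cols in frames:
        for c in cols:
            if c not in all_cols:
                all_cols.append(c)
    combined = {c: [] for c in all_cols}
    total = 0
    for n, cols in frames:
        for c in all_cols:
            combined[c].extend(cols.get(c, [None] * n))
        total += n
    return total, combined

# A 2-row frame with no metadata columns combined with a 3-row frame
# that has "score" and "GC" columns:
n_rows, cols = relaxed_combine_rows(
    (2, {}),
    (3, {"score": [1, 2, 3], "GC": [0.5, 0.6, 0.7]}),
)
```

This mirrors the failing example above, where one `GenomicRanges` has empty `mcols` and the other carries `score` and `GC`.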

@ghuls: Just released a new version 0.4.19 that fixes this. Let me know if you run into any more issues. Thanks again for letting us know about this.

The command now works.

But for my actual use case (intersecting a BED file with a bigWig file, using biobear for reading), GenomicRanges is way too slow.
It has already been running for more than 1 hour.

import biobear as bb
import polars as pl

from genomicranges import GenomicRanges

session = bb.new_session()

bed = session.read_bed_file("consensus_peaks_bicnn.bed", bb.BEDReadOptions(n_fields=4))

%time bed_df = bed.to_polars()
CPU times: user 225 ms, sys: 17.9 ms, total: 243 ms
Wall time: 223 ms

bigwig = session.read_bigwig_file("pybigtools/Astro.bw")

%time bigwig_df = bigwig.to_polars()
CPU times: user 5.49 s, sys: 1.36 s, total: 6.85 s
Wall time: 4.9 s

%time bed_gr = GenomicRanges.from_polars(bed_df.rename({"reference_sequence_name": "seqnames", "start": "starts", "end": "ends"}))
%time bigwig_gr = GenomicRanges.from_polars(bigwig_df.rename({"name": "seqnames", "start": "starts", "end": "ends"}))

CPU times: user 29.5 s, sys: 2.94 s, total: 32.4 s
Wall time: 32.4 s

In [60]: bed_df.shape
Out[60]: (546993, 4)

In [61]: bigwig_df.shape
Out[61]: (71164307, 4)

%time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)

`py-spy top` on the Python process gives this output (I didn't run it from the start of the `intersect` call):

Total Samples 118600
GIL: 0.00%, Active: 100.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                               
100.00% 100.00%    1184s     1184s   get_end (iranges/IRanges.py)
  0.00% 100.00%    2.00s     1186s   _get_intervals_as_list (iranges/IRanges.py)
  0.00%   0.00%   0.100s    0.150s   start (iranges/IRanges.py)
  0.00% 100.00%   0.080s     1184s   end (iranges/IRanges.py)
  0.00%   0.00%   0.050s    0.050s   get_start (iranges/IRanges.py)
  0.00% 100.00%   0.000s     1186s   start_ipython (IPython/__init__.py)
  0.00% 100.00%   0.000s     1186s   run_cell (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   _pseudo_sync_runner (IPython/core/async_helpers.py)
  0.00% 100.00%   0.000s     1186s   _run_cell (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   run_line_magic (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   start (IPython/terminal/ipapp.py)
  0.00% 100.00%   0.000s     1186s   run_cell_async (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   run_ast_nodes (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   intersect (genomicranges/GenomicRanges.py)
  0.00% 100.00%   0.000s     1186s   time (IPython/core/magics/execution.py)
  0.00% 100.00%   0.000s     1186s   gaps (iranges/IRanges.py)
  0.00% 100.00%   0.000s     1186s   run_code (IPython/core/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   launch_instance (traitlets/config/application.py)
  0.00% 100.00%   0.000s     1186s   <module> (<ipython-input-6-eb92ded4fa41>)
  0.00% 100.00%   0.000s     1186s   interact (IPython/terminal/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   <module> (ipython)
  0.00% 100.00%   0.000s     1186s   mainloop (IPython/terminal/interactiveshell.py)
  0.00% 100.00%   0.000s     1186s   order (iranges/IRanges.py)
  0.00% 100.00%   0.000s     1186s   <module> (<timed exec>)
  0.00% 100.00%   0.000s     1186s   gaps (genomicranges/GenomicRanges.py)
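Since `get_end` dominates the profile above, one likely win is computing all ends in a single vectorized expression rather than per element. A sketch, assuming the Bioconductor convention of closed intervals where an interval's end is `start + width - 1`:

```python
import numpy as np

# Ends for a (start, width) representation, computed for the whole object
# at once rather than one element at a time. Assumes closed intervals with
# end = start + width - 1, as in Bioconductor's IRanges.
starts = np.array([101, 102, 103, 104, 109], dtype=np.int32)
widths = np.array([112, 103, 128, 134, 111], dtype=np.int32)
ends = starts + widths - 1
```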

Not directly related to the above, but I noticed that IRanges uses e.g. `max(start)` and `min(start)` instead of `start.max()` and `start.min()`. I'm not sure what the common sizes are for the numpy arrays used in IRanges, but if they are relatively big, calculating statistics on them with numpy is much faster:

# Calculating max on a 1957 element numpy array.
In [43]: %timeit b = max(a)
25 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [44]: %timeit b = a.max()
77.3 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Calculating max on a 71164307 element numpy array.
In [47]: %time d = c.max()
CPU times: user 30.3 ms, sys: 445 µs, total: 30.8 ms
Wall time: 30.3 ms

In [48]: %time e = max(c)
CPU times: user 3.25 s, sys: 7.06 ms, total: 3.26 s
Wall time: 3.25 s

Interesting, thank you for digging deeper. Also, can you let me know which numpy version you have?

@ghuls Can you share links to the two files you mentioned, "consensus_peaks_bicnn.bed" and "pybigtools/Astro.bw", if they are publicly available?

I updated to the latest iranges. `intersect` is now at least able to finish within an hour:

In [7]: %time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)
/software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/SeqInfo.py:348: UserWarning: 'seqnames' is deprecated, use 'get_seqnames' instead
  warn("'seqnames' is deprecated, use 'get_seqnames' instead", UserWarning)
CPU times: user 14min 48s, sys: 34.8 s, total: 15min 23s
Wall time: 16min 42s

In [2]: import numpy as np

In [3]: np.__version__
Out[3]: '1.26.4'

In [4]: import genomicranges

In [5]: genomicranges.__version__
Out[5]: '0.4.20'

In [6]: import iranges

In [7]: iranges.__version__
Out[7]: '0.2.9'

Test files are here:
https://temp.aertslab.org/.genomicranges/consensus_peaks_bicnn.bed
https://temp.aertslab.org/.genomicranges/Astro.bw

`py-spy top` output during the intersect step:

Total Samples 124900
GIL: 0.00%, Active: 0.00%, Threads: 3

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                               
  0.00%   0.00%   444.4s    476.4s   gaps (iranges/IRanges.py)
  0.00%   0.00%   125.0s    133.1s   reduce (iranges/IRanges.py)
  0.00%   0.00%   102.3s    107.2s   seqnames (genomicranges/SeqInfo.py)
  0.00%   0.00%   63.90s    118.6s   _group_indices_by_chrm (genomicranges/GenomicRanges.py)
  0.00%   0.00%   53.67s    99.49s   normalize_subscript (biocutils/normalize_subscript.py)
  0.00%   0.00%   47.42s    99.90s   <listcomp> (genomicranges/GenomicRanges.py)
  0.00%   0.00%   45.82s    45.82s   _is_scalar_bool (biocutils/normalize_subscript.py)
  0.00%   0.00%   29.39s    32.24s   <genexpr> (genomicranges/GenomicRanges.py)
  0.00%   0.00%   24.67s    24.67s   <lambda> (iranges/IRanges.py)
  0.00%   0.00%   10.49s    59.93s   __getitem__ (iranges/IRanges.py)
  0.00%   0.00%    9.04s    33.71s   order (iranges/IRanges.py)
  0.00%   0.00%    9.01s    240.9s   get_subset (genomicranges/GenomicRanges.py)
  0.00%   0.00%    5.11s     9.84s   _sanitize_seqnames (genomicranges/GenomicRanges.py)
  0.00%   0.00%    5.03s     5.03s   <genexpr> (biocutils/subset_sequence.py)
  0.00%   0.00%    4.87s     4.87s   get_seqnames (genomicranges/SeqInfo.py)
  0.00%   0.00%    3.59s     3.59s   _amin (numpy/core/_methods.py)
  0.00%   0.00%    2.85s     2.85s   __len__ (genomicranges/SeqInfo.py)
  0.00%   0.00%    2.56s    28.34s   _validate_seqnames (genomicranges/GenomicRanges.py)
  0.00%   0.00%    2.43s     2.43s   _amax (numpy/core/_methods.py)
  0.00%   0.00%    2.30s     7.33s   _subset_sequence_list (biocutils/subset_sequence.py)
  0.00%   0.00%    1.96s     1.96s   _unique1d (numpy/lib/arraysetops.py)
  0.00%   0.00%    1.40s    173.9s   range (genomicranges/GenomicRanges.py)
  0.00%   0.00%    1.22s     1003s   intersect (genomicranges/GenomicRanges.py)
  0.00%   0.00%    1.14s     7.60s   _validate_optional_attrs (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.980s    0.980s   _combine_sequences_lists (biocutils/combine_sequences.py)
  0.00%   0.00%   0.490s    681.3s   gaps (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.340s    144.4s   reduce (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.330s     1004s   <module> (<timed exec>)
  0.00%   0.00%   0.290s    0.290s   _combine_sequences_dense_arrays (biocutils/combine_sequences.py)
  0.00%   0.00%   0.270s    0.270s   copy_line (prompt_toolkit/layout/containers.py)
  0.00%   0.00%   0.230s    0.230s   _sanitize_width (iranges/IRanges.py)
  0.00%   0.00%   0.220s    0.270s   _validate_width (iranges/IRanges.py)
  0.00%   0.00%   0.200s    47.96s   __init__ (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.200s    144.6s   union (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.190s    0.190s   _sanitize_start (iranges/IRanges.py)
  0.00%   0.00%   0.140s    0.140s   _construct_missing (biocframe/BiocFrame.py)
  0.00%   0.00%   0.140s    186.5s   setdiff (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.100s    0.100s   <listcomp> (biocutils/subset_sequence.py)
  0.00%   0.00%   0.050s    0.050s   _any (numpy/core/_methods.py)
  0.00%   0.00%   0.040s     8.88s   wrapper (functools.py)
  0.00%   0.00%   0.020s    0.020s   get_end (iranges/IRanges.py)
  0.00%   0.00%   0.020s     1.98s   sanitize_strand_vector (genomicranges/utils.py)
  0.00%   0.00%   0.010s    0.010s   __init__ (asyncio/events.py)
  0.00%   0.00%   0.010s    0.010s   _writeout_input_cache (IPython/core/history.py)
  0.00%   0.00%   0.000s    0.010s   call_soon_threadsafe (asyncio/base_events.py)
  0.00%   0.00%   0.000s     1004s   _run_cell (IPython/core/interactiveshell.py)
  0.00%   0.00%   0.000s    0.010s   _bootstrap_inner (threading.py)
  0.00%   0.00%   0.000s    0.420s   __copy__ (genomicranges/GenomicRanges.py)
  0.00%   0.00%   0.000s     2.13s   _combine_GenomicRanges (genomicranges/GenomicRanges.py)

https://github.com/BiocPy/IRanges/blob/master/src/iranges/IRanges.py#L117-L127

I think `_sanitize_start` and `_sanitize_end` can be rewritten like this:

def _sanitize_start(self, start):
    return np.asarray(start, dtype=np.int32)

From the `np.asarray` docstring:

Returns
-------
out : ndarray
    Array interpretation of `a`.  No copy is performed if the input
    is already an ndarray with matching dtype and order.  If `a` is a
    subclass of ndarray, a base class ndarray is returned.
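That no-copy behavior is easy to confirm, which is why an `asarray`-based sanitizer is essentially free for inputs that are already clean int32 arrays (variable names below are just for the demo):

```python
import numpy as np

# np.asarray only copies when it has to: passing an ndarray that already
# has the requested dtype returns the very same object.
a = np.array([101, 102, 103], dtype=np.int32)
same = np.asarray(a, dtype=np.int32)       # matching dtype: no copy

b = np.asarray([101, 102, 103], dtype=np.int32)  # list input: new array

c = np.asarray(a, dtype=np.int64)          # dtype mismatch: copy is made
```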

Not great yet, but the runtime has now dropped to 3 minutes by switching to NCLS to perform the interval operations (`intersect_ncls` is the new method, if you want to check it out).

Dropped to a minute and a half. But I'm going to look into integrating with C/Rust-based indexing libraries for faster performance. Thank you for bringing this up. Going to close this for now.
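For anyone curious about where the speedup comes from: index-based approaches replace Python-level scanning with binary searches or containment lists. The special case of sorted, non-overlapping subject intervals can be sketched with plain numpy `searchsorted` (a sketch only, with made-up helper names, not the actual NCLS or `intersect_ncls` implementation; NCLS generalizes this to arbitrary overlapping interval sets):

```python
import numpy as np

# Overlap queries against sorted, NON-overlapping subject intervals
# (e.g. reduced peaks), using two binary searches per query instead of a
# Python loop. Intervals are half-open [start, end).
def overlap_sorted(query_starts, query_ends, subj_starts, subj_ends):
    # first subject whose end lies beyond the query start
    lo = np.searchsorted(subj_ends, query_starts, side="right")
    # first subject starting at or after the query end
    hi = np.searchsorted(subj_starts, query_ends, side="left")
    return lo, hi  # query i overlaps subjects lo[i] .. hi[i]-1

subj_s = np.array([100, 200, 300])
subj_e = np.array([150, 250, 350])
lo, hi = overlap_sorted(np.array([120, 160, 240]),
                        np.array([210, 190, 310]),
                        subj_s, subj_e)
# query 0 hits subjects 0-1, query 1 hits nothing, query 2 hits subjects 1-2
```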