"All objects to combine must have the same number of columns" should not be forced for intersect.
Closed this issue · 9 comments
All objects to combine must have the same number of columns
should not be forced for intersect as this makes intersecting e.g. to BED files with different kind of additional columns (or a GTF and a BED file) impossible.
In [43]: g_src = GenomicRanges(
...: seqnames = ["chr1", "chr2", "chr1", "chr3", "chr2"],
...: ranges = IRanges(start =[101, 102, 103, 104, 109], width=[112, 103, 128, 134, 111]),
...: strand = ["*", "-", "*", "+", "-"]
...: )
...:
...: g_tgt = GenomicRanges(
...: seqnames = ["chr1","chr2","chr2","chr2","chr1","chr1","chr3","chr3","chr3","chr3"],
...: ranges = IRanges(start =range(101, 111), width=range(121, 131)),
...: strand = ["*", "-", "-", "*", "*", "+", "+", "+", "-", "-"],
...: mcols=BiocFrame(
...: {
...: "score": range(0, 10),
...: "GC": [random() for _ in range(10)],
...: }
...: ),
...: )
/software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/SeqInfo.py:348: UserWarning: 'seqnames' is deprecated, use 'get_seqnames' instead
warn("'seqnames' is deprecated, use 'get_seqnames' instead", UserWarning)
In [44]: g_src.intersect(g_tgt)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[44], line 1
----> 1 g_src.intersect(g_tgt)
File /software/anaconda3/envs/genomicranges/lib/python3.11/lib/python3.11/site-packages/genomicranges/GenomicRanges.py:1948, in GenomicRanges.intersect(self, other)
1945 if not isinstance(other, GenomicRanges):
1946 raise TypeError("'other' is not a `GenomicRanges` object.")
-> 1948 all_combined = _combine_GenomicRanges(self, other)
1949 range_bounds = all_combined.range(ignore_strand=True)
1950 rb_ends = {}
File /software/anaconda3/envs/genomicranges/lib/python3.11/lib/python3.11/site-packages/genomicranges/GenomicRanges.py:2851, in _combine_GenomicRanges(*x)
2843 else:
2844 all_names += [""] * len(y)
2846 return GenomicRanges(
2847 ranges=ut.combine_sequences(*[y._ranges for y in x]),
2848 seqnames=ut.combine_sequences(*[y._seqnames for y in x]),
2849 strand=ut.combine_sequences(*[y._strand for y in x]),
2850 names=all_names,
-> 2851 mcols=ut.combine_rows(*[y._mcols for y in x]),
2852 seqinfo=merge_SeqInfo([y._seqinfo for y in x]),
2853 metadata=x[0]._metadata,
2854 validate=False,
2855 )
File /software/anaconda3/envs/genomicranges/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
905 if not args:
906 raise TypeError(f'{funcname} requires at least '
907 '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)
File /software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py:1340, in _combine_rows_bframes(*x)
1338 has_rownames = True
1339 if df.shape[1] != first_nc:
-> 1340 raise ValueError(
1341 "All objects to combine must have the same number of columns (expected "
1342 + str(first_nc)
1343 + ", got "
1344 + str(df.shape[1])
1345 + ")."
1346 )
1348 new_data = {}
1349 for i, col in enumerate(x[0]._column_names):
ValueError: All objects to combine must have the same number of columns (expected 0, got 2).
Thank you for finding this issue. we did implement a relaxed combine operation for scenario's like this (https://github.com/BiocPy/BiocFrame/blob/master/src/biocframe/BiocFrame.py#L1487). Should be straightforward for me to switch to that.
@ghuls: Just released a new version 0.4.19 that fixes this. Let me know if you run into any more issues. Thanks again for letting us know about this.
The command now works.
But for my actual usecase (intersecting BED file with bigWig with help of biobear for reading), GenomicRanges is way to slow.
It is already running for more than 1 hour.
import biobear as bb
import polars as pl
from genomicranges import GenomicRanges
session = bb.new_session()
bed = session.read_bed_file("consensus_peaks_bicnn.bed", bb.BEDReadOptions(n_fields=4))
%time bed_df = bed.to_polars()
CPU times: user 225 ms, sys: 17.9 ms, total: 243 ms
Wall time: 223 ms
bigwig = session.read_bigwig_file("pybigtools/Astro.bw")
%time bigwig_df = bigwig.to_polars()
CPU times: user 5.49 s, sys: 1.36 s, total: 6.85 s
Wall time: 4.9 s
%time bed_gr = GenomicRanges.from_polars(bed_df.rename({"reference_sequence_name": "seqnames", "start": "starts", "end": "ends"}))
%time bigwig_gr = GenomicRanges.from_polars(bw_df.rename({"name": "seqnames", "start": "starts", "end": "ends"}))
CPU times: user 29.5 s, sys: 2.94 s, total: 32.4 s
Wall time: 32.4 s
In [60]: bed_df.shape
Out[60]: (546993, 4)
In [61]: bigwig_df.shape
Out[61]: (71164307, 4)
%time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)
py-spy top
on the python process gives this output (didn't run it from the start of the intersect call):
Total Samples 118600
GIL: 0.00%, Active: 100.00%, Threads: 2
%Own %Total OwnTime TotalTime Function (filename)
100.00% 100.00% 1184s 1184s get_end (iranges/IRanges.py)
0.00% 100.00% 2.00s 1186s _get_intervals_as_list (iranges/IRanges.py)
0.00% 0.00% 0.100s 0.150s start (iranges/IRanges.py)
0.00% 100.00% 0.080s 1184s end (iranges/IRanges.py)
0.00% 0.00% 0.050s 0.050s get_start (iranges/IRanges.py)
0.00% 100.00% 0.000s 1186s start_ipython (IPython/__init__.py)
0.00% 100.00% 0.000s 1186s run_cell (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s _pseudo_sync_runner (IPython/core/async_helpers.py)
0.00% 100.00% 0.000s 1186s _run_cell (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s run_line_magic (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s start (IPython/terminal/ipapp.py)
0.00% 100.00% 0.000s 1186s run_cell_async (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s run_ast_nodes (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s intersect (genomicranges/GenomicRanges.py)
0.00% 100.00% 0.000s 1186s time (IPython/core/magics/execution.py)
0.00% 100.00% 0.000s 1186s gaps (iranges/IRanges.py)
0.00% 100.00% 0.000s 1186s run_code (IPython/core/interactiveshell.py)
0.00% 100.00% 0.000s 1186s launch_instance (traitlets/config/application.py)
0.00% 100.00% 0.000s 1186s <module> (<ipython-input-6-eb92ded4fa41>)
0.00% 100.00% 0.000s 1186s interact (IPython/terminal/interactiveshell.py)
0.00% 100.00% 0.000s 1186s <module> (ipython)
0.00% 100.00% 0.000s 1186s mainloop (IPython/terminal/interactiveshell.py)
0.00% 100.00% 0.000s 1186s order (iranges/IRanges.py)
0.00% 100.00% 0.000s 1186s <module> (<timed exec>)
0.00% 100.00% 0.000s 1186s gaps (genomicranges/GenomicRanges.py)
Not directly related to thing above, but I noticed in the IRanges is the usage of e.g. max(start)
and min(start)
instead of start.max()
and start.min()
. Not sure what are common sizes for the numpy arrays used in IRanges, but if they are relatively big, calculating statistics with numpy on them, is much faster:
# Calculating max on a 1957 element numpy array.
In [43]: %timeit b = max(a)
25 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [44]: %timeit b = a.max()
77.3 µs ± 3.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Calculating max on a 71164307 element numpy array.
In [47]: %time d = c.max()
CPU times: user 30.3 ms, sys: 445 µs, total: 30.8 ms
Wall time: 30.3 ms
In [48]: %time e = max(c)
CPU times: user 3.25 s, sys: 7.06 ms, total: 3.26 s
Wall time: 3.25 s
interesting, thank you for digging deeper. Also can you let me know what numpy version you have?
@ghuls Can you share the links to the two files you mentioned? "consensus_peaks_bicnn.bed" and "pybigtools/Astro.bw" only if they are publicly available?
I updated to the latest iranges. intersect now is at least able to finish within an hour:
In [7]: %time bed_bigwig_gr = bed_gr.intersect(bigwig_gr)
/software/anaconda3/envs/genomicranges/lib/python3.11/site-packages/genomicranges/SeqInfo.py:348: UserWarning: 'seqnames' is deprecated, use 'get_seqnames' instead
warn("'seqnames' is deprecated, use 'get_seqnames' instead", UserWarning)
CPU times: user 14min 48s, sys: 34.8 s, total: 15min 23s
Wall time: 16min 42s
interesting, thank you for digging deeper. Also can you let me know what numpy version you have?
In [2]: import numpy as np
In [3]: np.__version__
Out[3]: '1.26.4'
In [4]: import genomicranges
In [5]: genomicranges.__version__
Out[5]: '0.4.20'
In [6]: import iranges
In [7]: iranges.__version__
Out[7]: '0.2.9'
Test files are here:
https://temp.aertslab.org/.genomicranges/consensus_peaks_bicnn.bed
https://temp.aertslab.org/.genomicranges/Astro.bw
py-spy top
output during intersect step:
Total Samples 124900
GIL: 0.00%, Active: 0.00%, Threads: 3
%Own %Total OwnTime TotalTime Function (filename)
0.00% 0.00% 444.4s 476.4s gaps (iranges/IRanges.py)
0.00% 0.00% 125.0s 133.1s reduce (iranges/IRanges.py)
0.00% 0.00% 102.3s 107.2s seqnames (genomicranges/SeqInfo.py)
0.00% 0.00% 63.90s 118.6s _group_indices_by_chrm (genomicranges/GenomicRanges.py)
0.00% 0.00% 53.67s 99.49s normalize_subscript (biocutils/normalize_subscript.py)
0.00% 0.00% 47.42s 99.90s <listcomp> (genomicranges/GenomicRanges.py)
0.00% 0.00% 45.82s 45.82s _is_scalar_bool (biocutils/normalize_subscript.py)
0.00% 0.00% 29.39s 32.24s <genexpr> (genomicranges/GenomicRanges.py)
0.00% 0.00% 24.67s 24.67s <lambda> (iranges/IRanges.py)
0.00% 0.00% 10.49s 59.93s __getitem__ (iranges/IRanges.py)
0.00% 0.00% 9.04s 33.71s order (iranges/IRanges.py)
0.00% 0.00% 9.01s 240.9s get_subset (genomicranges/GenomicRanges.py)
0.00% 0.00% 5.11s 9.84s _sanitize_seqnames (genomicranges/GenomicRanges.py)
0.00% 0.00% 5.03s 5.03s <genexpr> (biocutils/subset_sequence.py)
0.00% 0.00% 4.87s 4.87s get_seqnames (genomicranges/SeqInfo.py)
0.00% 0.00% 3.59s 3.59s _amin (numpy/core/_methods.py)
0.00% 0.00% 2.85s 2.85s __len__ (genomicranges/SeqInfo.py)
0.00% 0.00% 2.56s 28.34s _validate_seqnames (genomicranges/GenomicRanges.py)
0.00% 0.00% 2.43s 2.43s _amax (numpy/core/_methods.py)
0.00% 0.00% 2.30s 7.33s _subset_sequence_list (biocutils/subset_sequence.py)
0.00% 0.00% 1.96s 1.96s _unique1d (numpy/lib/arraysetops.py)
0.00% 0.00% 1.40s 173.9s range (genomicranges/GenomicRanges.py)
0.00% 0.00% 1.22s 1003s intersect (genomicranges/GenomicRanges.py)
0.00% 0.00% 1.14s 7.60s _validate_optional_attrs (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.980s 0.980s _combine_sequences_lists (biocutils/combine_sequences.py)
0.00% 0.00% 0.490s 681.3s gaps (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.340s 144.4s reduce (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.330s 1004s <module> (<timed exec>)
0.00% 0.00% 0.290s 0.290s _combine_sequences_dense_arrays (biocutils/combine_sequences.py)
0.00% 0.00% 0.270s 0.270s copy_line (prompt_toolkit/layout/containers.py)
0.00% 0.00% 0.230s 0.230s _sanitize_width (iranges/IRanges.py)
0.00% 0.00% 0.220s 0.270s _validate_width (iranges/IRanges.py)
0.00% 0.00% 0.200s 47.96s __init__ (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.200s 144.6s union (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.190s 0.190s _sanitize_start (iranges/IRanges.py)
0.00% 0.00% 0.140s 0.140s _construct_missing (biocframe/BiocFrame.py)
0.00% 0.00% 0.140s 186.5s setdiff (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.100s 0.100s <listcomp> (biocutils/subset_sequence.py)
0.00% 0.00% 0.050s 0.050s _any (numpy/core/_methods.py)
0.00% 0.00% 0.040s 8.88s wrapper (functools.py)
0.00% 0.00% 0.020s 0.020s get_end (iranges/IRanges.py)
0.00% 0.00% 0.020s 1.98s sanitize_strand_vector (genomicranges/utils.py)
0.00% 0.00% 0.010s 0.010s __init__ (asyncio/events.py)
0.00% 0.00% 0.010s 0.010s _writeout_input_cache (IPython/core/history.py)
0.00% 0.00% 0.000s 0.010s call_soon_threadsafe (asyncio/base_events.py)
0.00% 0.00% 0.000s 1004s _run_cell (IPython/core/interactiveshell.py)
0.00% 0.00% 0.000s 0.010s _bootstrap_inner (threading.py)
0.00% 0.00% 0.000s 0.420s __copy__ (genomicranges/GenomicRanges.py)
0.00% 0.00% 0.000s 2.13s _combine_GenomicRanges (genomicranges/GenomicRanges.py)
https://github.com/BiocPy/IRanges/blob/master/src/iranges/IRanges.py#L117-L127
_sanitize_start and _sanitize_end can be rewriten like this, I think:
def _sanitize_start(self, start):
return np.asarray(start, dtype=np.int32)
From np.asarray
doc string:
Returns
-------
out : ndarray
Array interpretation of `a`. No copy is performed if the input
is already an ndarray with matching dtype and order. If `a` is a
subclass of ndarray, a base class ndarray is returned.
Not great yet, but this now dropped to 3 minutes by switching to NCLS to perform the interval operation (intersect_ncls
is the new method if you want to check it out).
Dropped to a minute and a half. But I'm going to be looking into integrating with c/rust based indexing libraries for faster performance. Thank you for bringing this up. going to close this for now.