Allow start==stop for point data
Closed this issue · 5 comments
Summary
I think you should be able to indicate the same start and stop column when you've got point data like SNPs.
There may be cases where this doesn't make sense, but I think there can be different validations for these cases.
Details
From the definitions:
Interval:
- An interval is a tuple of integers (start, end) with start <= end.
  - A special case where start and end are the same, i.e. [X, X), is interpreted as a point (aka an empty interval, i.e. an edge between 1-bp bins). A point has zero length.
Buuut, if I have point data like:
import pandas as pd
import bioframe
df = pd.DataFrame({
    "contig": pd.Categorical.from_codes([0, 0, 0, 1, 2, 2], ["chr1", "chr2", "chr3"]),
    "position": [1, 100, 200, 300, 301, 306],
})
bioframe.select(df, "chr1:50-250", ("contig", "position", "position"))
# ValueError: column names must be unique
Full ValueError traceback
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [16], line 1
----> 1 bioframe.select(df, "chr1:50-250", ("contig", "position", "position"))
File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/ops.py:55, in select(df, region, cols)
31 """
32 Return all genomic intervals in a dataframe that overlap a genomic region.
33
(...)
51
52 """
54 ck, sk, ek = _get_default_colnames() if cols is None else cols
---> 55 checks.is_bedframe(df, raise_errors=True, cols=[ck, sk, ek])
57 chrom, start, end = parse_region(region)
58 if chrom is None:
File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/core/checks.py:54, in is_bedframe(df, raise_errors, cols)
24 """
25 Checks that required bedframe properties are satisfied for dataframe `df`.
26
(...)
50
51 """
52 ck1, sk1, ek1 = _get_default_colnames() if cols is None else cols
---> 54 if not _verify_columns(df, [ck1, sk1, ek1], return_as_bool=True):
55 if raise_errors:
56 raise TypeError("Invalid bedFrame: Invalid column names")
File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/core/specs.py:82, in _verify_columns(df, colnames, return_as_bool)
79 raise ValueError("df is not a dataframe")
81 if len(set(colnames)) < len(colnames):
---> 82 raise ValueError("column names must be unique")
84 if not set(colnames).issubset(df.columns):
85 if return_as_bool:
ValueError: column names must be unique
This can be worked around by making a copy of the column:
df["position_copy"] = df["position"]
bioframe.select(df, "chr1:50-250", ("contig", "position", "position_copy"))
But this isn't ideal. Instead, I think it would be nice if I could either use the same column for start and end or use a 2-tuple argument to indicate point data. I think the former would be more in line with the definitions, in particular "[X, X), is interpreted as a point".
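Until something like that is supported, the copy can at least be made without mutating the original frame. A minimal sketch, assuming a hypothetical helper name `with_point_end` (not part of bioframe) and relying only on pandas' `DataFrame.assign`:

```python
import pandas as pd


def with_point_end(df, pos_col, end_col="point_end", offset=0):
    """Return a copy of df with an explicit end column derived from pos_col.

    Hypothetical helper, not part of bioframe. offset=0 gives the
    zero-length point [x, x); offset=1 gives the 1-bp interval [x, x+1).
    """
    if end_col in df.columns:
        raise ValueError(f"column {end_col!r} already exists")
    # assign() returns a new DataFrame, so the caller's frame is untouched.
    return df.assign(**{end_col: df[pos_col] + offset})


df = pd.DataFrame({
    "contig": pd.Categorical.from_codes([0, 0, 0, 1, 2, 2], ["chr1", "chr2", "chr3"]),
    "position": [1, 100, 200, 300, 301, 306],
})
points = with_point_end(df, "position")
```

The result could then be passed as bioframe.select(points, "chr1:50-250", ("contig", "position", "point_end")), leaving df itself without the extra column.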
More real-world example
As a motivating example, it would be nice to be able to use bioframe more natively with sgkit, where this would look like:
import sgkit as sg
# Using example data from https://pystatgen.github.io/sgkit/latest/examples/gwas_tutorial.html
ds = sg.load_dataset("1kg.zarr")
df_variant = (
    ds
    .drop_dims(set(ds.dims) - set(["variants"]))
    .to_dataframe()
    .assign(
        variant_contig_name=lambda x: pd.Categorical.from_codes(x["variant_contig"], ds.contigs)
    )
)
bioframe.select(
    df_variant,
    "X:152660490-153706321",
    ("variant_contig_name", "variant_position", "variant_position"),
)
But currently has to be:
df_variant["variant_position_copy"] = df_variant["variant_position"]
bioframe.select(
    df_variant,
    "X:152660490-153706321",
    ("variant_contig_name", "variant_position", "variant_position_copy"),
)
Thanks for bringing this up. We also ran into this in make_chromarms() and solved it similarly inelegantly...
Off the top of my head, expand would need some special treatment, but other things might just work if we relaxed the behavior of _verify_columns.
There may have been some benefit of strictly requiring unique columns since there is a closed issue about cryptic errors with repeated columns: #61. I can't remember what that may have been, though... perhaps others might? cc @Phlya @nvictus @golobor
Are there any new operations that are done frequently with chrom,pos dataframes (pointFrames?) that wouldn't be done with chrom,start,end bedframes?
Following a brief discussion: any thoughts on end=start vs end=start+1? When would one want one vs the other in the real world?
How about either:
- we add the step of inserting an extra column for "end" internally in bioframe and assign it start+1 (or just start?)
- you simply add df_variant.assign(variant_position_end=df_variant['variant_position']+1) (or no +1) inside the bioframe.select call; perhaps this is more convenient than assigning in a separate line
- potentially we could also simply relax the restriction on column names, after looking into the cryptic error described above
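The choice between start and start+1 matters at region boundaries. A small sketch, assuming a generic strict half-open overlap predicate (the function `overlaps` is illustrative and may not match the rule bioframe.select applies internally):

```python
def overlaps(start, end, q_start, q_end):
    """Strict half-open overlap test: [start, end) vs [q_start, q_end).

    Generic predicate for illustration; bioframe's own rule may differ.
    """
    return start < q_end and end > q_start


q_start, q_end = 50, 250  # query region, as in "chr1:50-250"
for pos in (49, 50, 249, 250):
    as_point = overlaps(pos, pos, q_start, q_end)    # end = start
    as_1bp = overlaps(pos, pos + 1, q_start, q_end)  # end = start + 1
    print(pos, as_point, as_1bp)
```

Under this predicate the two conventions disagree only for a position equal to the region start: the zero-length point [50, 50) is excluded, while the 1-bp interval [50, 51) is included.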
Probably still need to patch expand and trim by adding a kwarg to is_bedframe for unique column verification.
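A rough sketch of what such a kwarg could look like, written as a standalone stand-in rather than bioframe's actual code. The function name `verify_columns` and the keyword `unique_cols` are assumptions for illustration; only the uniqueness test itself is taken from the traceback above:

```python
def verify_columns(available, colnames, unique_cols=True):
    """Check that colnames exist in available, optionally requiring uniqueness.

    Hypothetical stand-in for bioframe's _verify_columns; unique_cols is
    an assumed name for the proposed keyword argument.
    """
    # Same uniqueness test as in the traceback, now behind a switch.
    if unique_cols and len(set(colnames)) < len(colnames):
        raise ValueError("column names must be unique")
    if not set(colnames).issubset(available):
        raise ValueError("columns missing from dataframe")
    return True


available = ["contig", "position"]
cols = ("contig", "position", "position")

# Relaxed check accepts the same column for start and end.
relaxed_ok = verify_columns(available, cols, unique_cols=False)

# Default (strict) check still rejects repeated names.
try:
    verify_columns(available, cols)
    strict_ok = True
except ValueError:
    strict_ok = False
```

Keeping the strict behavior as the default would leave functions like expand and trim, which write to the start and end columns separately, protected unless a caller opts in.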
Closing as resolved. If patches are needed for expand and trim, we can open a new issue.