open2c/bioframe

Allow start==stop for point data


Summary

I think you should be able to indicate the same start and stop column when you've got point data like SNPs.

There may be cases where this doesn't make sense, but I think there can be different validations for these cases.

Details

From the definitions:

Interval:

  • An interval is a tuple of integers (start, end) with start <= end.
    ...
  • A special case where start and end are the same, i.e. [X, X), is interpreted as a point (aka an empty interval, i.e. an edge between 1-bp bins). A point has zero length.

But if I have point data like this:

import pandas as pd, bioframe

df = pd.DataFrame({
    "contig": pd.Categorical.from_codes([0, 0, 0, 1, 2, 2], ["chr1", "chr2", "chr3"]),
    "position": [1, 100, 200, 300, 301, 306]
})

bioframe.select(df, "chr1:50-250", ("contig", "position", "position"))
# ValueError: column names must be unique

Full ValueError traceback:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [16], line 1
----> 1 bioframe.select(df, "chr1:50-250", ("contig", "position", "position"))

File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/ops.py:55, in select(df, region, cols)
     31 """
     32 Return all genomic intervals in a dataframe that overlap a genomic region.
     33 
   (...)
     51 
     52 """
     54 ck, sk, ek = _get_default_colnames() if cols is None else cols
---> 55 checks.is_bedframe(df, raise_errors=True, cols=[ck, sk, ek])
     57 chrom, start, end = parse_region(region)
     58 if chrom is None:

File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/core/checks.py:54, in is_bedframe(df, raise_errors, cols)
     24 """
     25 Checks that required bedframe properties are satisfied for dataframe `df`.
     26 
   (...)
     50 
     51 """
     52 ck1, sk1, ek1 = _get_default_colnames() if cols is None else cols
---> 54 if not _verify_columns(df, [ck1, sk1, ek1], return_as_bool=True):
     55     if raise_errors:
     56         raise TypeError("Invalid bedFrame: Invalid column names")

File ~/miniconda3/envs/sgkit-bioframe/lib/python3.10/site-packages/bioframe/core/specs.py:82, in _verify_columns(df, colnames, return_as_bool)
     79     raise ValueError("df is not a dataframe")
     81 if len(set(colnames)) < len(colnames):
---> 82     raise ValueError("column names must be unique")
     84 if not set(colnames).issubset(df.columns):
     85     if return_as_bool:

ValueError: column names must be unique

This can be worked around by making a copy of the column:

df["position_copy"] = df["position"]

bioframe.select(df, "chr1:50-250", ("contig", "position", "position_copy"))

But this isn't ideal. Instead, I think it would be nice if I could either use the same column for start and end or use a 2-tuple argument to indicate point data. I think the former would be more in line with the definitions, in particular "[X, X), is interpreted as a point".
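To make the two proposals concrete, here is a minimal sketch of how a 2-tuple (or a 3-tuple with a repeated name) could be normalized into three distinct column names before a bioframe call. The helper `as_interval_cols` and the `_end` suffix are hypothetical and not part of bioframe; the derived end column equals the position, matching the "[X, X)" point definition.

```python
import pandas as pd


def as_interval_cols(df, cols, end_suffix="_end"):
    """Return (frame, (chrom, start, end)) with three distinct column names.

    Hypothetical helper, not part of bioframe. Accepts a 2-tuple
    (chrom, pos) or a 3-tuple whose start and end names coincide, and adds
    a derived end column equal to the position, i.e. the point [X, X).
    """
    if len(cols) == 2:
        chrom, start = cols
        end = start
    elif len(cols) == 3:
        chrom, start, end = cols
    else:
        raise ValueError("cols must have 2 or 3 entries")

    if start != end:
        # Already a regular interval spec; nothing to derive.
        return df, (chrom, start, end)

    end = f"{start}{end_suffix}"
    if end in df.columns:
        raise ValueError(f"column {end!r} already exists")
    # assign() returns a new frame, so the caller's dataframe is not mutated.
    return df.assign(**{end: df[start]}), (chrom, start, end)


df = pd.DataFrame({
    "contig": pd.Categorical.from_codes([0, 0, 0, 1, 2, 2], ["chr1", "chr2", "chr3"]),
    "position": [1, 100, 200, 300, 301, 306],
})
points, cols = as_interval_cols(df, ("contig", "position"))
```

The result could then be passed on as `bioframe.select(points, "chr1:50-250", cols)`, which is the same workaround as above without touching the original dataframe.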

More real-world example

As a motivating example, it would be nice to be able to use bioframe more natively with sgkit. Ideally, that would look like:

import sgkit as sg

# Using example data from https://pystatgen.github.io/sgkit/latest/examples/gwas_tutorial.html
ds = sg.load_dataset("1kg.zarr")

df_variant = (
    ds
    .drop_dims(set(ds.dims) - set(["variants"]))
    .to_dataframe()
    .assign(
        variant_contig_name=lambda x: pd.Categorical.from_codes(x["variant_contig"], ds.contigs)
    )
)

bioframe.select(
    df_variant,
    "X:152660490-153706321",
    ("variant_contig_name", "variant_position", "variant_position")
)

But currently has to be:

df_variant["variant_position_copy"] = df_variant["variant_position"]

bioframe.select(
    df_variant,
    "X:152660490-153706321",
    ("variant_contig_name", "variant_position", "variant_position_copy")
)

Thanks for bringing this up. We also ran into this in make_chromarms() and solved it in a similarly inelegant way.

Off the top of my head, expand would need some special treatment but other things might just work if we relaxed the behavior of _verify_columns.
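One way an operation like expand could go wrong with a shared column, sketched under the assumption of a naive implementation that writes start and end back by column name. `naive_expand` is a hypothetical stand-in and is not bioframe's actual expand.

```python
import pandas as pd


def naive_expand(df, pad, cols):
    """Hypothetical expand: shift start left and end right by `pad`."""
    _, sk, ek = cols
    out = df.copy()
    out[sk] = out[sk] - pad
    out[ek] = out[ek] + pad
    return out


df = pd.DataFrame({"contig": ["chr1", "chr1"], "position": [100, 200]})

# Start and end name the same column, so the -pad and +pad updates cancel out.
aliased = naive_expand(df, 10, ("contig", "position", "position"))

# With a separate end column, the point grows into a 2 * pad wide interval.
copied = naive_expand(
    df.assign(position_end=df["position"]),
    10,
    ("contig", "position", "position_end"),
)
```

In the aliased case the frame comes back unchanged and no error is raised, which is the kind of silent failure a uniqueness check guards against.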

There may have been some benefit of strictly requiring unique columns since there is a closed issue about cryptic errors with repeated columns: #61. I can't remember what that may have been, though... perhaps others might? cc @Phlya @nvictus @golobor

Are there any new operations that are done frequently with chrom,pos dataframes (pointFrames?) that wouldn't be done with chrom,start,end bedframes?

Phlya commented

After a brief discussion, any thoughts on end=start vs end=start+1? When would one want one vs the other in the real world?
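To illustrate where the two conventions can differ, here is a small sketch using a generic strict-inequality overlap test on half-open coordinates. The function `strict_overlap` is an assumption made for illustration; it is not claimed to be how bioframe decides overlaps.

```python
def strict_overlap(start, end, q_start, q_end):
    """Generic overlap predicate for half-open intervals (illustrative only)."""
    return start < q_end and end > q_start


positions = [50, 100, 250]
q_start, q_end = 50, 250  # query region [50, 250)

# end = start: each position is a zero-length point [X, X)
as_points = [p for p in positions if strict_overlap(p, p, q_start, q_end)]

# end = start + 1: each position is a 1-bp interval [X, X + 1)
as_1bp = [p for p in positions if strict_overlap(p, p + 1, q_start, q_end)]

print(as_points)  # [100]
print(as_1bp)     # [50, 100]
```

Under this predicate the two choices disagree only for a position sitting exactly on the query's left edge, so the choice matters mostly for boundary cases.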

How about one of the following:

  1. We add a step inside bioframe that inserts an extra "end" column and assigns it start+1 (or just start?).
  2. You call df_variant.assign(variant_position_end=df_variant['variant_position']+1) (or without the +1) directly inside the bioframe.select call, which is perhaps more convenient than assigning on a separate line.
  3. We simply relax the restriction on column names, after looking into the cryptic error described above.

@Phlya's point 3 is now done in #131

We probably still need to patch

  • expand
  • trim

by adding a kwarg to is_bedframe for unique column verification.
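A minimal sketch of what an opt-out for the uniqueness check could look like, modeled loosely on the checks visible in the traceback above. The function name `verify_columns` and the `unique_cols` keyword are hypothetical; the actual change in bioframe may differ.

```python
import pandas as pd


def verify_columns(df, colnames, unique_cols=True):
    """Hypothetical column check with an optional uniqueness requirement."""
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df is not a dataframe")
    # Only reject repeated names when the caller asks for strict behavior.
    if unique_cols and len(set(colnames)) < len(colnames):
        raise ValueError("column names must be unique")
    missing = set(colnames) - set(df.columns)
    if missing:
        raise ValueError(f"columns missing from dataframe: {sorted(missing)}")
    return True


df = pd.DataFrame({"contig": ["chr1"], "position": [100]})
point_cols = ["contig", "position", "position"]

# Relaxed mode accepts start == end for point data.
relaxed_ok = verify_columns(df, point_cols, unique_cols=False)
```

Operations such as expand and trim could keep the strict default, while select-style operations pass `unique_cols=False`.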

Closing as resolved. If patches are needed for expand and trim, we can open a new issue.