alternative for `seqkit locate` that can locate subsequences/motifs
abearab opened this issue · 12 comments
I found seqkit locate very useful but it was surprisingly slow. Would you think there is any biobear
approach for that? In general, any seqkit functions are very common and useful – in case you are interested to look deeper :)
It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.
To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear
to better support these use cases.
For example, use the seqkit example agctggagctacc
...
-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2 |
+------+
You could then add that to a WHERE
clause.
Perhaps, I'll add a function that replicates locate
, but returns a list of matches... e.g.
SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}
Then more or less have the returned table be aligned with seqkits
example? Is that a useful function.
Thanks @tshauck for your response. I think that's interesting addition (it's interesting that levenshtein
distance is already implemented; I didn't know that! I'll need to ask you another question about that but I'll leave this question clear for now).
For the locate
question here, I had some specific thought which is described here – https://github.com/abearab/PosSpacer. I just made this a public repo so you can see my algorithm design notes. However, I gave up on that until I could replace my codes with biobear
! The preprocessed NGS reads needs to be counted and the location of spacers in sequencing reads would be nice to be reported.
@abearab, cool, that makes sense to me... I just updated biobear on pypi with the new quality score function as well as an alignment_score function that does basic local alignment between two sequences (e.g. e69df6b). I'll explore more how to return alignment position(s) within a string and follow up when I have some more info.
@tshauck I have issue installing it
(this might be a separate issue, I just created a new VM and fresh OS, so ...)
× Building wheel for biobear (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [168 lines of output]
Running `maturin pep517 build-wheel -i /home/abearab/miniforge3/envs/screenpro2/bin/python --compatibility off`
🍹 Building a mixed python/rust project
🔗 Found pyo3 bindings
🐍 Found CPython 3.9 at /home/abearab/miniforge3/envs/screenpro2/bin/python
📡 Using build options features from pyproject.toml
Compiling libc v0.2.153
...
The following warnings were emitted during compilation:
warning: ring@0.17.8: Failed to run: "cc" "--version"
error: failed to run custom build command for `ring v0.17.8`
Caused by:
process didn't exit successfully: `/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/target/release/build/ring-646b006cd1874eb5/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-env-changed=RING_PREGENERATE_ASM
cargo:rustc-env=RING_CORE_PREFIX=ring_core_0_17_8_
OPT_LEVEL = Some("3")
TARGET = Some("x86_64-unknown-linux-gnu")
HOST = Some("x86_64-unknown-linux-gnu")
cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
CC_x86_64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
CC_x86_64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
cargo:warning=Failed to run: "cc" "--version"
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
CFLAGS_x86_64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
CFLAGS_x86_64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
--- stderr
error occurred: Failed to find tool. Is `cc` installed?
warning: build failed, waiting for other jobs to finish...
💥 maturin failed
Caused by: Failed to build a native library through cargo
Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.9-64bit" PYO3_PYTHON="/home/abearab/miniforge3/envs/screenpro2/bin/python" PYTHON_SYS_EXECUTABLE="/home/abearab/miniforge3/envs/screenpro2/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/Cargo.toml" "--release" "--lib"`
Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/abearab/miniforge3/envs/screenpro2/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for biobear
Failed to build biobear
ERROR: Could not build wheels for biobear, which is required to install pyproject.toml-based projects
Ah, ok, glad you got it. Looks like I have a little work to do dependency-wise.
It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.
To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to
biobear
to better support these use cases.For example, use the seqkit example
agctggagctacc
...-- if you wanted to match match against an adapter at the start and allow for two mismatches ❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist; +------+ | dist | +------+ | 2 | +------+
You could then add that to a
WHERE
clause.Perhaps, I'll add a function that replicates
locate
, but returns a list of matches... e.g.SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}
Then more or less have the returned table be aligned with
seqkits
example? Is that a useful function.
Hi @tshauck – I'm getting back to this discussion and I need to use this functionality to process a dataset. I would be more than happy to test your tool in case you have any new features. What do you recommend to start with? I liked your idea to have that locate
function as part of biobear
! Let me know what you think, thanks.
Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).
For locate
after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.
I have something in this branch I hope to finish up tomorrow. It's slightly different than locate as it requires a regex right now, but is relatively close... e.g.
❯ SELECT locate_regex('agctggagctacc', '[a][atcg][c]') AS locate; -- match a, then one of atc or g, then c
+-----------------------------------------------------------------------------------------------------+
| locate |
+-----------------------------------------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}, {start: 11, end: 14, match: acc}] |
+-----------------------------------------------------------------------------------------------------+
❯ SELECT locate_regex('agctggagctacc', 'agc') AS locate; -- match only agc
+-------------------------------------------------------------------+
| locate |
+-------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}] |
+-------------------------------------------------------------------+
This is similar to the seqkit example: https://bioinf.shenwei.me/seqkit/usage/#locate
I also hope to add a non-regex based one similar to locate
, but it's a little trickier... I'll follow up tomorrow when I know more about how hard/easy it is to add locate
w/o the regex.
@abearab I updated biobear to support locate_regex
as shown above. You can see how it looks through biobear/polars here:
biobear/python/tests/test_session.py
Lines 43 to 68 in 5d9a881
Please let me know if you have any feedback on how that works for your task -- thanks!
Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).
For
locate
after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.
Hi @tshauck, Thanks for looking into this. There is no time pressure on my end (i.e. I might also not test new features in biobear
as the top priority). I'm currently working on CLI development ArcInstitute/ScreenPro2#35 but I would like to use locate
/ locate_regex
function for topics like ArcInstitute/ScreenPro2#39.
Cool, that all sounds good to me, no rush for me. Please just "at" me on this issue if/when you use this for your work if you have any thoughts on improvements and/or questions. Thanks!