alternative for `seqkit locate` that can locate subsequences/motifs

I found seqkit locate very useful but it was surprisingly slow. Would you think there is any biobear approach for that? In general, any seqkit functions are very common and useful – in case you are interested to look deeper :)

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...

-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+

You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.

SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}

Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

Thanks @tshauck for your response. I think that's interesting addition (it's interesting that levenshtein distance is already implemented; I didn't know that! I'll need to ask you another question about that but I'll leave this question clear for now).

For the locate question here, I had some specific thought which is described here – https://github.com/abearab/PosSpacer. I just made this a public repo so you can see my algorithm design notes. However, I gave up on that until I could replace my codes with biobear! The preprocessed NGS reads needs to be counted and the location of spacers in sequencing reads would be nice to be reported.

@abearab, cool, that makes sense to me... I just updated biobear on pypi with the new quality score function as well as an alignment_score function that does basic local alignment between two sequences (e.g. e69df6b). I'll explore more how to return alignment position(s) within a string and follow up when I have some more info.

@tshauck I have issue installing it

(this might be a separate issue, I just created a new VM and fresh OS, so ...)

  × Building wheel for biobear (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [168 lines of output]
      Running `maturin pep517 build-wheel -i /home/abearab/miniforge3/envs/screenpro2/bin/python --compatibility off`
      🍹 Building a mixed python/rust project
      🔗 Found pyo3 bindings
      🐍 Found CPython 3.9 at /home/abearab/miniforge3/envs/screenpro2/bin/python
      📡 Using build options features from pyproject.toml
         Compiling libc v0.2.153
       ...
      The following warnings were emitted during compilation:
      
      warning: ring@0.17.8: Failed to run: "cc" "--version"
      
      error: failed to run custom build command for `ring v0.17.8`
      
      Caused by:
        process didn't exit successfully: `/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/target/release/build/ring-646b006cd1874eb5/build-script-build` (exit status: 1)
        --- stdout
        cargo:rerun-if-env-changed=RING_PREGENERATE_ASM
        cargo:rustc-env=RING_CORE_PREFIX=ring_core_0_17_8_
        OPT_LEVEL = Some("3")
        TARGET = Some("x86_64-unknown-linux-gnu")
        HOST = Some("x86_64-unknown-linux-gnu")
        cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
        CC_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
        CC_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CC
        HOST_CC = None
        cargo:rerun-if-env-changed=CC
        CC = None
        cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
        cargo:warning=Failed to run: "cc" "--version"
        cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
        CRATE_CC_NO_DEFAULTS = None
        DEBUG = Some("false")
        CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
        cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
        CFLAGS_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
        CFLAGS_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CFLAGS
        HOST_CFLAGS = None
        cargo:rerun-if-env-changed=CFLAGS
        CFLAGS = None
      
        --- stderr
      
      
        error occurred: Failed to find tool. Is `cc` installed?
      
      
      warning: build failed, waiting for other jobs to finish...
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.9-64bit" PYO3_PYTHON="/home/abearab/miniforge3/envs/screenpro2/bin/python" PYTHON_SYS_EXECUTABLE="/home/abearab/miniforge3/envs/screenpro2/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/Cargo.toml" "--release" "--lib"`
      Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/abearab/miniforge3/envs/screenpro2/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for biobear
Failed to build biobear
ERROR: Could not build wheels for biobear, which is required to install pyproject.toml-based projects

#106 (comment)

I had to install this in system level:

sudo apt install GCC

Ah, ok, glad you got it. Looks like I have a little work to do dependency-wise.

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...
-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+
You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.
SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}
Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

Hi @tshauck – I'm getting back to this discussion and I need to use this functionality to process a dataset. I would be more than happy to test your tool in case you have any new features. What do you recommend to start with? I liked your idea to have that locate function as part of biobear! Let me know what you think, thanks.

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

I have something in this branch I hope to finish up tomorrow. It's slightly different than locate as it requires a regex right now, but is relatively close... e.g.

❯ SELECT locate_regex('agctggagctacc', '[a][atcg][c]') AS locate; -- match a, then one of atc or g, then c
+-----------------------------------------------------------------------------------------------------+
| locate                                                                                              |
+-----------------------------------------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}, {start: 11, end: 14, match: acc}] |
+-----------------------------------------------------------------------------------------------------+

❯ SELECT locate_regex('agctggagctacc', 'agc') AS locate; -- match only agc
+-------------------------------------------------------------------+
| locate                                                            |
+-------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}] |
+-------------------------------------------------------------------+

This is similar to the seqkit example: https://bioinf.shenwei.me/seqkit/usage/#locate

I also hope to add a non-regex based one similar to locate, but it's a little trickier... I'll follow up tomorrow when I know more about how hard/easy it is to add locate w/o the regex.

@abearab I updated biobear to support locate_regex as shown above. You can see how it looks through biobear/polars here:

biobear/python/tests/test_session.py

Lines 43 to 68 in 5d9a881

    
           def test_read_fastq_with_qs_to_list(): 
        
               """Test quality scores to list.""" 
        
               session = connect() 
        
               fastq_path = DATA / "test.fastq.gz" 
        
               df = session.sql( 
        
                   f""" 
        
                   SELECT quality_scores_to_list(quality_scores) quality_score_list, locate_regex(sequence, '[AC]AT') locate 
        
                   FROM fastq_scan('{fastq_path}') 
        
                   """ 
        
               ).to_polars() 
        
               assert len(df) == 2 
        
               assert df.get_column("quality_score_list").to_list()[0][:5] == [0, 6, 6, 9, 7] 
        
               assert df.get_column("locate").to_list() == [ 
        
                   [ 
        
                       {"start": 29, "end": 32, "match": "AAT"}, 
        
                       {"start": 36, "end": 39, "match": "AAT"}, 
        
                       {"start": 40, "end": 43, "match": "CAT"}, 
        
                   ], 
        
                   [ 
        
                       {"start": 29, "end": 32, "match": "AAT"}, 
        
                       {"start": 36, "end": 39, "match": "AAT"}, 
        
                       {"start": 40, "end": 43, "match": "CAT"}, 
        
                   ], 
        
               ]

.

Please let me know if you have any feedback on how that works for your task -- thanks!

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

Hi @tshauck, Thanks for looking into this. There is no time pressure on my end (i.e. I might also not test new features in biobear as the top priority). I'm currently working on CLI development ArcInstitute/ScreenPro2#35 but I would like to use locate / locate_regex function for topics like ArcInstitute/ScreenPro2#39.

Cool, that all sounds good to me, no rush for me. Please just "at" me on this issue if/when you use this for your work if you have any thoughts on improvements and/or questions. Thanks!

	def test_read_fastq_with_qs_to_list():
	"""Test quality scores to list."""
	session = connect()
	fastq_path = DATA / "test.fastq.gz"

	df = session.sql(
	f"""
	SELECT quality_scores_to_list(quality_scores) quality_score_list, locate_regex(sequence, '[AC]AT') locate
	FROM fastq_scan('{fastq_path}')
	"""
	).to_polars()

	assert len(df) == 2
	assert df.get_column("quality_score_list").to_list()[0][:5] == [0, 6, 6, 9, 7]
	assert df.get_column("locate").to_list() == [
	[
	{"start": 29, "end": 32, "match": "AAT"},
	{"start": 36, "end": 39, "match": "AAT"},
	{"start": 40, "end": 43, "match": "CAT"},
	],
	[
	{"start": 29, "end": 32, "match": "AAT"},
	{"start": 36, "end": 39, "match": "AAT"},
	{"start": 40, "end": 43, "match": "CAT"},
	],
	]