TobiasHeOl/kasearch

Error when running EasySearch: only results for "Identity" column

Opened this issue · 5 comments

Dear kasearch team,

First of all, thanks for all your work, kasearch is really promising!! I'm really hoping I can get it running soon.

I'm trying to run EasySearch on the sample sequence. I downloaded the publication dataset into this folder: /researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/

from kasearch import EasySearch
# Run ka search
results = EasySearch('QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    allowed_chain='Heavy',  
    allowed_species='Human', 
    regions=['whole'],  
    length_matched=[False], 
    database_path='/researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/'
)

But I get this error:

Traceback (most recent call last):
  File "/home/ltwomey/src/Analysis/scRNAseq/run_kasearch.py", line 15, in <module>
    results = EasySearch(
              ^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/easy_search.py", line 56, in EasySearch
    return targetdb.get_meta(n_query=0, n_region=0, n_sequences='all', n_jobs=n_jobs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/kasearch.py", line 150, in get_meta
    metadf = self._extract_meta(self.current_best_ids[n_query, :n_sequences, n_region], n_jobs=n_jobs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 87, in _extract_meta
    fetched_metadata = pd.concat(Parallel(n_jobs=n_jobs)(delayed(self._get_single_study_meta)(group) for group in groups))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 47, in _get_single_study_meta
    study_file = self.id_to_study[study_id]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: np.int64(3595)

I'm using:

  • biopython=1.83
  • hmmer=3.4
  • muscle=3.8.1551-6
  • anarci=1.3 (commit 79f6c575056dedef86cb8f405ebb039197923eec)
  • kasearch (commit fb0ebc7)

Update! I figured out that I only got the error above after removing the "Bender et al" lines from the id_to_study.txt file.
When I use the original id_to_study.txt file from the 2023 OAS-aligned (63GB), kasearch runs but outputs a dataframe with no metadata (see below). There are only 8 lines with Identity values; the rest are empty.
I am unsure whether this is because Bender et al. was removed from OAS, or because I am not using EasySearch correctly. Any help would be greatly appreciated!

Could you let me know how to get the latest pre-aligned version of OAS?

I am running the command from the issue above:

Analysis starting at: 2024-07-05 14:57:16.627652
Running Easy Search...................................................

Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Finished Easy Search...................................................

Saving results...................................................

  Unnamed: 0 sequence locus  ... Total sequences Isotype  Identity
0        NaN      NaN   NaN  ...             NaN     NaN  0.899160
1        NaN      NaN   NaN  ...             NaN     NaN  0.899160
2        NaN      NaN   NaN  ...             NaN     NaN  0.892562
3        NaN      NaN   NaN  ...             NaN     NaN  0.892562
4        NaN      NaN   NaN  ...             NaN     NaN  0.890756

[5 rows x 114 columns]
Analysis finished at: 2024-07-05 15:30:56.135003



Hi Laura, thank you for using KA-Search and highlighting this issue!

Some time ago we decided to remove parts of the Bender 2020 study from OAS because we suspect some of its human data contain mouse sequences. However, because this would break the public pre-processed OAS for KA-Search, we updated the kasearch code to flag when a user's query matches Bender 2020 sequences. Such matches are returned without metadata, as the metadata is no longer in OAS. Unfortunately, we had left a sequence that matches Bender 2020 sequences as the example sequence; this has now been changed (#10).
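A minimal sketch of how hits with and without metadata could be separated, assuming the EasySearch result is a pandas DataFrame with an Identity column alongside metadata columns such as sequence, locus and Isotype (as in the output above). The toy values and the helper name split_by_metadata are invented for illustration and are not part of kasearch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an EasySearch result; all values are made up.
results = pd.DataFrame({
    "sequence": [np.nan, np.nan, "QVQLVQSG", "EVQLVESG"],
    "locus": [np.nan, np.nan, "H", "H"],
    "Isotype": [np.nan, np.nan, "IGHG", "IGHM"],
    "Identity": [0.90, 0.89, 0.85, 0.80],
})

def split_by_metadata(df, identity_col="Identity"):
    """Return (with_meta, without_meta).

    A row counts as "without metadata" when every column except the
    identity column is missing (NaN).
    """
    meta_cols = [c for c in df.columns if c != identity_col]
    has_meta = df[meta_cols].notna().any(axis=1)
    return df[has_meta], df[~has_meta]

with_meta, without_meta = split_by_metadata(results)
print(f"{len(without_meta)} of {len(results)} hits have no metadata")
```

Rows that end up in without_meta would correspond to matches whose study is no longer in OAS, if the explanation above applies to your query.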

To get an up-to-date version, you can create your own pre-aligned version of OAS using the prepareOASdb.ipynb notebook. This will take some time and resources (~1 day on 20 CPUs).

I hope this helps, otherwise please let me know if you have any other issues.

Hi Tobias, thank you so much for your fast answer!
I've been trying to do as you suggested but I must be misunderstanding the documentation - could you let me know if this is what you meant?
To run prepareOASdb.ipynb, I need a local version of OAS (local_oas_path = '/path/to/oas/database/').
Since I'm creating my own version, I wanted it to be the latest one. So on the OAS database website, I selected unpaired > human and downloaded the .sh script for batch download. It's been running on 10 CPUs for the last 4 (!) days, so I'm guessing this is not exactly what you meant (?) Or do I just need more resources if I want the large dataset?
Let me know, and apologies if I missed something obvious!

Hi Tobias! I managed to download all human heavy sequences (IGH) from OAS, which are now sitting in a folder as .csv.gz.
However, when I try to index the local version of OAS using prepareOASdb.ipynb, I get an error (see below), which I think has to do with this line in the set_id_to_study function of prepare_OASdb.py:

data_unit_files = glob.glob(os.path.join(local_oas_path, 'unpaired','*/*/*.csv.gz'))

Since all my .csv.gz files are in one folder, I think it cannot find them? I'm just wondering what I am doing wrong - I've been trying to run ka-search for a long time! Any help would be greatly appreciated :) Thank you so much!

  File ".../Tools/VIRTUAL_ENVIRONMENTS/kasearch/lib/python3.12/site-packages/kasearch/prepare_OASdb.py", line 110, in process_many_files
    data_file_subsets = [flatten(data_unit_files[x:x+subset_size]) for x in range(0, len(data_unit_files), subset_size)]
                                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: range() arg 3 must not be zero

I figured it out:)

Indeed, the input files need to be in the OASdb folder structure, so local_oas_path/unpaired/Heavy/Human/, in my case. Then it works nicely:)
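
For anyone hitting the same error, here is a rough sketch of moving a flat folder of .csv.gz files into that layout and checking that the glob pattern quoted above finds them. The helper name arrange_oas_files and the example paths are hypothetical. My reading of the traceback is that the ValueError appears when the glob matches no files, so the chunk size presumably ends up as zero, but I have not confirmed this in the kasearch source.

```python
import glob
import os
import shutil

def arrange_oas_files(flat_dir, local_oas_path, chain="Heavy", species="Human"):
    """Move every .csv.gz in flat_dir into local_oas_path/unpaired/<chain>/<species>/.

    Returns the files matched by the same pattern as the glob line quoted
    from set_id_to_study above, so an empty list means the layout is still wrong.
    """
    target_dir = os.path.join(local_oas_path, "unpaired", chain, species)
    os.makedirs(target_dir, exist_ok=True)
    for path in glob.glob(os.path.join(flat_dir, "*.csv.gz")):
        shutil.move(path, os.path.join(target_dir, os.path.basename(path)))
    return glob.glob(os.path.join(local_oas_path, "unpaired", "*/*/*.csv.gz"))

# Example call (placeholder paths, adjust to your setup):
# found = arrange_oas_files("/path/to/downloads/", "/path/to/oas/database/")
# print(f"{len(found)} data unit files found")
```

If the returned list is empty, the notebook would most likely fail in the same way again.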