serialise load BLAST data error (Cannot use a compiled regex as replacement pattern with regex=False)
Hi,
When running mikado serialise with BLAST tsv results I get the error: Cannot use a compiled regex as replacement pattern with regex=False
Serialise log:
2024-06-13 09:30:05,071 - serialise - serialise.py:321 - INFO - setup - MainProcess - Mikado version: 2.3.4
2024-06-13 09:30:05,071 - serialise - serialise.py:322 - INFO - setup - MainProcess - Command line: /users/timg/.conda/envs/mikado/bin/mikado serialise --json-conf /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/configuration_v4.yaml --tsv /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.blast.tsv --orfs /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.fasta.transdecoder.bed --blast-loading-debug
2024-06-13 09:30:05,084 - serialise - serialise.py:332 - INFO - setup - MainProcess - Random seed: 0
2024-06-13 09:30:05,084 - serialise - serialise.py:345 - INFO - setup - MainProcess - Using a sqlite database (location: /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado.db)
2024-06-13 09:30:05,084 - serialise - serialise.py:348 - INFO - setup - MainProcess - Requested 1 threads, forcing single thread: False
2024-06-13 09:30:05,085 - serialise - serialise.py:176 - INFO - load_orfs - MainProcess - Starting to load ORF data
2024-06-13 09:30:30,488 - serialise - orf.py:351 - INFO - __serialize_single_thread - MainProcess - Finished loading 369856 ORFs into the database
2024-06-13 09:30:34,336 - serialise - serialise.py:187 - INFO - load_orfs - MainProcess - Finished loading ORF data
2024-06-13 09:30:34,435 - serialise - serialise.py:142 - INFO - load_blast - MainProcess - Starting to load BLAST data
2024-06-13 09:30:34,435 - serialise - blast_serialiser.py:82 - INFO - __init__ - MainProcess - Number of dedicated workers: 1
2024-06-13 09:30:34,441 - serialise - blast_serialiser.py:106 - WARNING - __init__ - MainProcess - Activating the XML debug mode
2024-06-13 09:30:45,733 - serialise - blast_serialiser.py:249 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2024-06-13 09:30:45,975 - serialise - blast_serialiser.py:283 - INFO - __serialize_targets - MainProcess - Loaded 41712 objects into the "target" table
2024-06-13 09:30:46,075 - serialise - blast_serialiser.py:174 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2024-06-13 09:30:46,492 - serialise - blast_serialiser.py:226 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2024-06-13 09:30:47,568 - serialise - blast_serialiser.py:233 - INFO - __serialize_queries - MainProcess - 450524 in queries
2024-06-13 09:30:47,598 - serialise - tab_serialiser.py:31 - INFO - _serialise_tabular - MainProcess - Creating a pool with 1 workers for analysing BLAST results
2024-06-13 09:30:48,538 - serialise - tabular_utils.py:431 - INFO - parse_tab_blast - MainProcess - Reading /scratch/timg/desiree_annotation/mikado/De_v1_hap3_chrs/v4/mikado_prepared.blast.tsv data
2024-06-13 09:30:57,391 - serialise - serialise.py:388 - ERROR - serialise - MainProcess - Mikado crashed due to an error. Please check the logs for hints on the cause of the error; if it is a bug, please report it to https://github.com/EI-CoreBioinformatics/mikado/issues.
2024-06-13 09:30:57,392 - serialise - serialise.py:390 - ERROR - serialise - MainProcess - Cannot use a compiled regex as replacement pattern with regex=False
Command used:
mikado serialise \
--json-conf $out_dir/configuration_$version.yaml \
--tsv $out_dir/mikado_prepared.blast.tsv \
--orfs $out_dir/mikado_prepared.fasta.transdecoder.bed \
--blast-loading-debug
First 20 lines of $out_dir/mikado_prepared.blast.tsv:
bacillus_STRG-bacillus.1.1 sp|Q8RWX4|KOC1_ARATH 35.294 34 22 0 102 1 300 333 1.0 30.4 61.76 ILVL1FLSKQS1RQKNGSMERE2LVER4ESER1VI1EKASVSVASTDL1FY1
bacillus_STRG-bacillus.1.1 sp|Q1EHT7|C3H4_ORYSJ 36.000 25 16 0 261 187 879 903 2.6 29.3 72.00 1STQKIL2IVLF1LFKA1GIND2TENSLIFSHERQ2IL
bacillus_STRG-bacillus.1.1 sp|O65451|FB333_ARATH 44.737 38 17 1 120 19 175 212 4.4 28.5 63.16 2VQKRGN1ITVRVL1SK3KN-R-I-E-MGVML2QKLS1RD2DS2ED1RK2VI
bacillus_STRG-bacillus.1.1 sp|Q94EJ6|PMTE_ARATH 20.930 86 66 1 588 331 354 437 6.3 28.1 46.51 1IF1QKHKYIDNSDKR1GCGDPR1VT1SVLDET1*-S-IKQR1GDRTED1TV1WYNKGEYI1VTVCAV2SFFPRKLVASSNKELEKEMVLAKGRGNKMLRK1WFSPKEERML1GAIVVPAPVSKIMSRKWGTL1KNNG1EDGEMENSAYLQ2
bacillus_STRG-bacillus.1.1 sp|Q9FIY7|SM3L3_ARATH 33.333 36 24 0 129 22 554 589 6.4 28.1 66.67 KE1EDEYEDVEKRGA1ILVLVAFISAQKLR3MCRQEQQS2RQGNIK4VARP1AS
cpb_STRG-cpb.320.1 sp|Q8VXX4|RFC3_ARATH 85.606 132 19 0 176 571 1 132 3.56e-82 248 96.21 3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
cpb_STRG-cpb.320.1 sp|Q852K3|RFC5_ORYSJ 78.030 132 29 0 176 571 1 132 2.43e-76 234 89.39 3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
cpb_STRG-cpb.320.1 sp|Q9CAQ8|RFC5_ARATH 34.091 132 63 4 182 520 41 167 7.93e-17 78.2 53.03 1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1-S-V-K-L-V-L-L-D-E-A-D-A-M-T-K2GQ1QADL1YR1VIQEEKIYIT1
cpb_STRG-cpb.320.1 sp|Q93ZX1|RFC4_ARATH 43.939 66 37 0 182 379 11 76 1.24e-14 71.6 63.64 1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
cpb_STRG-cpb.320.1 sp|Q6YZ54|RFC3_ORYSJ 29.605 152 72 4 182 619 40 162 6.14e-14 69.7 48.68 1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1W-K-V-D-A-G-T-R-T-I-D-V-E-L-T-T-L-S-S-T-H-H-VL3PA2AEGRFGQI1R-Y-2QREQIQ1KQEDMF1KSNA1PSILDSTFKGGA1KQGSFV1-M-V-L-L-D-EGA1SATM1*KADSAFQYFLA1KR2LI
cpb_STRG-cpb.320.2 sp|Q8VXX4|RFC3_ARATH 85.606 132 19 0 145 540 1 132 6.89e-83 250 96.21 3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
cpb_STRG-cpb.320.2 sp|Q852K3|RFC5_ORYSJ 78.030 132 29 0 145 540 1 132 4.20e-77 235 89.39 3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
cpb_STRG-cpb.320.2 sp|Q9CAQ8|RFC5_ARATH 34.091 132 63 4 151 489 41 167 4.19e-17 78.6 53.03 1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1-S-V-K-L-V-L-L-D-E-A-D-A-M-T-K2GQ1QADL1YR1VIQEEKIYIT1
cpb_STRG-cpb.320.2 sp|Q93ZX1|RFC4_ARATH 43.939 66 37 0 151 348 11 76 6.79e-15 72.0 63.64 1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
cpb_STRG-cpb.320.2 sp|Q6YZ54|RFC3_ORYSJ 29.605 152 72 4 151 588 40 162 3.33e-14 70.5 48.68 1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1W-K-V-D-A-G-T-R-T-I-D-V-E-L-T-T-L-S-S-T-H-H-VL3PA2AEGRFGQI1R-Y-2QREQIQ1KQEDMF1KSNA1PSILDSTFKGGA1KQGSFV1-M-V-L-L-D-EGA1SATM1*KADSAFQYFLA1KR2LI
helixer_De_v1_hap3_chrs_chr_1_004325.1 sp|Q8VXX4|RFC3_ARATH 85.606 132 19 0 215 610 1 132 3.46e-83 251 96.21 3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1
helixer_De_v1_hap3_chrs_chr_1_004325.1 sp|Q852K3|RFC5_ORYSJ 78.030 132 29 0 215 610 1 132 1.91e-77 236 89.39 3IV11IT2QDDQ5RK3SA1GQ18IV3LIRK1IM2PASG5VM3IT2VI1AT1TS1TN1DEVI2TATM3TA4LM15IV11TA3KRGA2
helixer_De_v1_hap3_chrs_chr_1_004325.1 sp|Q9CAQ8|RFC5_ARATH 38.235 102 54 3 221 514 41 137 4.63e-17 78.6 59.80 1IVDE4KQTS2KD1IAVA1QR1VIAIQDNTLIRDKR1VTSN1GNDKCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY2S-A-D-K-V-1VYER1KM1WLKEVLDN1GS-DTD1TG3-V-R-QEQLITQTDLFSA2HQHSVFES1NGPK1
helixer_De_v1_hap3_chrs_chr_1_004325.1 sp|Q93ZX1|RFC4_ARATH 43.939 66 37 0 221 418 11 76 6.52e-15 72.4 63.64 1IVDE5TQLVDKKD1IAVHHQQEDE1AVQRNV1RTKNLTVLSQETGA4LM5SP1ST2KT1LTIAML1LILARH1IL3SEALDY1VSKR1
helixer_De_v1_hap3_chrs_chr_1_004325.1 sp|Q6YZ54|RFC3_ORYSJ 38.554 83 45 2 221 466 40 117 4.07e-14 70.1 62.65 1IVDE4KQTS1DGKD1IAVA1QR1VIAVQDNTLIRDKR1VTSN1GNDRCL4FL3SP1ST2KTTSLT1ML1LVLA1QKILFY1PSS-A-D-K-V-KQVYEG1KM1WLKEVLDN1GS-DTE1TG3
I tried using --force and --blast_targets $target_fasta, and changing --tsv to --xml, and the error stays the same.
I am confused about the --blast_targets flag. When is it needed?
Blast tsv was constructed with:
makeblastdb \
-in $out_dir/blast/$prot_name.fa \
-dbtype prot -parse_seqids > \
$out_dir/blast/"$prot_name"_prepare.log
blastx -max_target_seqs 5 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" \
-num_threads $threads \
-query $out_dir/mikado_prepared.fasta \
-db $out_dir/blast/$prot_name.fa \
-out $out_dir/mikado_prepared.blast.tsv
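To rule out a malformed file, the tsv layout can be checked against the -outfmt string above. This is only a minimal sanity-check sketch and not part of Mikado: the helper name read_blast_rows is made up for illustration, and the sample row is one line from the excerpt above re-joined with tabs.

```python
import csv
import io

# Column names copied from the -outfmt string in the blastx command above.
BLAST_COLUMNS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore ppos btop"
).split()


def read_blast_rows(handle):
    """Yield one dict per tab-separated row, failing on an unexpected column count.

    Hypothetical helper for illustration only; Mikado has its own parser.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if len(fields) != len(BLAST_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(BLAST_COLUMNS)} columns, "
                f"got {len(fields)}"
            )
        yield dict(zip(BLAST_COLUMNS, fields))


# One row from the excerpt above, re-joined with tabs for the demo.
sample = "\t".join([
    "cpb_STRG-cpb.320.1", "sp|Q8VXX4|RFC3_ARATH", "85.606", "132", "19", "0",
    "176", "571", "1", "132", "3.56e-82", "248", "96.21",
    "3IV6TS7QE1VI2NK1RK5GQ23RK2FY1PA2DE6KRIA6TS4VL8HN4NT10VI21FY1",
]) + "\n"

rows = list(read_blast_rows(io.StringIO(sample)))
print(rows[0]["qseqid"], rows[0]["sseqid"], rows[0]["evalue"])
```

On the real file, the same function could be fed open(path, newline="") instead of the in-memory sample.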
I could find only this similar issue that mentions regular expressions when importing BLAST data: #392
Should I try to rerun makeblastdb without -parse_seqids?
It was also mentioned in another issue that some combination of BLAST result columns has to be unique, but the error there was different. I did not do any filtering of the BLAST results.
Any ideas, what is causing the problem?
As a starting point I would suggest simplifying the headers in the fasta file used to create your BLAST DB (e.g. with seqkit -i) and then redoing your BLAST.
I simplified the fasta headers to include only the sequence ID (e.g. Q8RWX4) and checked for potential duplicates. I created a new BLAST DB and reran BLAST, but the error stays the same. For some reason, the target IDs in the tsv file include the string sp|<actual_id>|.
Because of this I also tried to use IDs with a prefix (e.g. uniprot_Q8RWX4) and the IDs in the tsv now match fasta headers, but the error stays the same.
What else can I try? I'm running out of ideas.
Should I rerun some previous stages of mikado after creating new BLAST results?
I did not rerun any, only updated the path to the BLAST target fasta in the configuration yaml (for --json-conf).
Edit: I am using Mikado v2.3.4
The error Cannot use a compiled regex as replacement pattern with regex=False is a pandas error, referring to the str.replace function, used here:
mikado/Mikado/serializers/blast_serializer/tabular_utils.py
Lines 266 to 267 in 69abafe
From pandas version 2 onwards, the regex argument of str.replace defaults to False. I added regex=True to both lines and serialize now works.
My installation of Mikado installed with mamba uses pandas v2.2.2.
I'm not familiar with forking and issuing commits, so I'll leave this to you :)
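For reference, a minimal sketch that reproduces the pandas behaviour outside Mikado, assuming pandas is installed. The compiled pattern and the sample IDs are illustrative stand-ins and are not copied from tabular_utils.py, so the real lines may look different.

```python
import re

import pandas as pd

# Hypothetical stand-in for a compiled pattern that extracts the accession
# from IDs such as sp|Q8RWX4|KOC1_ARATH (not the actual Mikado pattern).
id_pattern = re.compile(r"^sp\|([^|]+)\|.*$")

ids = pd.Series(["sp|Q8RWX4|KOC1_ARATH", "sp|Q1EHT7|C3H4_ORYSJ"])

# With regex=True passed explicitly, a compiled pattern is accepted.
cleaned = ids.str.replace(id_pattern, r"\1", regex=True)
print(cleaned.tolist())  # ['Q8RWX4', 'Q1EHT7']

# With regex=False (the default from pandas 2 onwards), the same call raises
# the ValueError reported in this issue.
try:
    ids.str.replace(id_pattern, r"\1", regex=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Passing regex=True explicitly, as in the fix above, avoids relying on the default that changed in pandas 2.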