sqanti_filter.py ML classifier variable importance table output error
jelfman opened this issue · 9 comments
Is there an existing issue for this?
- I have searched the existing issues
Have you loaded the SQANTI3.env conda environment?
- I have loaded the SQANTI3.env conda environment
Problem description
Hi!
My filter run seems to have worked fine, but I'm getting an error during output generation. As far as I can tell, the remaining outputs are generated correctly, but I can't figure out what is going wrong with generating the variable importance table, and I want to make sure that SQANTI throwing this error doesn't signal incomplete reporting.
Code sample
Input & Sample of stdout signaling successful run:
/home/jme5fe/.conda/envs/SQANTI3.env/bin/Rscript /sfs/weka/scratch/jme5fe/SQANTI3-5.2/utilities/filter/SQANTI3_MLfilter.R -c /sfs/weka/scratch/jme5fe/isoseq_trial_allcorrected/TestisShortRead/QC/FINAL_TRANSCRIPTOME/LRonly_Corrected_Unguided_All.transcript_models_classification.txt -o LRonly_Corrected_Unguided_All_filtered -d ml_sqanti_filter -t 0.8 -j 0.7 -i 60 -f False -e False -m TRUE -z 3000
Writing filter results to classification file...
Wrote filter results (ML and intra-priming) to new classification table:
LRonly_Corrected_Unguided_All_filtered_MLresult_classification.txt file.
Wrote isoform list (classified as non-artifacts by both ML and intra-priming
filters) to LRonly_Corrected_Unguided_All_filtered_inclusion-list.txt file
SUMMARY OF MACHINE LEARNING + INTRA-PRIMING FILTERS:
Artifact Isoform
172 8562
SQANTI3 ML filter finished successfully!
Error
SQANTI3 Machine Learning filter report
Loading required package: magrittr
Reading ML result classification table...
Reading classifier variable importance table...
Error in `dplyr::mutate()`:
ℹ In argument: `variable = factor(variable) %>% forcats::fct_reorder(importance)`.
Caused by error:
! object 'variable' not found
Backtrace:
▆
1. ├─imp %>% ...
2. ├─dplyr::mutate(., variable = factor(variable) %>% forcats::fct_reorder(importance))
3. ├─dplyr:::mutate.data.frame(., variable = factor(variable) %>% forcats::fct_reorder(importance))
4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
5. │ ├─base::withCallingHandlers(...)
6. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. ├─factor(variable) %>% forcats::fct_reorder(importance)
10. ├─forcats::fct_reorder(., importance)
11. │ └─forcats:::check_factor(.f)
12. ├─base::factor(variable)
13. └─base::.handleSimpleError(...)
14. └─dplyr (local) h(simpleError(msg, call))
15. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted
Output written to: ml_sqanti_filter/LRonly_Corrected_Unguided_All_filtered.filtered.gtf
Output written to: ml_sqanti_filter/LRonly_Corrected_Unguided_All_filtered.filtered.sam
Output written to: ml_sqanti_filter/LRonly_Corrected_Unguided_All_filtered.filtered.faa
Output written to: ml_sqanti_filter/LRonly_Corrected_Unguided_All_filtered.filtered.gff3
Anything else?
Of note, I'm running this on a selection of (mostly fusion) transcripts, but I'm not sure how that would factor into the generation and export of the table.
Thanks!
Justin
Hi @jelfman,
The error you're showing happens during report generation, specifically when trying to read the variable importance table into R. The filter finished correctly according to the logs, so the variable importance file should be in the output folder. Can you verify that this is true, and paste the contents here?
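A quick way to check from the shell (assuming the `ml_sqanti_filter` output directory from your command; the exact importance-table filename may vary between SQANTI3 versions, so this just greps for "importance"):

```shell
# Look for any file in the ML filter output whose name mentions "importance".
# An empty result means the table was never written, which would explain
# why the report errors while trying to read it.
ls ml_sqanti_filter/ | grep -i importance
```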
Ángeles
Hi Angeles,
I don't see a table labeled as such in the output folder. When I run with the intermediate files option, here are the contents:
GMST
LRonly_Corrected_Unguided_All_filtered_classification.txt
LRonly_Corrected_Unguided_All_filtered_corrected.faa
LRonly_Corrected_Unguided_All_filtered_corrected.fasta
LRonly_Corrected_Unguided_All_filtered_corrected.genePred
LRonly_Corrected_Unguided_All_filtered_corrected.gtf
LRonly_Corrected_Unguided_All_filtered_corrected.gtf.cds.gff
LRonly_Corrected_Unguided_All_filtered.gff3
LRonly_Corrected_Unguided_All_filtered_gtstats.txt
LRonly_Corrected_Unguided_All_filtered_junctions.txt
LRonly_Corrected_Unguided_All_filtered.params.txt
LRonly_Corrected_Unguided_All_filtered_SQANTI3_report.pdf
RC_corrected_to_LRonly_Corrected_Unguided_All_filtered.sam
RC_corrected_to_LRonly_Corrected_Unguided_All_filtered_ss.txt
refAnnotation_LRonly_Corrected_Unguided_All_filtered.genePred
RTS
@jelfman Could you attach the full ML filter log file? I'd like to take a look and see if I spot anything strange during ML filtering. There should be some sort of error during variable importance table writing and/or other warnings upstream of that. And it would be useful for developers in case they find something odd and need to debug!
I don't see a log file either. Would this be in the directory I ran it from, or the one provided with the -d flag? I tried running it again and hit an error on the same step, and possibly an issue with the GTF generated by the previous QC run:
Reading classifier variable importance table...
Error in `dplyr::mutate()`:
ℹ In argument: `variable = factor(variable) %>% forcats::fct_reorder(importance)`.
Caused by error:
! object 'variable' not found
Backtrace:
▆
1. ├─imp %>% ...
2. ├─dplyr::mutate(., variable = factor(variable) %>% forcats::fct_reorder(importance))
3. ├─dplyr:::mutate.data.frame(., variable = factor(variable) %>% forcats::fct_reorder(importance))
4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
5. │ ├─base::withCallingHandlers(...)
6. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. ├─factor(variable) %>% forcats::fct_reorder(importance)
10. ├─forcats::fct_reorder(., importance)
11. │ └─forcats:::check_factor(.f)
12. ├─base::factor(variable)
13. └─base::.handleSimpleError(...)
14. └─dplyr (local) h(simpleError(msg, call))
15. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted
Traceback (most recent call last):
File "/scratch/jme5fe/SQANTI3-5.2/sqanti3_filter.py", line 291, in <module>
main()
File "/scratch/jme5fe/SQANTI3-5.2/sqanti3_filter.py", line 278, in main
filter_files(args, ids, inclusion_file)
File "/scratch/jme5fe/SQANTI3-5.2/sqanti3_filter.py", line 60, in filter_files
for r in collapseGFFReader(args.gtf):
File "/scratch/jme5fe/SQANTI3-5.2/cDNA_Cupcake/cupcake/io/GFF.py", line 419, in __next__
return self.read()
File "/scratch/jme5fe/SQANTI3-5.2/cDNA_Cupcake/cupcake/io/GFF.py", line 590, in read
assert raw[2] == 'transcript'
AssertionError
Also apologies, I got myself mixed up with the list of files. I reran QC after to see what had changed with the filter runs. The contents of the ml folder are as follows:
intermediate_LRonly_Corrected_Unguided_All_filtered_MLinput_table.txt
LRonly_Corrected_Unguided_All_filtered.filtered.gtf
LRonly_Corrected_Unguided_All_filtered_inclusion-list.txt
LRonly_Corrected_Unguided_All_filtered_MLresult_classification.txt
LRonly_Corrected_Unguided_All_filtered_params.txt
LRonly_Corrected_Unguided_All_filtered_TN_list.txt
LRonly_Corrected_Unguided_All_filtered_TP_list.txt
Yes, that error happens during filtering of the QC-output GTF file using the inclusion list (transcripts that pass the filter), so there may be something wrong with the format. The error is caused by an unexpected value in the 3rd GTF column (which should specify whether the feature is a transcript, an exon, etc.). Have a look at the GTF in case there are upstream problems.
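For example, one quick way to see what the 3rd column actually contains (using the filtered GTF name from this thread; adjust the path to your own file):

```shell
# Tally the feature types in GTF column 3; for a cDNA_Cupcake-style
# collapsed GTF the reader expects only "transcript" and "exon" records.
cut -f3 LRonly_Corrected_Unguided_All_filtered.filtered.gtf | sort | uniq -c
```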
As for the log, it simply consists of the messages printed while the ML filter runs. You can redirect stdout to a file, but SQANTI3 does not do this automatically.
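For example (a sketch with generic placeholder paths, not your exact command):

```shell
# Save everything the filter prints to ml_filter.log while still showing
# it on screen; stderr is merged so R warnings are captured too.
sqanti3_filter.py ml --gtf corrected.gtf -o out_prefix -d outdir \
    classification.txt 2>&1 | tee ml_filter.log
```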
I'm mostly curious to see whether there are other errors or warnings when running the filter that may indicate why the variable importance table is not generated. It may be that the random forest classifier is not being correctly trained or applied to the data. The reason why may have to do with what you have in your transcriptome or how you have defined your TP and/or TN set, but it's hard to know without more information.
Last, make sure that you are deleting all files in the output folder before re-running the ML filter. When a pre-trained random forest classifier is found in the output folder, it is automatically loaded and applied to the data. If this is happening, you will see the following in the log:
Random forest classifier already exists in output directory: loading randomforest.RData object.
***Note: this will skip classifier training.
If you have modified TP and TN sets and wish to train a new model,
delete randomforest.RData or provide a different output directory.
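In practice that means something like (assuming the `ml_sqanti_filter` output directory used earlier in this thread):

```shell
# Delete the cached classifier so the next run trains a fresh model;
# "randomforest.RData" is the filename quoted in the message above.
rm -f ml_sqanti_filter/randomforest.RData
```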
Ángeles
Hello,
I get exactly the same error; here is the log I get out of it. The only thing I can think of is that there aren't enough NNC transcripts. I expect the input GTF to have a lot of pretty low-quality predictions, so maybe the ML filter is not appropriate here (I tested the rules-based filter and it worked quite well).
Command:
sqanti3_filter.py ml --gtf ../results/tumour1_sqanti3_corrected.gtf \
-o tumour1_sqanti3_corrected_ml \
-d ../results \
../results/tumour1_sqanti3_classification.txt
Log:
Rscript (R) version 4.3.1 (2023-06-16)
Write arguments to results/tumour1_sqanti3_corrected_ml_params.txt...
Running SQANTI3 filtering...
/camp/home/ruizc/tracerx-lung/tctProjects/ruizc/apps/anaconda/envs/SQANTI3.env/bin/Rscript /nemo/lab/swantonc/working/ruizc/2023-06-02-test_ont_rna_tools/results/2024-04-04-test_sqanti3_v2/SQANTI3-5.2.1/utilities/filter/SQANTI3_MLfilter.R -c /nemo/lab/swantonc/working/ruizc/2023-06-02-test_ont_rna_tools/results/2024-04-04-test_sqanti3_v2/results//tumour1_sqanti3_classification.txt -o tumour1_sqanti3_corrected_ml -d ../results -t 0.8 -j 0.7 -i 60 -f False -e False -m False -z 3000
-------------------------------------------------
SQANTI3 Machine Learning filter
--------------------------------------------------
CURRENT ML FILTER PARAMETERS:
[1] "sqanti_classif: /nemo/lab/swantonc/working/ruizc/2023-06-02-test_ont_rna_tools/results/2024-04-04-test_sqanti3_v2/results/tumour1_sqanti3_classification.txt"
[2] "output: tumour1_sqanti3_corrected_ml"
[3] "dir: ../results"
[4] "percent_training: 0.8"
[5] "threshold: 0.7"
[6] "intrapriming: 60"
[7] "force_fsm_in: FALSE"
[8] "force_multi_exon: FALSE"
[9] "intermediate_files: FALSE"
[10] "max_class_size: 3000"
[11] "help: FALSE"
INITIAL ML CHECKS:
Reading SQANTI3 *_classification.txt file...
Checking data for mono and multi-exon transcripts...
***Note: ML filter can only be applied to multi-exon transcripts.
60153 multi-exon transcript isoforms found in SQ3 classification file.
Checking input data for True Positive (TP) and True Negative (TN) sets...
Warning message:
Training set not provided -will be created from input data.
Warning message:
Not enough (< 250) Novel Not in Catalog (NNC) + non-canonical transcripts.
Warning message:
Not enough (< 250) Novel Not in Catalog (NNC) transcripts, skipping ML filter.
***Note: try re-running ML filter with a user-defined TN set >=250 isoforms!
Wrote generated TP and TN lists to files:
../resultstumour1_sqanti3_corrected_ml_TP_list.txt
../results/tumour1_sqanti3_corrected_ml_TN_list.txt
-------------------------------------------------
Applying intra-priming filter to our dataset.
Intra-priming filtered transcripts:
FALSE TRUE
682991 21893
-------------------------------------------------
Writing filter results to classification file...
Wrote filter results (ML and intra-priming) to new classification table:
tumour1_sqanti3_corrected_ml_MLresult_classification.txt file.
Wrote isoform list (classified as non-artifacts by both ML and intra-priming
filters) to tumour1_sqanti3_corrected_ml_inclusion-list.txt file
-------------------------------------------------
SUMMARY OF MACHINE LEARNING + INTRA-PRIMING FILTERS:
Artifact Isoform
21893 682991
-------------------------------------------------
SQANTI3 ML filter finished successfully!
-------------------------------------------------
-------------------------------------------------
SQANTI3 Machine Learning filter report
--------------------------------------------------
Loading required package: magrittr
Reading ML result classification table...
Reading classifier variable importance table...
Error in `dplyr::mutate()`:
ℹ In argument: `variable = factor(variable) %>% forcats::fct_reorder(importance)`.
Caused by error:
! object 'variable' not found
Backtrace:
▆
1. ├─imp %>% ...
2. ├─dplyr::mutate(., variable = factor(variable) %>% forcats::fct_reorder(importance))
3. ├─dplyr:::mutate.data.frame(., variable = factor(variable) %>% forcats::fct_reorder(importance))
4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
5. │ ├─base::withCallingHandlers(...)
6. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
7. │ └─mask$eval_all_mutate(quo)
8. │ └─dplyr (local) eval()
9. ├─factor(variable) %>% forcats::fct_reorder(importance)
10. ├─forcats::fct_reorder(., importance)
11. │ └─forcats:::check_factor(.f)
12. ├─base::factor(variable)
13. └─base::.handleSimpleError(...)
14. └─dplyr (local) h(simpleError(msg, call))
15. └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted
Output written to: ../results/tumour1_sqanti3_corrected_ml.filtered.gtf
@MartinezRuiz-Carlos yes, that is exactly it. As the log indicates, the ML filter is being skipped, and only the intra-priming filter is being applied:
Warning message:
Not enough (< 250) Novel Not in Catalog (NNC) + non-canonical transcripts.
Warning message:
Not enough (< 250) Novel Not in Catalog (NNC) transcripts, skipping ML filter.
Note that this behavior is thoroughly described in the ML filter documentation.
The solution here would be to define your own TP/TN sets. Some guidance as to how to do that can be found in our recent SQANTI3 manuscript. But since you may not have a good guess of what your true and false positive transcripts look like and/or sufficient isoforms to train the model, running the rules filter, as you already did, is probably the best option.
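If you do want to try a user-defined set, a TP or TN list is just a text file with one isoform ID per line, which can be pulled from the classification table. A sketch (assumes the standard SQANTI3 columns `isoform` and `structural_category`; check the header of your own file first):

```shell
# Extract isoform IDs of one structural category from the classification
# table to seed a candidate TN list. Column positions are looked up from
# the header so the command does not depend on column order.
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; next}
            $col["structural_category"] == "novel_not_in_catalog" {print $col["isoform"]}' \
    tumour1_sqanti3_classification.txt > TN_candidates.txt
```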
@jelfman this confirms my previous guess: the error you reported is related to the ML filter not being correctly run (if run at all). That's why you don't have the variable importance table. You may want to try the rules filter instead.
I guess that, in addition to the warnings emitted while running the ML filter, the report could be improved to produce a more specific error message and/or skip the ML-related plots when the ML filter could not be run due to internal requirements. It's far from a major problem, but I'm making a note for developers in case they work on the report in the future and want to include this at some point.
Ángeles
Hi Carlos and Angeles,
I'm sorry I haven't replied to this -- this past week has been busy. Yes, I was able to confirm that there were insufficient TP/TN-designated samples to run, and the ML filter was not running. My confusion stemmed from seeing the successful intra-priming filter results presented in combination with ML-filter results, so this was user error.
Agreed with your suggestion, and it may just be worth including the ML error message at the bottom of the report.
Closing since @aarzalluz solved the users' questions.