ConesaLab/SQANTI3

Errors in running sqanti3_filter.py

Upendra19993 opened this issue · 6 comments

Hi all,

I am running SQANTI3, starting with the example dataset you provide. When I run the filtering step, I get an error and several warning messages. Could you please have a look and help me resolve this? I have copied the complete output below for reference; the error messages are at the end.

(base) [uqwwijes@bun101 SQANTI3_output_original_names_after_reinstallation2]$ sqanti3_filter.py ml UHR_chr22_classification.txt
Rscript (R) version 4.3.1 (2023-06-16)
Output directory not defined. All the outputs will be stored at /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2 directory
Output name not defined. All the outputs will have the prefix UHR_chr22
Write arguments to /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_params.txt...

Running SQANTI3 filtering...

/sw/local/rocky8/noarch/qcif/software/miniconda3/envs/sqanti3_5.2/bin/Rscript /sw/local/rocky8/noarch/qcif/software/SQANTI3-5.2/utilities/filter/SQANTI3_MLfilter.R -c /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_classification.txt -o UHR_chr22 -d /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2 -t 0.8 -j 0.7 -i 60 -f False -e False -m False -z 3000

     SQANTI3 Machine Learning filter

CURRENT ML FILTER PARAMETERS:

[1] "sqanti_classif: /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_classification.txt"
[2] "output: UHR_chr22"
[3] "dir: /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2"
[4] "percent_training: 0.8"
[5] "threshold: 0.7"
[6] "intrapriming: 60"
[7] "force_fsm_in: FALSE"
[8] "force_multi_exon: FALSE"
[9] "intermediate_files: FALSE"
[10] "max_class_size: 3000"
[11] "help: FALSE"

    INITIAL ML CHECKS:

Reading SQANTI3 *_classification.txt file...

Checking data for mono and multi-exon transcripts...

     ***Note: ML filter can only be applied to multi-exon transcripts.

     3338 multi-exon transcript isoforms found in SQ3 classification file.

Checking input data for True Positive (TP) and True Negative (TN) sets...

    Warning message:
     Training set not provided -will be created from input data.

Using Novel Not In Catalog non-canonical isoforms as True Negatives for training.

     - Total NNC non-canonical isoforms: 288

Not enough (< 250) Reference Match transcript isoforms among FSM,
all FSM transcripts will be used as Positive set.

     - Total FSM isoforms: 506

Balancing number of isoforms in TP and TN sets...

    Minimum set size: 288 transcripts.

    Sampled 288 transcripts to define final TP and TN sets.

Wrote generated TP and TN lists to files:

    /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_TP_list.txt

    /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_TN_list.txt

    ML DATA PREPARATION:

Aggregating FL counts across samples (if more than one sample is provided)...

Replacing NAs with appropriate values for ML...

Handling factor columns...

Handling integer columns...

Removing variables with near-zero variance...
Removed columns:
[1] "chrom" "RTS_stage" "n_indels"
[4] "n_indels_junc" "dist_to_CAGE_peak" "within_CAGE_peak"
[7] "dist_to_polyA_site" "within_polyA_site" "polyA_dist"

Removing highly correlated features... (correlation threshold = 0.9).

All correlations <= 0.9

    List of removed features:
    No features removed.

    RANDOM FOREST ALGORITHM RUN:

Creating positive and negative sets for classifier training and testing...

Finished creating training data set.

Partitioning data into training and test sets...

    Proportion of the data to be used for training: 0.8

Description of the training set:

    Positive and negative transcript isoforms in training set:

full-splice_match novel_not_in_catalog
231 231

    Positive and negative transcript isoforms in test set:

full-splice_match novel_not_in_catalog
57 57


Training Random Forest Classifier...

    ***Note: this can take up to several hours.

Pre-defined Random Forest parameters (supplied to caret::trainControl()):
- Downsampling in training set (sampling = 'down').
- 10x cross-validation (repeats = 10).

Loading required package: ggplot2
Loading required package: lattice

Random forest training finished.

Saved generated classifier to randomforest.RData file.


Random forest evaluation: applying classifier to test set...

Test set evaluation results:

AUC, Sensitivity and Specificity on test set:
ROC Sens Spec
0.9713758 0.7719298 0.9824561

Writing summary to testSet_summary.txt file.

Confusion matrix:
Reference
Prediction POS NEG
POS 44 1
NEG 13 56

Writing confusion matrix and statistics to output files:
testSet_confusionMatrix.txt
testSet_stats.txt

Global variable importance in Random Forest classifier:
Overall
min_cov 35.7541546
min_sample_cov 35.5178859
bite 27.7809271
gene_exp 18.3663935
sd_cov 17.7682316
predicted_NMD 14.7867100
iso_exp 10.6466531
diff_to_gene_TSS 9.1437091
ratio_TSS 8.2086366
FSM_class 7.5753832
length 6.9421819
diff_to_gene_TTS 5.9650142
exons 5.8164596
perc_A_downstream_TTS 5.2488334
ratio_exp 0.6994311
coding 0.5164620

Variable importance table saved as classifier_variable-importance_table.txt

Calculating and printing test set ROC curves...
Setting levels: control = 1, case = 2
Setting direction: controls > cases
Setting levels: control = 1, case = 2
Setting direction: controls < cases

ROC curves saved to testSet_ROC_curve.pdf file. Includes:
- ROC curve with unbalanced classes.
- ROC curve with balanced classes.


Applying Random Forest classifier to input dataset...

Random forest prediction finished successfully!

Random forest classification results:

Negative Positive
1648 1690
Warning message:
package ‘ggplot2’ was built under R version 4.3.2


Applying intra-priming filter to our dataset.

Intra-priming filtered transcripts:

FALSE TRUE
3213 712


Writing filter results to classification file...

    Wrote filter results (ML and intra-priming) to new classification table:
    UHR_chr22_MLresult_classification.txt file.

    Wrote isoform list (classified as non-artifacts by both ML and intra-priming
    filters) to UHR_chr22_inclusion-list.txt file

SUMMARY OF MACHINE LEARNING + INTRA-PRIMING FILTERS:

Artifact Isoform
2177 1748


SQANTI3 ML filter finished successfully!



     SQANTI3 Machine Learning filter report

Loading required package: magrittr

Reading ML result classification table...

Reading classifier variable importance table...
Rows: 16 Columns: 2
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): variable
dbl (1): importance

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Reading ML filter parameters...
Rows: 53 Columns: 2
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): parameter, value

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Reading ML performance statistics...
Rows: 18 Columns: 2
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): metric
dbl (1): value

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Rows: 4 Columns: 3
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Prediction, Reference
dbl (1): Freq

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Warning message:
There were 2 warnings in dplyr::mutate().
The first warning was:
ℹ In argument: structural_category = %>%(...).
Caused by warning:
! Unknown levels in f: genic_intron
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 4.3.2
Warning in install.packages("RColorConesa") :
'lib = "/sw/local/rocky8.6/noarch/qcif/software/miniconda3/envs/sqanti3_5.2/lib/R/library"' is not writable
Error in install.packages("RColorConesa") : unable to install packages
Calls: suppressMessages -> withCallingHandlers -> install.packages
Execution halted
(base) [uqwwijes@bun101 SQANTI3_output_original_names_after_reinstallation2]$

Many thanks,
Upendra.

Hi @Upendra19993,
Unfortunately, it seems you are missing a small R package used for coloring the resulting plots. When you installed SQANTI3, did you create the conda environment "SQANTI3.env"? Creating that environment installs all dependencies at the versions SQANTI3 needs (you can check how to do this in the SQANTI3 documentation: https://github.com/ConesaLab/SQANTI3/wiki/Dependencies-and-installation#2-creating-the-conda-environment ).

It seems you are running SQANTI3 from your "base" environment. You should either create the SQANTI3 environment and then run "conda activate SQANTI3.env", or install the RColorConesa package (https://cran.r-project.org/web/packages/RColorConesa/index.html ) in your base environment.
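
A quick way to catch this before launching the filter is to check which conda environment is active. The sketch below is only an illustration: the function name check_sqanti_env and the EXPECTED_ENV variable are made up for this example, and the environment name SQANTI3.env is assumed from the prompt shown later in this thread. It relies on CONDA_DEFAULT_ENV, which conda sets to the active environment's name (for environments activated by path it may hold the path instead).

```shell
# Sketch: warn if the active conda environment is not the one SQANTI3 expects.
# EXPECTED_ENV is an assumption; change it to the name you gave the environment.
EXPECTED_ENV="${EXPECTED_ENV:-SQANTI3.env}"

check_sqanti_env() {
    # conda exports CONDA_DEFAULT_ENV with the name of the active environment.
    active="${CONDA_DEFAULT_ENV:-none}"
    if [ "$active" = "$EXPECTED_ENV" ]; then
        echo "ok: running in $active"
    else
        echo "warning: active environment is '$active', expected '$EXPECTED_ENV'"
        return 1
    fi
}
```

It could then be used as a guard, for example: check_sqanti_env && sqanti3_filter.py ml UHR_chr22_classification.txt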

Hi carolinamonzo,

No, I didn't install the conda environment "SQANTI3.env" when I installed SQANTI3.

I have now installed SQANTI3 in the conda environment "SQANTI3.env" and rerun the filtering step. The previous error about the missing RColorConesa package is gone, but I now get warnings about ggplot2 and an error about accessing CRAN to install or load packages.
The messages are below.

Warning message:
There were 2 warnings in dplyr::mutate().
The first warning was:
ℹ In argument: structural_category = %>%(...).
Caused by warning:
! Unknown levels in f: genic_intron
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 4.3.2
Error in contrib.url(repos, type) :
trying to use CRAN without setting a mirror
Calls: suppressMessages ... withCallingHandlers -> install.packages -> startsWith -> contrib.url
Execution halted
(SQANTI3.env) [uqwwijes@bunya3 SQANTI3-5.2]$

I get all the output files, but I am not sure whether they are accurate given these messages. Could you please have a look and suggest how to resolve this?

Many thanks,
Upendra.

Hi @Upendra19993, the warnings are nothing to worry about. I'll update the SQANTI3 installation steps so the warning no longer appears.
In your case, it installed from the cloud since the source wasn't specified. You can go ahead and continue with your analysis; the warnings you found have not affected your data.

Best,
Carolina.

Many thanks, carolinamonzo!

I ran into the same problem. It seems it has not been fixed in the newest SQANTI3-5.2?

@CaiCheng1996 Try this: in R, open your .Rprofile with
file.edit(".Rprofile")
and add the line
options(repos = c(CRAN = "https://cloud.r-project.org"))

That did the trick, amazing.
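
For anyone who prefers to apply the .Rprofile fix from the shell, here is a minimal sketch. The function name set_cran_mirror and the default path ~/.Rprofile are assumptions for illustration. R reads a .Rprofile in the current working directory if one exists and otherwise the one in the home directory, so pass the path that applies to your setup; whether SQANTI3's own Rscript calls pick it up has not been checked here.

```shell
# Sketch: add a CRAN mirror to an R profile file unless it already sets repos.
# The default path and the mirror URL can be overridden with arguments 1 and 2.
set_cran_mirror() {
    profile="${1:-$HOME/.Rprofile}"
    mirror="${2:-https://cloud.r-project.org}"
    # Leave the file alone if it already contains an options(repos ...) line.
    if [ -f "$profile" ] && grep -qF 'options(repos' "$profile"; then
        echo "repos already set in $profile"
        return 0
    fi
    # Append the same line suggested in the comment above.
    printf 'options(repos = c(CRAN = "%s"))\n' "$mirror" >> "$profile"
    echo "added CRAN mirror to $profile"
}
```

Calling set_cran_mirror with no arguments targets ~/.Rprofile; running it a second time leaves the file unchanged.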