Issues with importRdata()
R-relota opened this issue · 1 comments
Hello,
Currently, I am trying to run IsoformSwitchAnalyzeR with Kallisto counts. After successfully using importIsoformExpression() to extract my count data, I wanted to set up the switchAnalyzeRlist. The kallisto index was built using Homo_sapiens.GRCh38.cdna.all.fa.gz from Ensembl version 111. The corresponding GTF file that I used was Homo_sapiens.GRCh38.110.chr_patch_hapl_scaff.gtf.gz.
I already checked whether my transcripts have version numbers. I also tried GTF files from older Ensembl versions (110, 109, ...).
It seems like I have a different set of transcripts in the GTF. Which GTF should I use?
aSwitchList <- importRdata(
    isoformCountMatrix = kallistoQuant$counts,
    isoformRepExpression = kallistoQuant$abundance,
    designMatrix = myDesign,
    isoformExonAnnoation = file.path("C://Users//OneDrive//Desktop//Homo_sapiens.GRCh38.110.chr_patch_hapl_scaff.gtf.gz"),
    isoformNtFasta = file.path("C://Users//OneDrive//Desktop//Homo_sapiens.GRCh38.cdna.all.fa.gz"),
    #fixStringTieAnnotationProblem = TRUE,
    removeNonConvensionalChr = TRUE,
    ignoreAfterPeriod = TRUE
    #showProgress = FALSE
)
The following error occurred:
Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = kallistoQuant$counts, isoformRepExpression = kallistoQuant$abundance, :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
Either isforoms found in the annotation are not quantifed or vise versa.
Specifically:
187501 isoforms were quantified.
251758 isoforms are annotated.
Only 187501 overlap.
0 isoforms quantifed isoforms had no corresponding annoation
This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.
If there is no overlap (as in zero or close) there are two options:
- The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
- It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
Examples from expression matrix are : ENST00000592124, ENST00000635892, ENST00000522296
Examples of annoation are : ENST00000378731, ENST00000568656, ENST00000687234
Examples of isoforms which were only found im the quantification are :
If there is a large overlap but still far from complete there are 3 possibilites:
- The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
- If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <Ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <Ensembl_version>.chr.gtf
- One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
- It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
For more info see the FAQ in the vignette.
I hope you can help me fix this issue! Thanks in advance.
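For reference, the counts printed in the error message can be plugged into a Jaccard calculation. This is a minimal sketch, assuming the check is the usual intersection-over-union of the two isoform ID sets (the message only names "Jaccard similarity", so the exact formula used inside importRdata() is an assumption here). The variable names are illustrative only.

```r
# Counts copied from the error message above.
n_quantified <- 187501  # isoforms in the Kallisto quantification
n_annotated  <- 251758  # isoforms in the GTF
n_overlap    <- 187501  # isoforms found in both

# Assumed definition: Jaccard similarity = |intersection| / |union|
n_union <- n_quantified + n_annotated - n_overlap
jaccard <- n_overlap / n_union

# Isoforms that are annotated in the GTF but were never quantified
n_annotated_only <- n_annotated - n_overlap

print(round(jaccard, 3))
print(n_annotated_only)
```

Under that assumption the similarity comes out at roughly 0.745, below the 0.925 threshold, and the whole gap consists of 64257 isoforms that are in the GTF but not in the quantification, since every quantified isoform was found in the annotation.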
Hi! Since you used Kallisto, a reference-only workflow, the GTF file should correspond to the FASTA file you used to build the reference index, so version 111 should be correct. I guess the error comes from an incompatible GTF file, so please try the version 111 GTF. I also suggest taking a look at the fourth point in the error message and using ?importRdata for further info about the three arguments it mentions ('ignoreAfterBar', 'ignoreAfterSpace' and 'ignoreAfterPeriod'). If it's still unresolved, maybe you could share the files with me and I can take a look. 😊
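Before swapping files, it can help to see which IDs differ between the two sources. The following is a rough sketch in base R, not part of IsoformSwitchAnalyzeR: `compare_transcript_ids()` is a hypothetical helper, and the IDs passed to it are toy values for illustration only. It strips a trailing version suffix (similar in spirit to 'ignoreAfterPeriod') and reports what is unique to each side.

```r
# Hypothetical helper (not from IsoformSwitchAnalyzeR): compare two sets of
# transcript IDs after removing Ensembl version suffixes such as ".5".
compare_transcript_ids <- function(quant_ids, annot_ids) {
  strip_version <- function(x) sub("\\.[0-9]+$", "", x)
  quant_ids <- unique(strip_version(quant_ids))
  annot_ids <- unique(strip_version(annot_ids))
  shared <- intersect(quant_ids, annot_ids)
  list(
    only_quantified = setdiff(quant_ids, annot_ids),  # quantified, not annotated
    only_annotated  = setdiff(annot_ids, quant_ids),  # annotated, not quantified
    jaccard         = length(shared) / length(union(quant_ids, annot_ids))
  )
}

# Toy input for illustration; real IDs would come from your own files.
res <- compare_transcript_ids(
  quant_ids = c("ENST00000592124.1", "ENST00000635892.2"),
  annot_ids = c("ENST00000592124", "ENST00000635892", "ENST00000378731")
)
print(res$only_annotated)
print(res$jaccard)
```

With real data, the quantified IDs would be taken from the object returned by importIsoformExpression(), and the annotated IDs from the transcript_id field of the GTF (for example as read with rtracklayer::import()). Inspecting a few entries of `only_annotated` should show what kind of transcripts are missing from the index.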