GoekeLab/proActiv

Error in `[[<-`(`*tmp*`, name, value = new("SimpleIntegerList", elementType = "integer",

mmpust opened this issue · 4 comments

mmpust commented

Hi,
I am running into the following error message:

# download file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/968/075/GCF_019968075.1_ASM1996807v1/GCF_019968075.1_ASM1996807v1_genomic.gff.gz
# unzip and replace seqnames
gunzip GCF_019968075.1_ASM1996807v1_genomic.gff.gz
sed -i 's/NZ_CP065381.1/chr1/g' GCF_019968075.1_ASM1996807v1_genomic.gff
# Run in R
library(GenomicFeatures)
library(proActiv)
txdb <- makeTxDbFromGFF("GCF_019968075.1_ASM1996807v1_genomic.gff")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "Name" attribute are not unique
tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
tx

GRanges object with 2695 ranges and 3 metadata columns:
         seqnames          ranges strand |         gene_id     tx_id       tx_name
            <Rle>       <IRanges>  <Rle> | <CharacterList> <integer>   <character>
     [1]     chr1          1-1323      + |   I5Q00_RS00005         1          dnaA
     [2]     chr1       1724-2830      + |   I5Q00_RS00010         2          dnaN
     [3]     chr1       2842-3057      + |   I5Q00_RS00015         3 I5Q00_RS00015
     [4]     chr1       3061-4182      + |   I5Q00_RS00020         4 I5Q00_RS00020
     [5]     chr1       4270-4521      + |   I5Q00_RS00025         5 I5Q00_RS00025
     ...      ...             ...    ... .             ...       ...           ...
  [2691]     chr1 2813427-2814548      - |   I5Q00_RS13455      2691 I5Q00_RS13455
  [2692]     chr1 2814554-2815636      - |   I5Q00_RS13460      2692 I5Q00_RS13460
  [2693]     chr1 2815899-2816198      - |   I5Q00_RS13465      2693          yidD
  [2694]     chr1 2816195-2816593      - |   I5Q00_RS13470      2694          rnpA
  [2695]     chr1 2816841-2816975      - |   I5Q00_RS13475      2695          rpmH
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
promoterAnnotation <- preparePromoterAnnotation(txdb, species='Faecalibacterium')

Extract exons by transcripts...
Identify overlapping first exons for each gene...
Prepare mapping between transcripts, tss, promoters and genes...
Prepare annotated intron ranges...
Annotating reduced exon ranges...

Error in `[[<-`(`*tmp*`, name, value = new("SimpleIntegerList", elementType = "integer",  : 
  2667 elements in value to replace 2695 elements
  
In addition: Warning messages:
1: In .set_group_names(grl, use.names, txdb, by) :
  some group names are NAs or duplicated
2: In .set_group_names(ans, use.names, x, "tx") :
  some group names are NAs or duplicated
3: There was 1 warning in `mutate()`.
ℹ In argument: `IntronEndRank = max(.data$INTRONRANK) - .data$INTRONRANK + 1`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf 
4: There was 1 warning in `mutate()`.
ℹ In argument: `MinIntronRank = min(.data$INTRONRANK)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf 
5: There was 1 warning in `mutate()`.
ℹ In argument: `MaxIntronRank = max(.data$INTRONRANK)`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf 
6: There was 1 warning in `mutate()`.
ℹ In argument: `TxWidthMax = max(.data$TxWidth)`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf 
7: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `MinMergedIntronRank = min(.data$MinIntronRank)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
ℹ Run dplyr::last_dplyr_warnings() to see the 2 remaining warnings. 
8: There was 1 warning in `filter()`.
ℹ In argument: `.data$MinIntronRank == min(.data$MinIntronRank)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf 

Any ideas what I can do about it?
Thanks in advance!

Hi @mmpust, it looks like the transcript names are not unique, which could be causing the problem. Can you make the transcript names unique, for example by adding a running index?
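
One way to add such a running index is to edit the GFF before building the TxDb. The following is only a minimal sketch on a toy file: the records, the file names `toy.gff` and `toy.unique.gff`, and the `_2`, `_3` suffix scheme are assumptions for illustration, and whether unique names alone resolve the error above has not been confirmed in this thread. It appends a counter to every repeated `Name=` value within the same feature type and leaves `ID` and `Parent` untouched.

```shell
# Toy GFF3 with a duplicated Name attribute (hypothetical records).
printf 'chr1\tRefSeq\tgene\t1\t100\t.\t+\t.\tID=gene-A;Name=rrf\n'    > toy.gff
printf 'chr1\tRefSeq\tgene\t200\t300\t.\t+\t.\tID=gene-B;Name=rrf\n'  >> toy.gff
printf 'chr1\tRefSeq\tgene\t400\t500\t.\t-\t.\tID=gene-C;Name=dnaA\n' >> toy.gff

awk 'BEGIN { FS = OFS = "\t" }
/^#/ { print; next }                       # pass header/comment lines through
{
  if (match($9, /(^|;)Name=[^;]*/)) {
    s = RSTART; l = RLENGTH                # position of the Name=... attribute
    name = substr($9, s, l)
    sub(/^;?Name=/, "", name)              # keep only the attribute value
    n = ++seen[$3 SUBSEP name]             # count per (feature type, name)
    if (n > 1) {
      # second and later occurrences get a suffix: rrf -> rrf_2, rrf_3, ...
      $9 = substr($9, 1, s + l - 1) "_" n substr($9, s + l)
    }
  }
  print
}' toy.gff > toy.unique.gff
```

If the file already contains a name such as `rrf_2`, the suffix would collide with it, so it is worth checking the result for remaining duplicates before passing the new file to `makeTxDbFromGFF`.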

I ran into the same problem when using a GTF file for the annotation. How can I solve it?
Thank you very much!

library(proActiv)
## From GTF file path
gtf.file <- '/mnt/ruiyanhou/nfs_share2/RNA_seq_organ_species/chicken/ref_files/galGal4.ensGene.gtf'

promoterAnnotation.gencode.v34.subset <- preparePromoterAnnotation(file = gtf.file,
                                                                   species = 'Gallus_gallus')

The error looks like this:

[screenshot of the error; image not preserved]

I haven't had time to look into this in detail yet, but could you first try subsetting your GTF to the standard chromosomes:

awk -F'\t' '$1 ~ /^chr([0-9]+|M)$/' galGal4.ensGene.gtf > galGal4.ensGene.filtered.gtf

then create the transcript database and build promoter annotations?

library(GenomicFeatures)
library(proActiv)

path <- 'galGal4.ensGene.filtered.gtf'
txdb <- makeTxDbFromGFF(path)
anno <- preparePromoterAnnotation(txdb = txdb, species = 'galGal')
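
Before filtering, it may help to see which sequence names the pattern keeps and which it drops. The sketch below is run on a toy file; the rows, the contig name `chrUn_JH375236`, and the file names `toy.gtf` and `seqname_report.tsv` are hypothetical. It applies the same regular expression as the `awk` filter above and tallies rows per sequence name. Note that the pattern as written only matches numbered chromosomes and `chrM`, so names such as `chrZ` or `chrW`, if present in the file, would be dropped along with unplaced contigs.

```shell
# Toy GTF-like input (hypothetical rows; only column 1 matters here).
printf 'chr1\tx\texon\t1\t10\n'           > toy.gtf
printf 'chr2\tx\texon\t1\t10\n'           >> toy.gtf
printf 'chrM\tx\texon\t1\t10\n'           >> toy.gtf
printf 'chrZ\tx\texon\t1\t10\n'           >> toy.gtf
printf 'chrUn_JH375236\tx\texon\t1\t10\n' >> toy.gtf

# Tally rows per sequence name and mark whether the filter pattern keeps them.
awk -F'\t' '!/^#/ {
  status = ($1 ~ /^chr([0-9]+|M)$/) ? "keep" : "drop"
  count[status "\t" $1]++
}
END { for (k in count) print k "\t" count[k] }' toy.gtf | sort > seqname_report.tsv
```

Each line of `seqname_report.tsv` lists `keep` or `drop`, the sequence name, and the number of rows, which makes it easy to widen the pattern if a chromosome of interest would be lost.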

Thank you for your quick response!
Your suggestion works! Thank you very much!

[screenshot; image not preserved]

However, I have run into another problem, which I have described in a new issue. Could you help me with it? Thank you!