NAs produced by integer overflow

Question

NAs produced by integer overflow

Closed this issue 3 years ago · 10 comments

Hi,
Thanks for the package, I have been trying it out lately.
There are two places in the code where I encounter 'NAs produced by integer overflow' error.

From this vignette:
https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html
Section 1.4.2 and 1.6 (DAGchainer, type = "bp").

The error is a follows:

Problem with `mutate()` column `gene.mid`.
`gene.mid = (start + end)/2`.

This is most likely because I have large chromosomes approaching 2Gb, so the 'start + end' operation results in ~4Gb.

The temporary fix I suppose could be to change the formula to (start + (end - start)/2) (maybe not sure!).

Agnieszka

Answer 1 · 2021-12-07T08:20:16.000Z

Dear Agniezka,

thank you to bring up this issue.

Could you tell me with which species you are working to look into the GFF or could you provide the GFF line for the gene that is causing the problem?

I guess this error is in the rbh2dagchainer.R and also maybe in the cds2genepos.R function and might effect all code which uses the dplyr::mutate.

I will try to fix asap.

Best regards

Answer 2 · 2021-12-07T08:32:23.000Z

Thanks! The genome is not publicly available but here is a line from gff

chr1L   Liftoff gene    1805129445      1805130092      .       +       .       ID=Hed006089
chr1L   Liftoff mRNA    1805129445      1805130092      .       +       .       ID=Hed006089.1;Parent=Hed006089
chr1L   Liftoff exon    1805129445      1805130092      .       +       .       Parent=Hed006089.1
chr1L   Liftoff CDS     1805129445      1805130092      .       +       0       Parent=Hed006089.1

The second error looks as follows (from the rbh2dagchainerm but only using "bp"):

1: Problem with mutate() column gene1.mid.
gene1.mid = (gene1.start + gene1.end)/2.
NAs produced by integer overflow
2: Problem with mutate() column gene2.mid.
gene2.mid = (gene2.start + gene2.end)/2.
NAs produced by integer overflow

I have one more question. Is there a quick way to calculate KaKs only for rbh pairs identified as syntenic by DAGchainer?

Agnieszka

Answer 3 · 2021-12-07T09:44:23.000Z

Hi regarding the second question one way to do it would be starting from your dagchainer results and the species1_specie2_crbh results.

I will just refer here to the Vignette species1: ARATHA; species2: ARALYR

INPUT1:ARATHA_ARALYR_crbh.dagchainer.bp

INPUT2: ARATHA_ARALYR_crbh

# get idx of genes to retain
syntenic.pairs.idx <- which( paste0(ARATHA_ARALYR_crbh$crbh.pairs$aa1, ARATHA_ARALYR_crbh$crbh.pairs$aa2)
%in%
paste0(ARATHA_ARALYR_crbh.dagchainer.bp$gene1.seq.id, ARATHA_ARALYR_crbh.dagchainer.bp$gene2.seq.id))

# create a new results just retaining the dagchainer pairs based on ARATHA_ARALYR_crbh
ARATHA_ARALYR_crbh_syntenic <- ARATHA_ARALYR_crbh
ARATHA_ARALYR_crbh_syntenic$crbh.pairs <- ARATHA_ARALYR_crbh_syntenic$crbh.pairs[syntenic.pairs.idx,]

# follow the vignette to calculate Ka/Ks
## calculate Ka/Ks using the CRBHit pairs and multiple threads
ARATHA_ARALYR_crbh_syntenic.kaks.Li <- rbh2kaks(rbhpairs = ARATHA_ARALYR_crbh_syntenic,
    cds1 = ARATHA.cds.longest,
    cds2 = ARALYR.cds.longest,
    model = "Li",
    threads = 4)

I have seen that even among the dagchainer pairs there are some duplicates, so do not wonder if this is not equal:

e.g.

> dim(ARATHA_ARALYR_crbh.dagchainer.bp)
[1] 25007    15
> length(syntenic.pairs.idx)
[1] 24754

This is related to some so called "reverse" syntnic groups in dagchainer.

I will also include that in the upgrade to have this "reverse" groups splitted so that there will be no more duplicated groups.

Anyhow, you should be fine running the Ka/Ks just on the selected pairs, since they will not change even if the grouping is altered into "reverse" and "normal"

This is e.g. the output for dagchainer and a duplicated group, you will see that the naming until now was based on the aa1 and aa2 and not considering the "(reverse) flag" which I will integrate now.

  [1] "## alignment AA1:NC_000932.1 vs. AA2:NC_034379.1 Alignment #1  score = 3524.7 (num aligned pairs: 79):"                 
  [2] "## alignment AA1:NC_000932.1 vs. AA2:NC_034379.1 (reverse) Alignment #1  score = 294.0 (num aligned pairs: 6):"

Answer 4 · 2021-12-07T10:40:40.000Z

@agolicz I have changed the mid point calculation, maybe you can provide feedback, if now the integer overflow is gone.

Thank you in anticipation to contribute that issue.

I have also now changed the dagchainer output to include the "reverse" flag.

Best regards

Kristian

Answer 5 · 2021-12-07T12:12:24.000Z

Just tested the new version, the issue with rbh2dagchainer seems to be fixed now.
Many thanks!
It may be also worth it up update section 1.4.2 of the KaKs vignette to the new midpoint calculation.

ARATHA.gff.mRNA.longest <- ARATHA.gff.mRNA %>% dplyr::arrange(seqname, start,
  gene.id, desc(mRNA.len)) %>% dplyr::distinct(gene.id, .keep_all = TRUE) %>%
  dplyr::mutate(gene.mid = (start+end)/2)

Answer 6 · 2021-12-07T12:30:57.000Z

Hi,
should be already in there it just takes some time until the webpage is loading the new content.

I close the issue.

If you have other issues or if I can help running CRBHits, please let me know.

Best regards

Answer 7 · 2021-12-11T23:50:29.000Z

I possibly found one more issue that could be related to the chromosomes sizes? Not sure...
Trying to use the function plot_dagchainer gives the following error:

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.
Error in FUN(X[[i]], ...) : subscript out of bounds
Calls: <Anonymous> ... ggplot_gtable.ggplot_built -> <Anonymous> -> f -> lapply -> FUN -> lapply
In addition: Warning message:
Removed 175 rows containing missing values (geom_point).

Code snippet:

HED_TIF_crbh.dagchainer.idx <- rbh2dagchainer(rbhpairs = HED_TIF_crbh, gene.position.cds1 = HED.INFO, gene.position.cds2 = TIF.INFO, type = "idx", gap_length = 1, max_dist_allowed = 20) # no issues, expected no of genes etc
png("Chrs.png", width = 9, height = 9, units = "in", res=300)
plot_dagchainer(HED_TIF_crbh.dagchainer.idx)
dev.off()

> packageVersion("ggplot2")
[1] ‘3.3.5’
> packageVersion("CRBHits")
[1] `‘0.0.3’`

Answer 8 · 2021-12-12T00:17:06.000Z

Hi @agolicz,
would it be possible to share securely the data, so that I can have a deeper look why the plotting fails?

Thank you in anticipation.

I could send a cryptshare upload link if you want.

Answer 9 · 2021-12-12T00:49:52.000Z

I have the HED_TIF_crbh.dagchainer.idx.RData object which I can share without much issue.
Would that be enough to debug?

Agnieszka

Answer 10 · 2021-12-12T08:51:00.000Z

Hi, yes dagchainer result file should be enough.