molgenis/systemsgenetics

Downstreamer : java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException

nrosewick opened this issue · 3 comments

I am trying to apply Downstreamer to my dataset. I couldn't find any guidance on how to choose the number of permutations, so I selected 100,000 for the main permutation count (--permutations) and 100 for the other permutation parameters, i.e. permutationGeneCorrelations, permutationPathwayEnrichment, etc.

During STEP1 I got an ArrayIndexOutOfBoundsException.

Here's the command I used:

mvn exec:java -Dexec.mainClass=nl.systemsgenetics.downstreamer.Downstreamer -Dexec.args="--mode STEP1 --gwas zscore.txt --genes ensembl_genes.txt --maf 0.1 --output out_step1 --permutations 100000 --variantCorrelation 0.95 --window 10000 --permutationGeneCorrelations 100 --permutationPathwayEnrichment 100 --permutationFDR 100 --genePruningR 0.8 --referenceGenotypes plink_genotype --referenceGenotypeFormat PLINK_BED -t 6 --debug"

Any idea how to solve this issue?

Thank you

Here's the log file:

  /---------------------------------------\
  |             Downstreamer              |
  |                                       |
  |  University Medical Center Groningen  |
  \---------------------------------------/

       --- Version: 1.24-SNAPSHOT ---

More information: https://github.com/molgenis/systemsgenetics/wiki/Downstreamer

Current date and time: 2021-11-05 10:48:29

Supplied options:
 * Mode: STEP1
 * Ouput path: out_step1
 * Gwas Z-score matrix: zscore.txt
 * Reference genotype data: plink_genotype 
 * Reference genotype data type: Plink BED / BIM / FAM files
 * MAF filter: 0.1
 * Gene window extend in bases: 10 000
 * Initial number of permutations to calculate gene p-values: 100 000
 * Max number of rescue permutations to calculate gene p-values if RUBEN has failed: 100 000
 * Max correlation between variants: 0.95
 * Correcting for lambda inflation: off
 * Save which variants that are used per gene to calculate the gene p-value: off
 * Gene info file: ensembl_genes.txt
 * Number of threads to use: 6
 * Number of permutations to use to calculate gene correlations: 20
 * Number of permutations to use for pathway enrichments: 20
 * Window to calculate gene correlations for GLS: -1
 * Gene pruning r: 0.8
 * Ignoring gene correlations: off
 * Force normal gene p-values: off
 * Quantile normalize permuted gene p-values: off
 * Force normal pathway p-values: off
 * Exclude HLA during enrichment analysis: off
 * Save output as excel files: off
 * No pathway databases specified
 * Debug mode: on (this will result in many intermediate output files)
Number of phenotypes in GWAS matrix: 2
Number of variants in GWAS matrix: 7 622 488
Read 2107 samples from plink_genotype.fam
Read 7451463 SNPs from plink_genotype.bim
Done loading genotype data
Loaded 63736 genes
Prepared reference null distribution with 50 000 000 values
Gene p-value calculations  11% [===================>                    ]  7192/63736 (0:06:49 / 0:53:39)
Problem running: gene p-value calculator
java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 100000 out of bounds for length 100000
	at nl.systemsgenetics.downstreamer.gene.GenePvalueCalculator$CalculatorThread.run(GenePvalueCalculator.java:1217)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 100000 out of bounds for length 100000
	at nl.systemsgenetics.downstreamer.gene.GenePvalueCalculator.runPermutationsUsingEigenValues(GenePvalueCalculator.java:1098)
	at nl.systemsgenetics.downstreamer.gene.GenePvalueCalculator.runGene(GenePvalueCalculator.java:675)
	at nl.systemsgenetics.downstreamer.gene.GenePvalueCalculator.access$200(GenePvalueCalculator.java:58)
	at nl.systemsgenetics.downstreamer.gene.GenePvalueCalculator$CalculatorThread.run(GenePvalueCalculator.java:1214)
	... 1 more

downstreamer_step1.log

Thank you for reporting this and for pointing out that we should add the recommended number of permutations to the manual.

We did not see this error before. We will get back to you.

Hi @nrosewick, thanks for reporting. Off the top of my head I don't have an answer, but we will look into the cause as soon as we are able. This might take some time, though. We previously had a bug where a similar error was thrown when the number of permutations was not divisible by 10k. Perhaps you can try the following settings to see if that works:

  • Initial number of permutations to calculate gene p-values: 1 000
  • Max number of rescue permutations to calculate gene p-values if RUBEN has failed: 100 000
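
To rule out the earlier divisibility bug before starting a long run, a pre-flight check along these lines could help. It is only a sketch: the batch size of 10,000 is taken from the "divisible by 10k" remark above, whether the current code still depends on it is an assumption, and the value 25000 is a made-up example.

```shell
#!/usr/bin/env bash
# Batch size assumed from the "divisible by 10k" remark in this thread.
BATCH=10000

is_multiple_of_batch() {
  # Succeeds (exit status 0) when $1 is a positive multiple of BATCH.
  local n="$1"
  [ "$n" -gt 0 ] && [ $(( n % BATCH )) -eq 0 ]
}

# 100000 is the value from the original command; 25000 is hypothetical.
for n in 100000 25000; do
  if is_multiple_of_batch "$n"; then
    echo "$n: multiple of $BATCH"
  else
    echo "$n: not a multiple of $BATCH"
  fi
done
```

Note that the 100 000 used in the failing run is itself a multiple of 10k, so this check alone would not explain the error reported here.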

I did notice that the permutation counts in the command you posted and those in the log don't match for the pathway and gene correlation permutations: they are listed as 20 in the log file, but as 100 in the command. If you are sure that the log and command belong to the same run, @nrosewick, we should also look for an issue with input parsing, @PatrickDeelen.
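
One way to check whether a command and a log belong together is to compare each requested value with the one echoed under "Supplied options". The helper names below (`flag_value`, `logged_value`, `compare`) are made up for this sketch; the flag names and option labels are copied from the command and log earlier in this thread.

```shell
#!/usr/bin/env bash

flag_value() {
  # $1 = flag name (e.g. --permutationFDR), $2 = full argument string.
  # Prints the token that follows the flag.
  printf '%s\n' "$2" | awk -v flag="$1" \
    '{ for (i = 1; i < NF; i++) if ($i == flag) { print $(i + 1); exit } }'
}

logged_value() {
  # $1 = option label as printed in the log, $2 = log file.
  # Drops everything up to the last ": " and removes the spaces used
  # as thousands separators ("100 000" -> "100000").
  grep -F -- "* $1:" "$2" | head -n 1 | sed -e 's/^.*: *//' -e 's/ //g'
}

compare() {
  # $1 = flag, $2 = log label, $3 = argument string, $4 = log file.
  local requested logged
  requested="$(flag_value "$1" "$3")"
  logged="$(logged_value "$2" "$4")"
  if [ "$requested" = "$logged" ]; then
    echo "OK       $1: $requested"
  else
    echo "MISMATCH $1: requested $requested, logged $logged"
  fi
}

# Example call (log file name taken from the attachment in this thread):
#   compare --permutationGeneCorrelations \
#     "Number of permutations to use to calculate gene correlations" \
#     "$ARGS" downstreamer_step1.log
```

With the command and log posted above, this comparison would report a mismatch for the gene correlation permutations (100 requested, 20 logged).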

On a side note, we don't recommend fewer than 10k permutations for calculating the gene correlation matrices, as we previously observed that they are unstable in that case. The same goes for the permutations used to estimate the p-values. Of course, for testing this is fine. Please find the recommended permutation settings below. We realise that this is very computationally intensive, but it is key to getting stable results.

# Recommended permutation settings
# Permutations for PASCAL
PERMUTATIONS=100000
# Permutations if PASCAL fails
PERMUTATION_RESCUE=10000000
# Number of permutations (Random GWASs) for pathway enrichment
PERMUTATION_PATHWAY=10000
# Number of permutations (random GWASs) used to calculate gene-gene correlation
PERMUTATION_GENECOR=10000
# Number of permutations (random GWASs) for FDR calculation
PERMUTATION_FDR=100
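
As a sketch of how these variables might be passed on, the snippet below maps them to the flags used in the command at the top of this thread. The mapping is an assumption based on matching that command against its log output. The flag for the rescue permutations does not appear anywhere in this thread, so PERMUTATION_RESCUE is deliberately left out rather than guessed.

```shell
#!/usr/bin/env bash
# Recommended values from the settings above.
PERMUTATIONS=100000
PERMUTATION_PATHWAY=10000
PERMUTATION_GENECOR=10000
PERMUTATION_FDR=100

# Assumed mapping of variables to the flags seen in the original command.
PERM_ARGS="--permutations ${PERMUTATIONS}"
PERM_ARGS="${PERM_ARGS} --permutationGeneCorrelations ${PERMUTATION_GENECOR}"
PERM_ARGS="${PERM_ARGS} --permutationPathwayEnrichment ${PERMUTATION_PATHWAY}"
PERM_ARGS="${PERM_ARGS} --permutationFDR ${PERMUTATION_FDR}"

# Print the fragment so it can be pasted into -Dexec.args="...".
echo "${PERM_ARGS}"
```

The printed fragment would replace the corresponding permutation options in the original -Dexec.args string; all other options stay as they were.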

Did it work out using the recommended number of permutations suggested by Olivier?

If needed we are happy to run Downstreamer for you if you can share your summary statistics with us.