gtonkinhill/panstripe

Hangs loading tree

Closed this issue · 12 comments

Had to add a few things to get this installed in Win11-64bit in R4.2. Main thing was had to add a few steps:

install.packages("remotes")
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ggtree")
remotes::install_github("gtonkinhill/panstripe")
Install seemed to complete fine.
Tried to then input existing data from roary. When it got to the step to load the tree data, started to load with a "1:" but then got an hourglass, nothing progressed and had

Steps taken:

library(panstripe)
library(ape)
library(patchwork)
set.seed(1234)
setwd("f:\DNAwork\Scohnii\roaryout")
phylo.file.name <- system.file("extdata", "accessory_binary_genes.fa.newick", package = "panstripe")
rtab.file.name <- system.file("extdata", "gene_presence_absence.Rtab", package = "panstripe")
pa <- read_rtab(rtab.file.name)

�[1mindexing�[0m �[34mgene_presence_absence.Rtab�[0m [===============================] �[32m2.15GB/s�[0m, eta: �[36m 0s�[0m

Rows: 2521 Columns: 51
── Column specification ────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (50): t36, t39, t37, t3, t44, t4, t20, t18, t25, t32, t50, t17, t10, t13, t14, t...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

tree <- read.tree(phylo.file.name)
1:

Waited 10 minutes and then stopped the current computation
inputfiles.zip

In playing around some more and seeing if a different Newick tree would suffice I now see that when I invoke 'tree <- read.tree(phylo.file.name) the R system presents a '1:' as though it wants an input and if I hit enter I get:
Warning message:
In read.tree(phylo.file.name) : empty character string.
There is definitely information in the Newick file, so it is not clear is the empty string for the variable phylo.file.name or for the file specified in ' phylo.file.name <- system.file("extdata", "Poppunk_viz_core_NJ.nwk", package = "panstripe")'? I suspect the error must relate to the contents of the file because the preceding command 'pa <- read_rtab(rtab.file.name)' seemed to work fine.

more experimentation: Whereas the command 'tree <- read.tree(phylo.file.name)' does not complete, the command 'mytree <- read.tree(file= "accessory_binary_genes.fa.newick")' does complete but then when I try to run the fit command I get this:

mytree <- read.tree(file = "accessory_binary_genes.fa.newick")
fit <- panstripe(pa, mytree)
Error in panstripe(pa, mytree) :
number of taxa in tree and p/a matrix need to be the same!
If I run length(mytree) the return is 5.

I then got the package TreeTools and loaded it which allowed me to do Consensus(mytree) and find out I have 106 tips and 104 nodes, but when I do View(pa) there are only 50 rows and 108 columns and the row/clm labels don't match the counts in the Rtab file I loaded (16000x80)

Hi,

I have not been able to reproduce the issue but don't have access to a windows machine so can not check there.
The read.tree command is from the ape package which is very stable and should run on windows. Are you able to run the following checks?

library(ape)
library(panstripe)

tree <- read.tree("~/Downloads/inputfiles/accessory_binary_genes.fa.newick")
pa <- read_rtab("~/Downloads/inputfiles/gene_presence_absence.Rtab")

dim(pa)
length(tree$tip.label)

all(tree$tip.label %in% rownames(pa))

This should return 'TRUE'

Thank you very much for these additional tips/tricks/validation steps. I think this has allowed me to spot the problem, just not certain why it is a problem. This is what I did:
R code start*****

setwd("F:\DNAwork\Scohnii\roaryout")
tree <- read.tree("accessory_binary_genes.fa.newick")
pa <- read_rtab("gene_presence_absence.Rtab")

[1mindexing[0m [34mgene_presence_absence.Rtab[0m [========================================] [32m2.15GB/s[0m, eta: [36m 0s[0m

Rows: 15999 Columns: 107
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (106): 073AN, 1637, 1638, 1649, 1658, 1659, 1661, 1663, 1664, 1665, 1666, 1667, 1697, 170...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

dim(pa)
[1] 106 15999
length(tree$tip.label)
[1] 106
all(tree$tip.label %in% rownames(pa))
[1] TRUE

*R code stop
So it appears that the data in pa and tree now match when they did NOT before. So I went back and followed the steps from the instructions.

**Rcode starts

rtab.file.name <- system.file("extdata", "gene_presence_absence.Rtab", package = "panstripe")
pa <- read_rtab(rtab.file.name)

[1mindexing[0m [34mgene_presence_absence.Rtab[0m [========================================] [32m2.15GB/s[0m, eta: [36m 0s[0m

Rows: 2521 Columns: 51
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (50): t36, t39, t37, t3, t44, t4, t20, t18, t25, t32, t50, t17, t10, t13, t14, t15, t29, ...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

dim(pa)
[1] 50 2521

R code stops
So what we see is that now pa does NOT match tree, because only 50 columns were read instead of 106. I am only a novice at R but there is something strange going on by using an intermediate variable for the rtab file name. So then I tried the same recommended procedure for the tree file using a variable.
R code starts

phylo.file.name <- system.file("extdata", "accessory_binary_genes.fa.newick", package = "panstripe")
tree <- read.tree(phylo.file.name)
1:
Warning message:
In read.tree(phylo.file.name) : empty character string.

*R code stops
So the line 1: appeared with the cursor tabbed in some spaces and the mouse pointer turned into an hourglass. I had to hit enter and then it produced the warning. I then set tree back to the command that reads correctly and ran fit, but it doesn’t look good for me.
**R code starts

tree <- read.tree("accessory_binary_genes.fa.newick")

pa <- read_rtab("gene_presence_absence.Rtab")

[1mindexing[0m [34mgene_presence_absence.Rtab[0m [========================================] [32m2.15GB/s[0m, eta: [36m 0s[0m

Rows: 15999 Columns: 107
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Gene
dbl (106): 073AN, 1637, 1638, 1649, 1658, 1659, 1661, 1663, 1664, 1665, 1666, 1667, 1697, 170...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

all(tree$tip.label %in% rownames(pa))
[1] TRUE
fit <- panstripe(pa, tree)
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
3: glm.fit: algorithm did not converge
4: glm.fit: algorithm did not converge
5: glm.fit: algorithm did not converge
6: glm.fit: algorithm did not converge
7: glm.fit: algorithm did not converge
8: glm.fit: algorithm did not converge
9: glm.fit: algorithm did not converge
10: glm.fit: algorithm did not converge
11: glm.fit: algorithm did not converge
12: glm.fit: algorithm did not converge
13: glm.fit: algorithm did not converge
14: glm.fit: algorithm did not converge
15: glm.fit: algorithm did not converge
16: glm.fit: algorithm did not converge
17: glm.fit: algorithm did not converge
18: glm.fit: algorithm did not converge
19: glm.fit: algorithm did not converge
20: glm.fit: algorithm did not converge
21: glm.fit: algorithm did not converge
22: glm.fit: algorithm did not converge
23: glm.fit: algorithm did not converge
24: glm.fit: algorithm did not converge
25: glm.fit: algorithm did not converge
26: glm.fit: algorithm did not converge
27: glm.fit: algorithm did not converge
28: glm.fit: algorithm did not converge
29: glm.fit: algorithm did not converge
30: glm.fit: algorithm did not converge
31: glm.fit: algorithm did not converge
32: glm.fit: algorithm did not converge
33: glm.fit: algorithm did not converge
34: glm.fit: algorithm did not converge
35: glm.fit: algorithm did not converge
36: glm.fit: algorithm did not converge
37: glm.fit: algorithm did not converge
38: glm.fit: algorithm did not converge
39: glm.fit: algorithm did not converge
40: glm.fit: algorithm did not converge
41: glm.fit: algorithm did not converge
42: glm.fit: algorithm did not converge
43: glm.fit: algorithm did not converge
44: glm.fit: algorithm did not converge
45: glm.fit: algorithm did not converge
46: glm.fit: algorithm did not converge
47: glm.fit: algorithm did not converge
48: glm.fit: algorithm did not converge
49: glm.fit: algorithm did not converge
50: glm.fit: algorithm did not converge

R code stops***

Apologies, I missed your reply.

It looks like you are running the test data rather than your dataset in this example.

This line is only needed in the example and simply finds that path of the example tree on your system

phylo.file.name <- system.file("extdata", "accessory_binary_genes.fa.newick", package = "panstripe")

Instead, you should replace phylo.file.name with the path to you phylogeny on your computer.

Hi, sorry I missed the later section of your output.
It looks like the algorithm is having trouble converging. As outlined in the documentation, this can sometimes be resolved by using a different family in the regression.

You could try running

panstripe(pa, tree, family='gaussian')

It might also be caused by issues with the phylogeny including very short branch lengths.

Hi,

First of all, thanks for the tool. The paper was eye opening about some concepts I had about pangenomes.

I just wanted to add that I had the same problem when trying to open my phylogeny file. I solved it by directly applying the path of my file. So that seemed to solve that issue.

However, I now have a problem about the amount of taxa my tree had compared with my presence absence matrix. I was wondering how this should be solved? I know for a fact that my tree has very short branches since there's no much diversity in the species I'm looking at.

Thanks for any solution you could give me.

P.S: I didn't knew if I had to open a new issue, but I decided to post it here in case someone else was wondering the same question as me.

Hi,

It is usually better to open a new issue for every question as it's easy for me to miss comments on older issues.

This might be something to do with your tree construction algorithm. Some of them will filter identical sequences. I would suggest checking that your tree has the tip names you expect by running

tree <- read.tree("path/to/your/tree.newick")
tree$tip.labels

@drhoads yes sometimes the long branch lengths can cause issues. Panstripe includes an option to automatically filter branches below a specified depth that can be used to check this.

panstripe(pa, tree, min_depth=X)