Error running phyloglm, number of species
Closed this issue · 8 comments
Dear @lamho86 and @cecileane , I'm trying to use phylolm::phyloglm
to run a phylogenetic logistic regression since yesterday, but I keep getting these error messages saying that the number of rows (species) in the data and in the phylogeny don't match. However, they do have the same number of species, and I was careful enough to get the rows of the data frame in the same order of the tip.labels, to match the function example. I really don't know what can be wrong. Please, see below the code chunk:
> phylogeny
Phylogenetic tree with 800 tips and 799 internal nodes.
Tip labels:
Neoromicia_roseveari, Neoromicia_brunneus, Neoromicia_tenuipinnis, Neoromicia_rendalli, Neoromicia_nanus, Laephotis_wintoni, ...
Rooted; includes branch lengths.
> dim(dados)
[1] 800 11
> head(dados)
Familia Presenca.ecolocalizacao Ref.ecol
Neoromicia_roseveari Vespertilionidae SIM NA
Neoromicia_brunneus Vespertilionidae SIM NA
Neoromicia_tenuipinnis Vespertilionidae SIM NA
Neoromicia_rendalli Vespertilionidae SIM NA
Neoromicia_nanus Vespertilionidae SIM NA
Laephotis_wintoni Vespertilionidae SIM NA
Atvidade.Circadiana Ref.Atv
Neoromicia_roseveari <NA> <NA>
Neoromicia_brunneus NOTURNO <NA>
Neoromicia_tenuipinnis NOTURNO EltonTraitsDataBase
Neoromicia_rendalli NOTURNO EltonTraitsDataBase
Neoromicia_nanus NOTURNO <NA>
Laephotis_wintoni NOTURNO EltonTraitsDataBase
Dieta Ref.Diet
Neoromicia_roseveari INSETIVORO <NA>
Neoromicia_brunneus INSETIVORO <NA>
Neoromicia_tenuipinnis INSETIVORO EltonTraitsDataBase
Neoromicia_rendalli INSETIVORO EltonTraitsDataBase
Neoromicia_nanus INSETIVORO <NA>
Laephotis_wintoni INSETIVORO EltonTraitsDataBase
Tipo.ecolocalizacao Ref.tipoEco Diet_frug_disc
Neoromicia_roseveari LARINGE NA NAOFRUGIVORO
Neoromicia_brunneus LARINGE NA NAOFRUGIVORO
Neoromicia_tenuipinnis LARINGE NA NAOFRUGIVORO
Neoromicia_rendalli LARINGE NA NAOFRUGIVORO
Neoromicia_nanus LARINGE NA NAOFRUGIVORO
Laephotis_wintoni LARINGE NA NAOFRUGIVORO
Ati_circad_disc
Neoromicia_roseveari <NA>
Neoromicia_brunneus NOTURNO
Neoromicia_tenuipinnis NOTURNO
Neoromicia_rendalli NOTURNO
Neoromicia_nanus NOTURNO
Laephotis_wintoni NOTURNO
> eco_bat <- phyloglm(Presenca.ecolocalizacao~Diet_frug_disc, data=dados, phy=phylogeny, method = "logistic_IG10", boot=100)
Error in phyloglm(Presenca.ecolocalizacao ~ Diet_frug_disc, data = dados, :
the number of rows in the data does not match the number of tips in the tree.
Thank you in advance
Merci beaucoup pour votre attention
Diogo
P.S. My phylogeny is not ultrametric
> is.ultrametric(tre)
[1] FALSE
But the one from the example isn't either. I really don't know what's happening.
It might be from some non-ASCII character that creeped into a taxon name, either in the trait data, or in the tree. The difference could be invisible by eye, but R would still see it...
Hi Diogo, please feel free to send your files offline (tree file and and part of trait file --just 1 column, say). I suspect that it's caused by space or non-ASCII characters, but I can't really help without being able to reproduce the error. Another possibility is that some species name is spelled differently in the tree and in the data file. To check, you could do this:
setdiff(rownames(dados), phylogeny$tip.label)
to see which taxa have trait data but are not in your tree.
And to see which taxa are in your tree but are not in your trait data:
setdiff(phylogeny$tip.label, rownames(dados))
Regardless of where the problem comes from, knowing which taxa do not match could be helpful.
Hi @cecileane , I'm sorry for getting so late to this.
Thank you very much for the suggestion. I'll try it here and if it doesn't work properly, I'll send you a shortened version of the data and the code I'm using.
All the best,
Diogo
Yes, we just checked for any difference in species names in the phylogeny and trait data table, but everything looked fine. The code
setdiff(phylogeny$tip.label, rownames(dados))
returned
NULL
We did other analysis with the same dataset (e.g., ancestral state estimation) and nothing happened. We'll arrange the table to send you a shortened version of it.
Thanks again
this is strange: getting NULL
suggests that the first vector (or both vectors) is (are) empty. examples:
> setdiff(c(1,4,5), c(5,1,4,3))
numeric(0)
> setdiff(c(), c())
NULL
> setdiff(NULL, NULL)
NULL
> setdiff(c(1), NULL)
[1] 1
> setdiff(NULL, c(1))
NULL
Thank you @diogoprov for diagnosing this: the problem was due to some taxa missing data for some of the predictors. The entire data frame has the same taxa (as row names) as in the tree, but after ignoring taxa that were missing some data (response or predictor), the data frame reduced to taxa with no missing entry had a smaller taxon set than the tree. For others running into the same problem, here is a small example to see if this is an issue, and a quick fix.
first, set-up a small data set with no missing data, and everything goes well
tre1 = read.tree(text="(((((t19:0.952,t17:0.872):0.0803,(t18:0.605,(t16:0.74,t15:0.896):0.225):0.712):0.824,((t3:0.435,t5:0.575):0.515,((t11:0.362,(t2:0.749,t1:0.507):0.335):0.678,(t12:0.76,t10:0.964):0.862):0.741):0.294):0.119,((t4:0.267,(t14:0.196,t6:0.0196):0.176):0.837,t7:0.528):0.834):0.231,(((t8:0.00299,t13:0.361):0.591,t9:0.129):0.148,t20:0.0146):0.182);")
dat1 = data.frame(y = c(0,1,0,0,1,1,1,1,0,0,1,1,0,1,1,1,1,1,1,1),
x = c(0,1,0,0,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1))
rownames(dat1) = c("t19","t17","t18","t16","t15","t3","t5","t11","t2","t1","t12","t10","t4","t14","t6","t7","t8","t13","t9","t20")
fit = phyloglm(y~x, phy=tre1, data=dat1) # all good
summary(fit) # all good
next: mask some data points to introduce missing data.
dat2 = dat1
dat2$x[3] = NA # t18 will need to be ignored
phyloglm(y~x, phy=tre1, data=dat2)
which reproduces the error message that @diogoprov had:
Error in phyloglm(y ~ x, phy = tre2, data = dat2) :
the number of rows in the data does not match the number of tips in the tree.
Quick bug fix:
- first obtain the list of species that have one or more missing values
now
mf = model.frame(x~y, data=dat2) # model.frame removes rows with missing data, converts factors to dummy variables, etc. taxa_without_data = setdiff(tre1$tip.label, rownames(mf))
taxa_without_data
has the list of problematic taxa, that will need to be pruned off the tree:> taxa_without_data [1] "t18"
- second: take the problematic taxa off the tree.
tre2 = tre1 # good if there are no missing data if (length(taxa_without_data)>0) { tre2 = drop.tip(tre1, taxa_without_data) # drop taxa with missing data from the tree }
now phyloglm
should run fine, if the input tree has the problematic taxa pruned off:
phyloglm(y~x, phy=tre2, data=dat2) # all good, even though dat2 still has a row for "t18"
phyloglm(y~x, phy=tre2, data=dat2[-3,]) # all good, same results
Thanks a lot for confirming this.
Best,
Diogo