statsmaths/cleanNLP

Error in cbind_all(x) : Argument 2 must be length 514, not 361

amyhuntington opened this issue · 5 comments

Hello!

I've been following your State of the Union vignette closely and have really enjoyed what you've created!

I found this discussion: #30 and have utilized your suggestions; however, I too am having trouble with the PCA part of the analysis and cannot find a good solution.

Here's the code:

pca <- cnlp_get_token(spacy_annotation) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf",
                 tf_weight = "dnorm") %>%
  cnlp_pca(cnlp_get_document(spacy_annotation))

Here's the error:

Error in cbind_all(x) : Argument 2 must be length 514, not 361

I've searched for a cbind_all solution but am coming up short. 514 is the original number of rows in spacy_annotation$document; the TF-IDF matrix is 361 × 15.

What should I do?

Thanks!!

I should add that I am not using the Obama/SOTU data; I am using my own data. I suspect the issue comes from the fact that not every id/document contains an NN or NNS token, resulting in fewer documents than the original. This seems like it's going to be a fairly common use case, though.
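For what it's worth, a quick check along these lines shows which documents drop out entirely (a sketch; the column names match my token table above):

library(dplyr)

# Ids of documents that still have at least one NN/NNS token:
kept <- cnlp_get_token(spacy_annotation) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  distinct(id)

# Documents that lose all of their tokens to the filter:
setdiff(cnlp_get_document(spacy_annotation)$id, kept$id)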

Thanks again.

You are completely correct that the issue comes from the fact that not all of
your documents are included after the filtering, so the TF-IDF matrix does not
have enough rows. You're also correct that this is a common problem; however,
there is not a particularly clean way of dealing with it, given the way that
data are being handled in cleanNLP at the moment.

For your specific case, here is a minimal working example in which one document
is removed by the filter command, along with how to deal with it:

library(cleanNLP)
library(dplyr)

docs <- c("Hello here is simple example.", "Same here!",
          "See here too.")

cnlp_init_spacy()
spacy_annotation <- cnlp_annotate(docs)

# "Same here!" contains no nouns or verbs, so it is dropped by the filter:
tfidf <- cnlp_get_token(spacy_annotation) %>%
  filter(upos %in% c("VERB", "NOUN")) %>%
  cnlp_utils_tfidf(min_df = -1, max_df = 2, type = "tfidf",
                   tf_weight = "dnorm")

# Restrict the document table to the ids that survived the filter, so it
# lines up with the rows of the TF-IDF matrix:
meta <- filter(cnlp_get_document(spacy_annotation), id %in% rownames(tfidf))
pca <- cnlp_utils_pca(tfidf, meta)
pca
# A tibble: 2 x 7
  id    time       version language   uri                             PC1   PC2
  <chr> <chr>      <chr>   <chr>      <chr>                         <dbl> <dbl>
1 doc1  2018-11-0… 2.0.11  en_core_w… /var/folders/9b/0fj3dzqd4l70… -1.22     0
2 doc3  2018-11-0… 2.0.11  en_core_w… /var/folders/9b/0fj3dzqd4l70…  1.22     0

Note that I am using the version from GitHub (2.0.4), not the one on CRAN. A
few of the functions have been updated and you may need to update to get this
example working.
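If you are on the CRAN release, the development version installs with devtools in the usual way (assuming you have devtools available):

# Install the development version of cleanNLP from GitHub:
devtools::install_github("statsmaths/cleanNLP")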

This is FANTASTIC and pretty simple to work through. I really appreciate your help and attention on this.

I have paired your cleanNLP approach with sentimentr. The results are incredibly illuminating and I'm at the stage where I'd like to train spaCy for NER specific to the population's common entities.
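For anyone curious, the pairing is roughly this (a sketch against the example docs above; the doc-id construction is my own assumption, matching the doc1, doc2, ... ids that cleanNLP assigns):

library(dplyr)
library(sentimentr)

# Average sentiment per raw document (element_id indexes into docs):
sent <- sentiment_by(get_sentences(docs))

# Rebuild ids in cleanNLP's doc1, doc2, ... style (an assumption based on
# the example output above), then attach the scores to the PCA table:
scores <- tibble(id = paste0("doc", sent$element_id),
                 sentiment = sent$ave_sentiment)
left_join(pca, scores, by = "id")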

If I train a spaCy model in Python, do you anticipate any issues calling it (by name in the initialization step, I assume) once it's ready to be used in cleanNLP?

I know this is off topic; happy to start another thread. Have you trained any spaCy models and called them with cleanNLP?

No, I actually haven't tried to do that with cleanNLP. It should work fine though. Please open a new issue if you try it and it causes you trouble. I'm curious how it goes!
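Untested, but my guess is it would look something like this (treat the model_name value as an assumption; I haven't verified that a path to a model saved with nlp.to_disk() loads cleanly):

library(cleanNLP)

# Hypothetical: point the spaCy backend at a custom model trained in Python
# and saved with nlp.to_disk("/path/to/custom_model"). Assumes model_name
# is forwarded to spacy.load(); untested.
cnlp_init_spacy(model_name = "/path/to/custom_model")
annotation <- cnlp_annotate(docs)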

@amyhuntington Could you share your workflow, please? Perhaps share it on GitHub?