prodriguezsosa/EmbeddingRegression

Estimate transformation matrix

odanedre opened this issue · 12 comments

Hi!

I am interested in modelling parliamentary speech using word embeddings and your awesome conText framework. To do so I need the transformation matrix.

However, when I run the compute_transform function I get the following error:

transformation_matrix <- compute_transform(cr_fcm, pre_trained, vocab, weighting = 500)


Error in filter(., term_count >= weighting) : 
  object 'term_count' not found
In addition: Warning message:
In data.matrix(data) : NAs introduced by coercion

cr_fcm is an fcm created from the congressional speeches provided in your DB folder, pre_trained is the GloVe embeddings, and vocab is the vocabulary created using text2vec's create_vocabulary (as specified in the compute_transform documentation).

I checked the source code and it appears that the function is unable to find the column 'term_count' in the vocab, even though the 'term_count' variable is in vocab when I check.

Any advice regarding how to solve this?

Thanks!
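Editor's sketch of the filtering step named in the error message: the `data.matrix` warning above is what one would expect if `stats::filter` (which coerces its input to a time series) were being called instead of `dplyr::filter`. That reading is an assumption; the thread does not confirm the cause. The toy `vocab` below is hypothetical and only mimics the `term` and `term_count` columns that text2vec's create_vocabulary returns.

```r
# Hypothetical stand-in for a text2vec vocabulary (values are made up).
vocab <- data.frame(
  term = c("immigration", "tariff", "senate"),
  term_count = c(600, 20, 501),
  stringsAsFactors = FALSE
)
weighting <- 500

# Namespace-qualified call: cannot be confused with stats::filter(),
# regardless of which packages are attached.
kept <- dplyr::filter(vocab, term_count >= weighting)

# Base R equivalent that avoids the name clash altogether.
kept_base <- vocab[vocab$term_count >= weighting, ]
```

Both calls keep only the terms whose count is at least `weighting`.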

Hi @odanedre, thanks for trying out conText. To use compute_transform you'd want to first estimate a full GloVe embeddings model on your local corpus, then pass those embeddings as the pre-trained embeddings to compute_transform, together with an fcm specific to your corpus. I'll be adding another quick guide on this over the next few days (and will check that bug you found). However, I suggest you use the transformation matrix provided in our DB (KhodakA.rds); it is optimized for the GloVe pre-trained embeddings in that same folder and should work well on the parliamentary speeches corpus. Take a look at the quickstart guide for how to proceed, and if you run into issues, do let us know.

Thank you for your quick reply. I am able to run the code in the quickstart guide, but ran into problems when I was attempting to do a similar analysis on another dataset and needed to estimate the transformation matrix for that. To check whether it was something with my data, I tried replicating the transformation matrix in your DB with the congressional data, but ran into this problem. I'll check out the new quick guide in a couple of days - thanks!

Got you. Do keep in mind that the transformation matrix we use in our quick guide is not specific to the Congressional Record corpus; indeed, we also use the same transformation matrix in the example with parliamentary speeches (the one on the meaning of empire). Unless you have a very distinctive corpus (one for which the GloVe pre-trained embeddings just don't make sense), I'd suggest using the provided transformation matrix (khodakA.rds); that is, there is no need to estimate your own.

@odanedre we pushed a fix for the bug you reported along with a quick start guide to estimate a corpus specific transformation matrix. Hope this helps.

Great! I'm using a non-English parliamentary speech corpus, so I have to train the model from scratch.
Now the filtering works perfectly. However, there is another issue related to transposing the context_embeddings matrix. When I run the source directly it works fine, but not inside the function. Could it be related to this issue? https://stackoverflow.com/questions/17580935/when-writing-an-r-package-that-uses-the-matrix-package-why-do-i-have-to-specify

Yes, in all likelihood it is related to that issue. On it.
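Editor's sketch of the kind of problem the linked Stack Overflow question describes: inside a package, a sparse matrix from the Matrix package is only transposed correctly if the Matrix method for `t` is visible, for example through a namespace-qualified call. The matrix below is a hypothetical stand-in for context_embeddings, and this is not the actual conText fix, which the thread does not show.

```r
# Hypothetical 2 x 3 sparse matrix standing in for context_embeddings.
ctx <- Matrix::sparseMatrix(
  i = c(1, 2, 2),
  j = c(1, 2, 3),
  x = c(0.5, 1.0, 2.0),
  dims = c(2, 3)
)

# Qualifying the call picks up the Matrix transpose method even when
# the Matrix package is loaded but not attached.
ctx_t <- Matrix::t(ctx)
```

In a package, the equivalent is usually to import the generic from Matrix in the NAMESPACE file rather than to qualify every call.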

I'm using a non-English parliamentary speech corpus...

Yes, that indeed justifies estimating your own embeddings and transformation matrix :)

Alright, should be fixed now @odanedre, it was indeed the issue you suggested. Do let me know if the bug persists after you re-install or if you run into any other issues. Thanks!

Great! The function seems to work fine now. However, when I try to run the get_context function I now get the error below:

This happens both when estimating the transformation matrix myself, and when loading the transformation matrix provided in the DB (following the quick guide).

> contextR <- get_context(x = cr_corpus$speech[cr_corpus$party == 'R'], target = 'immigration', 
+                         window = 6, valuetype = "fixed", case_insensitive = TRUE, 
+                         hard_cut = FALSE, verbose = FALSE)
Error: Can't reconstruct data frame.
x The `[` method for class <kwic/data.frame> must return a data frame with 4 columns.
i It returned a <kwic/data.frame> of 7 columns.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
'kwic.character()' is deprecated. Use 'tokens()' first. 
> rlang::last_error()
<error/rlang_error>
Can't reconstruct data frame.
x The `[` method for class <kwic/data.frame> must return a data frame with 4 columns.
i It returned a <kwic/data.frame> of 7 columns.
Backtrace:
 1. conText::get_context(...)
 7. dplyr:::select.data.frame(., docname, keyword, pre, post)
 8. dplyr:::dplyr_col_select(.data, loc, names(loc))
Run `rlang::last_trace()` to see the full context.

Strange. I can't replicate that error. What version of Quanteda are you using? Can you run sessionInfo() and share the output?

Here's the output:


> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Norwegian Bokmål_Norway.1252  LC_CTYPE=Norwegian Bokmål_Norway.1252    LC_MONETARY=Norwegian Bokmål_Norway.1252 LC_NUMERIC=C                            
[5] LC_TIME=Norwegian Bokmål_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.3     text2vec_0.6      quanteda_3.0.0    conText_0.1.0     lubridate_1.7.4   dplyr_1.0.5       stringr_1.4.0     data.table_1.14.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           lattice_0.20-38      prettyunits_1.1.1    ps_1.5.0             assertthat_0.2.1     rprojroot_2.0.2      digest_0.6.27        RhpcBLASctl_0.20-137
 [9] utf8_1.2.1           R6_2.5.0             pillar_1.5.1         rlang_0.4.10         curl_4.3             rstudioapi_0.13      callr_3.5.1          Matrix_1.2-18       
[17] desc_1.2.0           devtools_2.3.2       munsell_0.5.0        compiler_3.6.1       pkgconfig_2.0.3      pkgbuild_1.2.0       tidyselect_1.1.0     tibble_3.1.0        
[25] lgr_0.4.2            fansi_0.4.2          crayon_1.4.1         withr_2.4.1          grid_3.6.1           gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.0           
[33] magrittr_2.0.1       scales_1.1.1         RcppParallel_5.0.3   cli_2.3.1            stringi_1.5.3        fs_1.5.0             remotes_2.2.0        testthat_3.0.1      
[41] ellipsis_0.3.1       stopwords_2.2        generics_0.1.0       vctrs_0.3.6          fastmatch_1.1-0      tools_3.6.1          float_0.2-4          glue_1.4.2          
[49] mlapi_0.1.0          purrr_0.3.4          processx_3.4.5       pkgload_1.1.0        colorspace_2.0-0     sessioninfo_1.1.1    rsparse_0.4.0        memoise_1.1.0       
[57] usethis_2.0.0  

Thanks. I'm pretty sure it's related to an incompatibility with the new release of quanteda. According to their GitHub:

kwic(): As of version 3, only tokens objects are supported as inputs to kwic(). Calling kwic() for character or corpus objects is still functional, but issues a warning. Passing arguments to tokens() via ... in kwic() is now disabled. Users should now create a tokens object (using tokens() from character or corpus inputs) before calling kwic().

This will definitely create an issue for conText.
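Editor's sketch of the usage the quanteda note asks for: tokenize first, then call kwic() on the tokens object. The two example speeches are made up, and converting the result to a plain data frame before selecting columns is an assumption about how to sidestep the "must return a data frame with 4 columns" error; it is not the fix that went into conText.

```r
library(quanteda)

# Hypothetical mini corpus; names become document names.
speeches <- c(
  d1 = "We must reform immigration policy now.",
  d2 = "The immigration debate continues in the chamber."
)

# Build the tokens object explicitly instead of passing raw text to kwic().
toks <- tokens(speeches)
kw <- kwic(toks, pattern = "immigration", window = 6,
           valuetype = "fixed", case_insensitive = TRUE)

# Drop the kwic class before subsetting so `[` behaves like a plain data frame.
ctx <- as.data.frame(kw)[, c("docname", "keyword", "pre", "post")]
```

Each row of `ctx` then holds one occurrence of the target word with its left and right context.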

I've added this to our to-do list.

For an immediate solution I'd suggest using an older version of quanteda. Take a look at packrat to manage the dependencies of this project and avoid messing up any of your other scripts.
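Editor's sketch of the workaround: pin an older quanteda inside an isolated project library. The version number `2.1.2` is an assumption (any 2.x release that predates the kwic() change should behave the same way), so check which versions are available before running this. `packrat::init()` and `remotes::install_version()` are the only calls used.

```r
# Set-up fragment, meant to be run once from the project directory.

# Give the project its own private package library so other scripts
# keep using the system-wide quanteda.
packrat::init()

# Install a pre-3.0 quanteda into that private library.
# NOTE: "2.1.2" is an assumed version number, not taken from the thread.
remotes::install_version("quanteda", version = "2.1.2")

# Confirm which version the project now sees.
packageVersion("quanteda")
```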

I see. I'll use an older version of quanteda in the meantime then. Thank you for this. Looking forward to following your work and starting to use conText!