`cooccurrence` group argument not working properly
kollmi opened this issue · 5 comments
Hello,
I am trying to create a cooccurrence table with columns `doc_id`, `term1`, `term2`, and `cooc`. Using the sample data, the `group` argument fails to create a `doc_id` column.
```r
> data("brussels_reviews_anno")
> x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
> x <- cooccurrence(x, group = "doc_id", term = "lemma")
> head(x)
        term1       term2 cooc
1 appartement      sejour  199
2    agreable appartement  178
3 appartement         bon  157
4     accueil appartement  103
5    agreable      sejour  102
6 appartement    quartier  101
```
However, when converting the annotated data frame to a `data.table` and then grouping using `by`, I get the desired result:
```r
> x <- as.data.table(brussels_reviews_anno)
> x <- subset(x, language == "nl" & xpos %in% c("NN"))
> x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
> head(x)
     doc_id       term1    term2 cooc
1: 19991431        plek centraal    1
2: 19991431    centraal  centrum    1
3: 19991431     centrum  brussel    1
4: 19991431     brussel    adres    1
5: 19991431       adres  brussel    1
6: 21054450 appartement  locatie    1
```
I am fine with doing this workaround for now, but think it would flow nicely if the argument worked with data frames.
Specs:
Package version 0.8.6
R version 4.0.3 (2020-10-10)
Thanks in advance!
Thanks for the remark. The function was explicitly set up to be used like this if you need that data at another level. Good that you found that out.
Sometimes you just want the aggregate over all documents while making sure the cooccurrences are calculated within a document (your first example); sometimes you want it within a group, like you did. Both are possible.
Note that there are differences. See the docs: `?cooccurrence`
```r
library(udpipe)
library(data.table)
data("brussels_reviews_anno")
x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
## sum of cooccurrences within documents - all words, no matter where they are in the document
x <- cooccurrence(x, group = "doc_id", term = "lemma")

x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
x <- setDT(x)
## sum of cooccurrences within a sentence - all words, no matter where they are in the sentence
x[, cooccurrence(.SD, term = "lemma", group = "sentence_id"), by = list(doc_id)]
## cooccurrence of words following one another
x[, cooccurrence(lemma, skipgram = 0), by = list(doc_id)]
```
I'm a bit confused. If I have a data frame and I want to group it by `doc_id` and return the `doc_id` column (along with the `term1`, `term2`, and `cooc` columns), then shouldn't I be able to use `cooccurrence(x, group = "doc_id", term = "lemma")`? Right now the function does not allow for that, as shown in my first example. Is the only way to get the `doc_id` column in the output by doing the workaround I illustrated in the second example?
See the examples above. It really depends on what you want to compute: whether the order of the words matters or not.
Thanks for the examples, they definitely cleared up my confusion. It makes sense that `x <- cooccurrence(x, group = "doc_id", term = "lemma")` was indeed computing cooccurrences by document, but then summing them across all of the documents. As a result, the output no longer had a `doc_id` column.
My initial assumption was that the `group` argument as part of the input would also return the `group` column in the output, but I can see the package was designed a little differently.
Indeed, the group argument is to make sure the cooccurrences are not calculated over different documents but within a document and afterwards aggregated.
Good that this is cleared up.
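For anyone landing here later, the two behaviours discussed in this thread can be sketched on a tiny data frame. This is only an illustration under stated assumptions: the `toy` data is made up, and the base R `split()`/`lapply()` pattern is a possible alternative to the `data.table` workaround above, not something the package prescribes.

```r
library(udpipe)

# Hypothetical toy data (not from the thread): three documents, two lemmas each
toy <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3"),
  lemma  = c("appartement", "sejour",
             "appartement", "sejour",
             "appartement", "quartier"),
  stringsAsFactors = FALSE
)

# 1) group = "doc_id": pairs are counted within each document and then
#    summed over all documents, so the result carries no doc_id column
agg <- cooccurrence(toy, group = "doc_id", term = "lemma")
names(agg)

# 2) Per-document table without data.table: run cooccurrence() on each
#    document's lemma vector and attach the doc_id afterwards
per_doc <- lapply(split(toy, toy$doc_id), function(d) {
  out <- as.data.frame(cooccurrence(d$lemma, order = FALSE))
  out$doc_id <- rep(d$doc_id[1], nrow(out))
  out[, c("doc_id", "term1", "term2", "cooc")]
})
per_doc <- do.call(rbind, per_doc)
rownames(per_doc) <- NULL
per_doc
```

Note that the two calls do not measure the same thing: the data frame method counts all pairs within a document regardless of position, while the character method with the default `skipgram = 0` only counts words that follow one another.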