Consider manual screening interface (i.e. text only)
revtools provides tools for visualising topic model information, but some users may wish (or be required) to sort articles based on titles or abstracts without including any visual information. A user interface for this would be simple to build, and would provide support for a wider range of users.
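As a rough illustration of how simple this could be, something like the console loop below would cover the basics. This is only a sketch: screen_console() is a hypothetical name, not an existing revtools function, and it assumes the data.frame returned by read_bibliography() has title and abstract columns.

screen_console <- function(x) {
  # Hypothetical sketch: step through records, show the title,
  # optionally show the abstract, and record a decision.
  x$screened <- NA
  for (i in seq_len(nrow(x))) {
    cat("\n[", i, "/", nrow(x), "] ", x$title[i], "\n", sep = "")
    answer <- readline("(y)es / (n)o / (a)bstract / (q)uit: ")
    if (answer == "a") {
      cat("\n", x$abstract[i], "\n", sep = "")
      answer <- readline("(y)es / (n)o / (q)uit: ")
    }
    if (answer == "q") break
    x$screened[i] <- switch(answer, y = "selected", n = "excluded", NA)
  }
  x
}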
Great talk yesterday Martin! This use-case was exactly what sprang to mind.
I wrote a quick function to aggregate and rank documents by similarity.
doc_rank <- function(lda, dtm, select = c(1), method = "term") {
  # Work with a dense matrix (make_DTM may return a sparse DTM)
  dtm <- as.matrix(dtm)
  # Combine selected documents into a single reference group
  ngroup <- length(select)
  if (ngroup > 1) {
    group <- colSums(dtm[select, ])
    dtm[select, ] <- rep(group, each = ngroup)
  }
  # Back-transform LDA coefficients (lda@beta is on the log scale)
  beta <- exp(lda@beta)
  # Weight docs by topic, or by term x topic
  if (method == "topic") {
    x <- dtm %*% t(beta)
  } else {
    # Multiply each document's term counts by the per-topic term weights
    w <- apply(dtm, 1, function(x) as.vector(t(beta) * x))
    x <- t(w)
  }
  # Calculate cosine dissimilarity
  c_dis <- 1 - x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2)))
  # Normalise across docs for symmetrical ranking (?desirable)
  d <- as.matrix(dist(c_dis))
  # Use first selected doc as reference point
  ref <- select[1]
  # Rank documents by distance from the reference
  doc_list <- data.frame(doc_id = seq_len(nrow(dtm)), rank = rank(d[ref, ]))
  doc_list[order(doc_list$rank), ]
}
With a little tweaking to refine the action loop, a typical workflow might be:
Screen title, authors
-> Read abstract
-> Mark if relevant
-> Sort document list
which should hopefully bubble the relevant papers to the top.
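Untested, but a minimal version of that loop might look like the sketch below; screen_loop() is a made-up wrapper around doc_rank(), and it assumes x is the data.frame from read_bibliography() (for titles), with lda and dtm as above.

screen_loop <- function(lda, dtm, x, start = 1) {
  # Hypothetical sketch: re-rank after each relevance decision so the
  # documents most similar to the selected set bubble to the top.
  selected <- start
  seen <- start
  repeat {
    ranking <- doc_rank(lda, dtm, selected)
    candidates <- ranking$doc_id[!ranking$doc_id %in% seen]
    if (length(candidates) == 0) break
    next_doc <- candidates[1]
    cat("\n", x$title[next_doc], "\n", sep = "")
    answer <- readline("(m)ark relevant / (s)kip / (q)uit: ")
    if (answer == "q") break
    seen <- c(seen, next_doc)
    if (answer == "m") selected <- c(selected, next_doc)
  }
  selected
}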
library(revtools)
file_location <- system.file("extdata",
"avian_ecology_bibliography.ris",
package="revtools")
x <- read_bibliography(file_location)
d <- make_DTM(x)
l <- run_LDA(d)
# Doc 6 is the most similar to 1, Doc 16 the least.
doc_rank(l, d, c(1))
# But if I like Doc 16, I should read Doc 9 next.
doc_rank(l, d, c(16))
Thanks Andrew, I'm glad you liked the talk! This is a great idea; my only caveats are how to:
- update this as the user selects more and more articles, and
- avoid biasing the user away from relevant research that uses different keywords
At the moment, my plan is to add a neural network-based method for prioritising articles in screen_titles or screen_abstracts, probably based on the approach of Roll et al. 2017 (https://onlinelibrary.wiley.com/doi/abs/10.1111/cobi.13044). But that won't be in v0.3.0, as I don't have time to test it right now!
Thanks heaps for the code too - this is a really good start that will help me out a lot.
No problem, I was mostly just playing:
- The first block (# Combine selected documents) treats all selected documents as a single reference point, so you'd just update after every selection, or have a button to re-sort.
- This is harder. Back-transforming the weights means that documents aren't strongly penalised for having a term that isn't associated with a topic (beta ~ 0 instead of log_beta ~ -9; you could switch this if you wanted different behaviour). Pooling documents should capture a more diverse vocabulary as you progress, and the overall similarity would tend towards the words different documents had in common. Serendipity is difficult to code.
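Continuing the example above, each new selection just gets appended to select, so the pooled reference group grows as you screen:

# Pool docs 1 and 16 as the reference group; the ranking now
# reflects their combined vocabulary.
doc_rank(l, d, select = c(1, 16))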
But this is far from tested! It'd be interesting to think about how you'd validate it.
edit: I wonder if you could subtract irrelevant documents from the reference group? Not sure what that'd look like, but it might help narrow the search in a more granular manner.
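One speculative way to read that, assuming the same dtm as above (reference_counts() is a made-up helper, not part of doc_rank()):

reference_counts <- function(dtm, select, exclude = NULL) {
  # Speculative sketch: down-weight the reference group by the terms
  # that irrelevant documents use, flooring counts at zero.
  dtm <- as.matrix(dtm)
  group <- colSums(dtm[select, , drop = FALSE])
  if (!is.null(exclude)) {
    group <- pmax(group - colSums(dtm[exclude, , drop = FALSE]), 0)
  }
  group
}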
FYI, metagear's abstract screener does this already, albeit in a bit of a fiddly and inflexible way; just flagging it to avoid duplicating that functionality.