Consider manual screening interface (i.e. text only) #12

Open
mjwestgate opened this issue Jul 6, 2018 · 4 comments

Comments

@mjwestgate
Owner

revtools provides tools for visualising topic model information, but some users may wish (or be required) to sort articles based on titles or abstracts without any visual information. A user interface for this would be simple to build, and would support a wider range of users.
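For illustration, a bare-bones console loop might look something like this (a sketch only, not package code; it assumes the data.frame returned by read_bibliography() with title and abstract columns):

# Rough sketch only: a text-only screening loop in base R.
# Assumes `x` is a data.frame with `title` and `abstract` columns.
screen_text <- function(x) {
  x$selected <- NA
  for (i in seq_len(nrow(x))) {
    cat("\n[", i, "/", nrow(x), "] ", x$title[i], "\n", sep = "")
    if (readline("Show abstract? (y/n): ") == "y") {
      cat(x$abstract[i], "\n")
    }
    ans <- readline("Relevant? (y/n/q to quit): ")
    if (ans == "q") break
    x$selected[i] <- ans == "y"
  }
  x
}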

@aornugent

aornugent commented Aug 17, 2018

Great talk yesterday Martin! This use case was exactly what sprang to mind.

I wrote a quick function to aggregate and rank documents by similarity.

doc_rank <- function(lda, dtm, select = c(1), method = "term") {

  # Coerce to a dense matrix in case a DocumentTermMatrix is supplied
  dtm <- as.matrix(dtm)

  # Combine selected documents into a single pooled reference
  ngroup <- length(select)
  if (ngroup > 1) {
    group <- colSums(dtm[select, ])
    dtm[select, ] <- rep(group, each = ngroup)
  }

  # Back-transform LDA coefficients (topicmodels stores beta on the log scale)
  beta <- exp(lda@beta)

  # Weight docs by topic, or by term x topic
  if (method == "topic") {
    x <- dtm %*% t(beta)
  } else {
    # Scale each column of beta (one term) by that document's term count
    w <- apply(dtm, 1, function(d) beta * rep(d, each = nrow(beta)))
    x <- t(w)
  }

  # Calculate cosine dissimilarity between documents
  c_dis <- 1 - (x %*% t(x)) / sqrt(rowSums(x^2) %*% t(rowSums(x^2)))

  # Normalise across docs for symmetrical ranking (?desirable)
  d <- as.matrix(dist(c_dis))

  # Use first selected doc as reference point
  ref <- select[1]

  # Rank documents by distance to the reference
  doc_list <- data.frame(doc_id = seq_len(nrow(dtm)), rank = rank(d[ref, ]))
  doc_list[order(doc_list$rank), ]
}

With a little tweaking to refine the action loop, a typical workflow might be:

Screen title and authors -> Read abstract -> Mark if relevant -> Sort document list

which should hopefully bubble the relevant papers to the top.

library(revtools)

file_location <- system.file("extdata",
  "avian_ecology_bibliography.ris",
  package="revtools")

x <- read_bibliography(file_location)

d <- make_DTM(x)
l <- run_LDA(d)

# Doc 6 is the most similar to 1, Doc 16 the least.
doc_rank(l, d, c(1))

# But if I like Doc 16, I should read Doc 9 next.
doc_rank(l, d, c(16))
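
An untested sketch of how that action loop might chain together, reusing doc_rank() and the objects above (relevant, seen and next_doc are names I've made up):

# Untested sketch: grow the set of relevant docs, re-rank,
# and read the top unseen document next.
relevant <- c(1)                     # seed with one known-relevant doc
seen <- relevant

for (i in 1:3) {                     # a few screening rounds
  ranking <- doc_rank(l, d, relevant)
  next_doc <- ranking$doc_id[!(ranking$doc_id %in% seen)][1]
  cat("Read next:", x$title[next_doc], "\n")
  seen <- c(seen, next_doc)
  relevant <- c(relevant, next_doc)  # if the user marks it relevant
}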

@mjwestgate
Owner Author

Thanks Andrew, I'm glad you liked the talk! This is a great idea; my only caveats are how to:

  1. update this as the user selects more and more articles, and
  2. avoid biasing the user away from relevant research that uses different keywords

At the moment, my plan is to add a neural network-based method for prioritising articles in screen_titles or screen_abstracts, probably based on the approach of Roll et al. 2017 (https://onlinelibrary.wiley.com/doi/abs/10.1111/cobi.13044). But that won't be in v0.3.0, as I don't have time to test it right now!

Thanks heaps for the code too - this is a really good start that will help me out a lot.

@aornugent

aornugent commented Aug 21, 2018

No problem, I was mostly just playing:

  1. The first block (# Combine selected documents) treats all selected documents as a single reference point. So you'd just update after every selection, or have a button to re-sort.

  2. This is harder. Back-transforming the weights means that documents aren't strongly penalised for having a term that isn't associated with a topic (beta ~ 0, instead of log_beta ~ -9; you could switch this if you wanted different behaviour). Pooling documents should capture a more diverse vocabulary as you progress, and the overall similarity would tend towards the words different documents have in common. Serendipity is difficult to code.
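
For a sense of scale (illustrative numbers only):

# Illustrative only: the same topic-term weight on the log vs. natural scale
log_beta <- c(-0.5, -9)  # term associated vs. not associated with a topic
exp(log_beta)            # ~0.61 vs ~0.0001: the unassociated term all but vanishes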

But this is far from tested! It'd be interesting to think about how you'd validate it.

edit: I wonder if you could subtract irrelevant documents from the reference group? Not sure what that'd look like, but it might help narrow the search in a more granular manner. A rough sketch below.
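
Something like this, maybe (pure speculation; relevant and irrelevant are assumed index vectors, and dtm a plain matrix):

# Pure speculation: down-weight the pooled reference by the term counts
# of documents marked irrelevant, flooring at zero so counts stay valid.
group <- colSums(dtm[relevant, , drop = FALSE]) -
  colSums(dtm[irrelevant, , drop = FALSE])
group <- pmax(group, 0)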

@befriendabacterium

FYI, metagear's abstract screener does this already, albeit in a somewhat fiddly and inflexible way. Just flagging it to avoid duplicating that function.
