nicolewhite/RNeo4j

Fuzzy search

danielkrizian opened this issue · 6 comments

Hi Nicole,
Just curious: do you intend to develop a fuzzy full-text search function that retrieves a list of node matches in descending order based on some distance measure?
http://linkurio.us/ uses such a search on top of Neo4j, built on elasticsearch. There seems to be an elasticsearch package for R now. Maybe integrating that into RNeo4j would be worth considering.
Fuzzy search is quite a common exploration use case.
Thanks, Daniel

Hi Daniel,

That sounds like useful functionality, though I can't say it would be high on my to-do list. I would have to become more familiar with elasticsearch. This would also require support for legacy indexing, which is something I think I should add anyway. You can go ahead and do legacy indexing yourself using RCurl and the directions here: http://docs.neo4j.org/chunked/stable/rest-api-indexes.html

Keep me updated on any progress you make with this. It sounds very interesting and useful but I won't have the time in the near future to implement something like that. But I should be able to add legacy indexing pretty soon.

Nicole

Sure, I would pick that challenge up myself if my curl skills were up to par.
In the meantime, I've just come across two useful links showing curl queries, for any reader volunteering to implement this:
http://www.sinking.in/blog/seven-databases-neo4j-and-misunderstanding-indexes/
http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/

I managed to get the fuzzy search going. Here is a rough working prototype (unclean code; it should be generalized further):

searchNodes <- function(graph, pattern, label, fuzzy = TRUE) {
  fuzzy_factor <- function(x) {
    # Convenience function to set the Lucene similarity threshold from the
    # keyword length. Longer strings get a lower threshold (higher tolerance).
    above100 <- 10 * 10^(1:5)
    breaks <- c(0, 6, 10, 15, above100)
    factors <- c(.7, .6, .5, 0.6 - 0.1 * log10(above100))

    f <- cut(nchar(x), breaks = breaks, labels = factors)

    as.numeric(levels(f))[f]
  }

  # Properties to search.
  fields <- c("name", "name_long", "name_short", "name_official",
              "name_common", "bbg", "aliases")

  # Split the pattern on punctuation and whitespace; drop keywords of
  # three characters or fewer.
  keywords <- unlist(strsplit(pattern, "[[:punct:]]|[[:space:]]", perl = TRUE))
  keywords <- keywords[nchar(keywords) > 3]
  if (length(keywords) == 0) {
    stop("pattern must contain at least one keyword longer than 3 characters")
  }

  # Append Lucene's fuzzy operator (~similarity) only when fuzzy = TRUE.
  if (fuzzy) {
    keywords <- paste0(keywords, "~", fuzzy_factor(keywords))
  }
  terms <- paste(keywords, collapse = " AND ")

  lucene <- paste0(fields, ":(", terms, ")", collapse = " OR ")
  query <- sprintf("START n=node:node_auto_index('%s') 
                WHERE(n:%s) 
                RETURN n", lucene, label)

  getNodes(graph, query)
}


> searchNodes(graph, pattern="worlt", label="Geography")
[[1]]
Labels: OPERA Geography

$un_m.49
[1] "001"

$name
[1] "World"

$name_OPERA
[1] "Global"
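
To make it easier to see what the prototype sends to Lucene, the query-building step can be pulled out and run without a database. This is only a sketch: `build_lucene` is a hypothetical helper, not part of RNeo4j, and it assumes a single fixed similarity instead of the length-based `fuzzy_factor` above.

```r
# Hypothetical helper (not part of RNeo4j): builds only the Lucene query string.
# Assumption: one fixed similarity for every keyword, unlike fuzzy_factor().
build_lucene <- function(pattern, fields, similarity = 0.7) {
  # Split on punctuation and whitespace, keep keywords longer than 3 characters.
  keywords <- unlist(strsplit(pattern, "[[:punct:]]|[[:space:]]"))
  keywords <- keywords[nchar(keywords) > 3]
  stopifnot(length(keywords) > 0)

  # All keywords must match within a field; any field may match.
  terms <- paste0(keywords, "~", similarity, collapse = " AND ")
  paste0(fields, ":(", terms, ")", collapse = " OR ")
}

build_lucene("worlt", c("name", "aliases"))
# [1] "name:(worlt~0.7) OR aliases:(worlt~0.7)"
```

Printing the string this way is a quick check that the field list and keyword filter behave as intended before the query reaches Neo4j.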

This requires setting up a fulltext-type node_auto_index manually via REST beforehand.
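
For readers who want to try this, the index setup might look roughly like the sketch below. It assumes a Neo4j 2.x server with the legacy index REST endpoint under http://localhost:7474/db/data and the httr package installed; `index_payload` and `create_fulltext_index` are hypothetical names, and auto-indexing itself still has to be enabled in the server configuration, as described in the links above.

```r
# Hypothetical helpers (not part of RNeo4j).
# Assumption: Neo4j 2.x legacy index REST API, as described in the linked docs.

# JSON body asking for a Lucene fulltext index with the given name.
index_payload <- function(name = "node_auto_index") {
  sprintf('{"name":"%s","config":{"type":"fulltext","provider":"lucene"}}', name)
}

# Sends the request; needs the httr package and a running server.
create_fulltext_index <- function(base_url = "http://localhost:7474/db/data") {
  httr::POST(paste0(base_url, "/index/node/"),
             body = index_payload(),
             httr::content_type_json())
}

# Inspect the request body without contacting a server.
cat(index_payload(), "\n")
```

Only `index_payload()` runs offline; `create_fulltext_index()` is defined but not called here, since it depends on a live server.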

Daniel, that is awesome work. When I find some time I'll start playing around with it. Maybe we can do a pull request after some polishing.

The user interface could be generalized even further by dispatching from the existing generic getUniqueNode. The above searchNodes could then be an unexported internal function.

#' Usual use. This is the default exact match on the pattern
getUniqueNode(graph, "MyLabel", name="pattern") 

#' This is a fuzzy match on the "pattern" string with similarity factor 0.2 or better.
#' It retrieves the single most similar match (as ranked by the distance measure).
#' `~` is R's formula operator, which happily coincides with Lucene's fuzzy match operator.
#' If the function detects a formula passed to a property via dots, it dispatches to the fuzzy search function.
getUniqueNode(graph, "MyLabel", name="pattern" ~0.2)  
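
The formula detection proposed above could be sketched as follows. `parse_props` is a hypothetical helper used only for illustration; it is not an RNeo4j function, and how getUniqueNode would actually use the result is left open.

```r
# Hypothetical helper (not part of RNeo4j): classifies properties passed via
# dots as exact matches or fuzzy matches written as "pattern" ~ similarity.
parse_props <- function(...) {
  props <- list(...)
  lapply(props, function(p) {
    if (inherits(p, "formula") && length(p) == 3) {
      # Two-sided formula: left side is the pattern, right side the similarity.
      list(value = eval(p[[2]], environment(p)),
           similarity = eval(p[[3]], environment(p)),
           fuzzy = TRUE)
    } else {
      # Anything else is treated as an exact-match value.
      list(value = p, similarity = NA_real_, fuzzy = FALSE)
    }
  })
}

spec <- parse_props(name = "pattern" ~ 0.2, code = "001")
spec$name$value       # "pattern"
spec$name$similarity  # 0.2
spec$code$fuzzy       # FALSE
```

A dispatcher could then route to the fuzzy search whenever any element has `fuzzy = TRUE`, and fall back to the existing exact lookup otherwise.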

This would be better as a pull request.