gernotkogler/xapian_db

Add support for faceting multi-valued attributes

Closed this issue · 3 comments

For instance, a Person with an array of tags. I believe each tag is being indexed at the moment – the issue is when faceting.

I'm on #xapian to discuss.

That's right. If you have a tags method on Person that returns an array of strings, each string becomes a term that you can search for, e.g.

person.tags = %w(ruby rails)
person.save
doc = Person.search "rails" # -> document representing person

However, attributes are stored "as is", meaning doc.tags == ["ruby", "rails"]. Since a facet search simply groups the hits by the values of a given attribute, Person.facets :tags, "ruby" therefore returns { ["ruby", "rails"] => 1 }, not { "ruby" => 1 } as you might expect.
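
To make the grouping behaviour concrete, here is a plain-Ruby sketch that does not touch Xapian at all. The `hits` array and both helper methods are made up for illustration; they only mimic the difference between grouping by the whole attribute value and counting each tag on its own.

```ruby
# Hypothetical stand-in for search hits: each hit carries its tags as stored.
hits = [
  { name: "Alice", tags: ["ruby", "rails"] },
  { name: "Bob",   tags: ["ruby", "rails"] },
  { name: "Carol", tags: ["ruby"] }
]

# Grouping by the whole attribute value: one facet per distinct array.
def facets_by_value(hits)
  hits.each_with_object(Hash.new(0)) { |hit, counts| counts[hit[:tags]] += 1 }
end

# Counting every tag separately: what you would expect from a tag facet.
def facets_by_tag(hits)
  hits.each_with_object(Hash.new(0)) do |hit, counts|
    hit[:tags].uniq.each { |tag| counts[tag] += 1 }
  end
end

by_value = facets_by_value(hits)
by_tag   = facets_by_tag(hits)
```

With this sample data, `by_value` has the arrays themselves as keys, while `by_tag` has one entry per tag.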

This is how the collapse key option in Xapian works (see the Xapian API docs). If you have a suggestion for how to improve facet queries in xapian_db, please let me know. A pull request would be even better :-)

I've just been bitten by this while trying to move from https://github.com/ryanb/xapit (which does support it), in particular storing tags for articles.

I'm trying to think of solutions. What is running through my head at the moment is to store the array of tags in some known structure that allows searching for them later (at the moment I can't search for them, because they're stored as a JSON block). Something like this:

object1.tags = ["A", "B"]
# => to Xapian as a string field with a 0 weight like " | A | B | "
object2.tags = ["A", "C"]
# => to Xapian as a string field with a 0 weight like " | A | C | "
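
A minimal sketch of that encoding in plain Ruby, assuming the delimiter is exactly `" | "` and that no tag contains it. The names `DELIMITER`, `encode_tags` and `decode_tags` are hypothetical and not part of xapian_db.

```ruby
# Assumed delimiter; a tag containing " | " would break the round trip.
DELIMITER = " | "

# Wrap every tag in delimiters, including the outer edges.
def encode_tags(tags)
  DELIMITER + tags.join(DELIMITER) + DELIMITER
end

# Split an encoded string back into its tags, dropping blank pieces
# (String#split leaves an empty first element because of the leading delimiter).
def decode_tags(encoded)
  encoded.split(DELIMITER).map(&:strip).reject(&:empty?)
end

encoded_ab = encode_tags(["A", "B"])
decoded_ab = decode_tags(encoded_ab)
```

Putting the delimiter on the outer edges as well means every tag is surrounded by the same marker on both sides, which is what the search idea further down relies on.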

MyClass.facets(:tags, "something") will then return facets whose keys are strings like the ones above. I can turn those into per-tag counts (rather than counts per unique combination of tags) using something like this (rough pseudo-code, untested):

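facets = Hash.new(0)
returned_facets.each do |encoded_tags, count|
  # split leaves an empty first element (and compact would only drop nils),
  # so reject blank strings explicitly
  tags = encoded_tags.split(" | ").reject { |tag| tag.strip.empty? }
  tags.each do |tag|
    facets[tag] += count
  end
end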

Then when I want to search for items containing "something" and tagged with "A", I can do something like:

MyClass.search('something AND tags:contain("A")')
# Which internally gets converted to something like => MyClass.search('something AND tags:*" | A | "*')
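
The reason for matching on the delimited form rather than on the bare tag can be shown in plain Ruby. This is only a sketch of the string-matching idea; `tagged_with?` is a hypothetical helper, and whether Xapian's query parser can express the same match is a separate question.

```ruby
# Hypothetical helper: true only if the encoded string contains the exact tag,
# because the tag must be enclosed by a delimiter on both sides.
def tagged_with?(encoded_tags, tag)
  encoded_tags.include?(" | #{tag} | ")
end

sample = " | hair | haircare | "

exact_match   = tagged_with?(sample, "hair")   # whole tag
partial_match = tagged_with?(sample, "care")   # only part of a tag
naive_match   = sample.include?("care")        # bare substring check
```

Here `exact_match` is true and `partial_match` is false, whereas the bare substring check in `naive_match` is true even though no article is tagged "care".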

Does that make sense? I'm going to try implementing this app side, and if it works, I may try to build it into XapianDb and send a pull request. Something like:

XapianDb::DocumentBlueprint.setup(:Article) do |blueprint|
  blueprint.attribute :title
  blueprint.attribute :tags, multi_value: true
end

as a way of doing it from a library user's point of view, which then will internally do the concatenation and the extraction of the results as above.

What are your thoughts?

This proof of concept seems to work:

require 'rubygems'
require 'date'
require 'xapian_db'

puts "Setting up the demo..."

XapianDb::Config.setup do |config|
  config.adapter :generic
  config.enable_query_flag Xapian::QueryParser::FLAG_PHRASE
  config.enable_query_flag Xapian::QueryParser::FLAG_SPELLING_CORRECTION
  config.enable_query_flag Xapian::QueryParser::FLAG_WILDCARD
  config.enable_query_flag Xapian::QueryParser::FLAG_BOOLEAN
  config.enable_query_flag Xapian::QueryParser::FLAG_BOOLEAN_ANY_CASE
end

# 1: Open an in memory database
db = XapianDb.create_db

# 2: Define a class which should get indexed; we define a class that
# could be an ActiveRecord or Datamapper Domain class
class Article

  attr_accessor :id, :title, :tags

  def initialize(data)
    @id, @title, @tags = data[:id], data[:title], data[:tags]
  end

  def tags
    " | " + @tags.join(" | ") + " | "
  end

end

# 3: Configure the generic adapter with a unique key expression
XapianDb::Adapters::GenericAdapter.unique_key do
  "#{self.class}-#{self.id}"
end

# 4: Define a document blueprint for our class; the blueprint describes
# the structure of all documents for our class. Attribute values can
# be accessed later for each retrieved doc. Attributes are indexed
# by default.
XapianDb::DocumentBlueprint.setup(:Article) do |blueprint|
  blueprint.attribute :title
  blueprint.attribute :tags
end

# 5: Let's create some objects
article_1 = Article.new id: 1, title: "How to tie a tie", tags: ["grooming", "clothes"]
article_2 = Article.new id: 2, title: "How to shave",     tags: ["grooming", "hair"]

# 6: Now add them to the database
blueprint = XapianDb::DocumentBlueprint.blueprint_for(:Article)
indexer   = XapianDb::Indexer.new(db, blueprint)
db.store_doc(indexer.build_document_for(article_1))
db.store_doc(indexer.build_document_for(article_2))

# 7: Define a method for converting a string of tags back to a list
def list_to_tags(value)
  value.split(" | ").reject { |key| key == "" }
end

# 8: Get the facets
facets = db.facets(:tags, "how")
calculated_facets = Hash.new(0)
facets.each do |k,v|
  list_to_tags(k).each do |tag|
    calculated_facets[tag] += v
  end
end
puts calculated_facets

# 9: Now let's search for "how" articles with a tag of hair
results = db.search 'how AND tags:"| hair |"'
doc  = results.first
puts "title: #{doc.title}"
puts "tags: #{list_to_tags(doc.tags).join(", ")}"