collectiveidea/awesome_nested_set

Thoughts on parallelizing rebuild to use with bulk upload?

drewB opened this issue · 5 comments

drewB commented

We bulk upload tree data which is very slow out of the box when you are dealing with 20K+ nodes. We were able speed this up a bunch by disabling the callbacks and doing a rebuild after the the inserts (see #326 (comment) for those details). Despite that the actual rebuild is still very slow and we are trying to think of a faster way to do it. Does anyone have any thoughts on how to make it faster? Either improving on what we already developed or allowing for parallel process so we could split up the tree into multiple background jobs.

Below is the scoped rebuild code we are using:

def self.rebuild_tree!(account_id, version)
    # modeled after rebuild! in awesome_nested_set.rb
    # account_id is our scope
    root_nodes = self.where(account_id: account_id, version: version).roots
    indices = {}

    scope = lambda{|node|}
    if acts_as_nested_set_options[:scope]
      scope = lambda{|node|
        scope_column_names.inject(""){|str, column_name|
          str << "AND #{connection.quote_column_name(column_name)} = #{connection.quote(node.send(column_name.to_sym))} "
        }
      }
    end

    set_left_and_rights = lambda do |node|
      # set left
      node[left_column_name] = indices[scope.call(node)] += 1
      # find
      NestSetModel.where(["#{quoted_parent_column_full_name} = ? #{scope.call(node)}", node]).order("#{quoted_left_column_full_name}, #{quoted_right_column_full_name}, id").each{|n| set_left_and_rights.call(n) }
      # set right
      node[right_column_name] = indices[scope.call(node)] += 1
      node.save!(validate: false)
    end

    root_nodes.each do |root_node|
      indices[scope.call(root_node)] ||= 0
      set_left_and_rights.call(root_node)
    end
end

private

def set_zero_left_and_rights
    self[left_column_name] = 0 unless self[left_column_name].present?
    self[right_column_name] = 0 unless self[right_column_name].present?
end
drewB commented

I will say a few ideas I have thought about but not dug into yet:

  1. Use a different sql based algorithm to speed up the rebuild. Something like [this].(https://www.sqlservercentral.com/articles/hierarchies-on-steroids-1-convert-an-adjacency-list-to-nested-sets)
  2. Calculate the lft & rgt values all in memory and then insert them all at once with the lft & rgt values rather than lots of sql updates.

My company has need of this as well, we have multiple large clients who do nightly bulk imports. we need to rebuild everything every day once the imports finish (it takes 10 hours) it would be nice if we could rebuild each client separately.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

drewB commented

We ended up calculating the lft & rgt values all in memory and then insert them all at once with the lft & rgt values rather than lots of sql updates. Rebuilds that used to take over an hour now take less than 30 seconds.

Here is the code:

class OrgTreeNode < ApplicationRecord

  acts_as_nested_set scope: [:rosterable_type, :rosterable_id, :version]

  def self.rebuild_tree!(rosterable, version)
    root_nodes = rosterable.org_tree_nodes.where(version: version).roots
    calculator = NestedSetCalculator.new(root_nodes.pluck(:id))

    get_parent_and_ids_for_tree(rosterable, version).each do |parent_id, child_id|
      calculator.add_relationship(parent_id, child_id)
    end

    calculator.generate_tree
    ActiveRecord::Base.connection.execute(calculator.update_sql_query)
  end

  def self.get_parent_and_ids_for_tree(rosterable, version)
    rosterable.org_tree_nodes.where(version: version).
      where("parent_id is not null").
      pluck("parent_id", "id")
  end
end

class NestedSetCalculator
  # This is used to calculate the lft and rgt nested_set values
  # for OrgTreeNodes
  # 
  # To use first add all the parent/child relationships using add_relationship
  # Then run generate_tree
  # Finally use update_sql_query to generate a query you can use
  # to set all the values

  attr_accessor :nodes

  def initialize(root_ids = [])
    # format of nodes will be:
    #
    # key = org_tree_node_id
    # value = {children_ids: [], lft: int, rgt: int}

    @nodes = {}
    @current_index = 0
    @root_ids = root_ids
  end

  def add_relationship(parent_id, child_id)
    @nodes[parent_id] = { children_ids: [] } unless @nodes[parent_id]
    @nodes[parent_id][:children_ids] << child_id
  end

  def generate_tree
    @root_ids.each do |id|
      generate_sub_tree(id)
    end
  end

  def update_sql_query
    values_list = @nodes.map do |key, value|
      "(#{key},#{value[:lft]},#{value[:rgt]})"
    end.join(', ')

    <<-SQL
      UPDATE org_tree_nodes SET lft = calculated.lft, rgt = calculated.rgt
      FROM (VALUES #{values_list}) AS calculated (id, lft, rgt)
      WHERE org_tree_nodes.id = calculated.id;
    SQL
  end

  private

  def generate_sub_tree(id)
    @nodes[id] = { children_ids: [] } unless @nodes[id]
    @nodes[id][:lft] = @current_index += 1
    @nodes[id][:children_ids].each do |child_id|
      generate_sub_tree(child_id)
    end
    @nodes[id][:rgt] = @current_index += 1
  end
end
stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.