Thoughts on parallelizing rebuild to use with bulk upload?
drewB opened this issue · 5 comments
We bulk upload tree data which is very slow out of the box when you are dealing with 20K+ nodes. We were able speed this up a bunch by disabling the callbacks and doing a rebuild after the the inserts (see #326 (comment) for those details). Despite that the actual rebuild is still very slow and we are trying to think of a faster way to do it. Does anyone have any thoughts on how to make it faster? Either improving on what we already developed or allowing for parallel process so we could split up the tree into multiple background jobs.
Below is the scoped rebuild code we are using:
def self.rebuild_tree!(account_id, version)
# modeled after rebuild! in awesome_nested_set.rb
# account_id is our scope
root_nodes = self.where(account_id: account_id, version: version).roots
indices = {}
scope = lambda{|node|}
if acts_as_nested_set_options[:scope]
scope = lambda{|node|
scope_column_names.inject(""){|str, column_name|
str << "AND #{connection.quote_column_name(column_name)} = #{connection.quote(node.send(column_name.to_sym))} "
}
}
end
set_left_and_rights = lambda do |node|
# set left
node[left_column_name] = indices[scope.call(node)] += 1
# find
NestSetModel.where(["#{quoted_parent_column_full_name} = ? #{scope.call(node)}", node]).order("#{quoted_left_column_full_name}, #{quoted_right_column_full_name}, id").each{|n| set_left_and_rights.call(n) }
# set right
node[right_column_name] = indices[scope.call(node)] += 1
node.save!(validate: false)
end
root_nodes.each do |root_node|
indices[scope.call(root_node)] ||= 0
set_left_and_rights.call(root_node)
end
end
private
def set_zero_left_and_rights
self[left_column_name] = 0 unless self[left_column_name].present?
self[right_column_name] = 0 unless self[right_column_name].present?
end
I will say a few ideas I have thought about but not dug into yet:
- Use a different sql based algorithm to speed up the rebuild. Something like [this].(https://www.sqlservercentral.com/articles/hierarchies-on-steroids-1-convert-an-adjacency-list-to-nested-sets)
- Calculate the lft & rgt values all in memory and then insert them all at once with the lft & rgt values rather than lots of sql updates.
My company has need of this as well, we have multiple large clients who do nightly bulk imports. we need to rebuild everything every day once the imports finish (it takes 10 hours) it would be nice if we could rebuild each client separately.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We ended up calculating the lft & rgt values all in memory and then insert them all at once with the lft & rgt values rather than lots of sql updates. Rebuilds that used to take over an hour now take less than 30 seconds.
Here is the code:
class OrgTreeNode < ApplicationRecord
acts_as_nested_set scope: [:rosterable_type, :rosterable_id, :version]
def self.rebuild_tree!(rosterable, version)
root_nodes = rosterable.org_tree_nodes.where(version: version).roots
calculator = NestedSetCalculator.new(root_nodes.pluck(:id))
get_parent_and_ids_for_tree(rosterable, version).each do |parent_id, child_id|
calculator.add_relationship(parent_id, child_id)
end
calculator.generate_tree
ActiveRecord::Base.connection.execute(calculator.update_sql_query)
end
def self.get_parent_and_ids_for_tree(rosterable, version)
rosterable.org_tree_nodes.where(version: version).
where("parent_id is not null").
pluck("parent_id", "id")
end
end
class NestedSetCalculator
# This is used to calculate the lft and rgt nested_set values
# for OrgTreeNodes
#
# To use first add all the parent/child relationships using add_relationship
# Then run generate_tree
# Finally use update_sql_query to generate a query you can use
# to set all the values
attr_accessor :nodes
def initialize(root_ids = [])
# format of nodes will be:
#
# key = org_tree_node_id
# value = {children_ids: [], lft: int, rgt: int}
@nodes = {}
@current_index = 0
@root_ids = root_ids
end
def add_relationship(parent_id, child_id)
@nodes[parent_id] = { children_ids: [] } unless @nodes[parent_id]
@nodes[parent_id][:children_ids] << child_id
end
def generate_tree
@root_ids.each do |id|
generate_sub_tree(id)
end
end
def update_sql_query
values_list = @nodes.map do |key, value|
"(#{key},#{value[:lft]},#{value[:rgt]})"
end.join(', ')
<<-SQL
UPDATE org_tree_nodes SET lft = calculated.lft, rgt = calculated.rgt
FROM (VALUES #{values_list}) AS calculated (id, lft, rgt)
WHERE org_tree_nodes.id = calculated.id;
SQL
end
private
def generate_sub_tree(id)
@nodes[id] = { children_ids: [] } unless @nodes[id]
@nodes[id][:lft] = @current_index += 1
@nodes[id][:children_ids].each do |child_id|
generate_sub_tree(child_id)
end
@nodes[id][:rgt] = @current_index += 1
end
end
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.