clowne-rb/clowne

Nested include_association duplication

Opened this issue · 7 comments

Hello.
In my app I have a model schema similar to the following:

class Table
  has_many :rows
  has_many :columns
end

class Row
  belongs_to :table
  has_many :cells
end

class Column
  belongs_to :table
  has_many :cells
end

class Cell
  belongs_to :row
  belongs_to :column
end

and cloners:

class TableCloner < Clowne::Cloner
  adapter :active_record

  include_association :rows
  include_association :columns

  finalize do |source, record, **params|
    record.title = "Copy of #{source.title}"
  end
end

class RowCloner < Clowne::Cloner
  adapter :active_record

  include_association :cells
end

class ColumnCloner < Clowne::Cloner
  adapter :active_record

  include_association :cells
end

class CellCloner < Clowne::Cloner
  adapter :active_record
end

Unfortunately, when I try to clone the Table

operation = TableCloner.call(original_table)
operation.persist!

I get a correct table, rows, and columns, but the cells are cloned twice, and both sets seem wrong:
one set of cells has the wrong row_id (from the original_table rows) and the correct column_id (from the cloned_table columns),
and the second set has the correct row_id (from the cloned_table rows) but the wrong column_id (from the original_table columns).
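The duplication can be reproduced with a tiny pure-Ruby sketch (hypothetical `Struct` stand-ins, not Clowne's actual internals): when rows and columns are traversed independently, tree-style, a shared cell is copied once per parent branch.

```ruby
# Hypothetical stand-ins for the models in this issue, not Clowne itself.
Row    = Struct.new(:cells)
Column = Struct.new(:cells)
Table  = Struct.new(:rows, :columns)

cell  = Object.new # a single cell shared by one row and one column
table = Table.new([Row.new([cell])], [Column.new([cell])])

# Tree-style deep clone: rows and columns are traversed independently,
# so the shared cell is copied once per parent branch.
cloned_cells = table.rows.flat_map(&:cells).map(&:dup) +
               table.columns.flat_map(&:cells).map(&:dup)

cloned_cells.size # => 2 copies of what was one cell
```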

It seems that Clowne doesn't understand how to synchronize all these record cloners (a two-dimensional table) and clones the rows and columns associations independently of each other.
Could you help me to understand how to clone such a scheme properly?

Thank you

clowne 1.4.0
ruby 3.0.2
rails 6.1.4

I didn't find anything better than just re-filling cell.column_id in the CellCloner:

after_persist do |source, clone, mapper:, **|
  if source.column_id
    column = mapper.clone_of(source.column)
    clone.update(column_id: column.id)
  end
end

This doesn't sound like a best practice, and it produces a lot of needless SQL queries.

Would love to hear about best practices for this question.

It's not uncommon to have application schemas where child records have a belongs-to relationship to two or more parent records (like the Cell example, belonging to both Column and Row).

hi @nicbet! sorry for the late response; I need some time to figure out the best solution here, so I'll get back to you later ;)

First, I should mention that a DB design where tables reference each other in a circle does not look quite right; it could be a sign that something is off in the design:

             columns --> tables <-- rows
               ^-------- cells ------^

There is no guarantee that the left (columns) and right (rows) chains of relations are always equal:

a_table.rows.flat_map(&:cells) == a_table.columns.flat_map(&:cells)

I guess the correctness of this relation is enforced somewhere at the app level, not at the DB level;

Anyway, let's consider how to clone this;

Option#1: Clone cells through the left or right relation and then restore the other one

As @varg90 mentioned, we have a mapper that can help restore relations #69 (comment)

This solution works but does not scale: an UPDATE is triggered on every after_persist. Depending on transaction usage, the chosen DB, and the number of records, this could lead to significant performance problems.

Option#2: Clone cells separately and set correct FKs using a prepared mapper

Here is an example:

Cloner:

class CellCloner < Clowne::Cloner
  adapter :active_record

  finalize do |source, record, **params|
    mapper = params[:mapper]
    # remap cells to cloned rows and columns
    record.row_id = mapper.clone_of(AR::Row.new(id: source.row_id)).id
    record.column_id = mapper.clone_of(AR::Column.new(id: source.column_id)).id
  end
end

Usage:

ActiveRecord::Base.transaction do
  cloned_table.save!
  mapper = operation.mapper
  # replace with your cells selection logic:
  # cells = Cell.all or
  # cells = table.rows.flat_map(&:cells) or
  # cells = table.columns.flat_map(&:cells) or
  # cells = Cell.where(row_id: table.rows.pluck(:id)) etc.
  cells.each do |cell|
    AR::CellCloner.call(cell, mapper: mapper).to_record.save!
  end
end

NOTE: This solution lets you fully control which cells get copied. As I mentioned above, the DB design allows different sets of cells on the left and right "relation branches".

I believe the solution could be improved using an ActiveRecord bulk insert.
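One way that bulk idea might look (a hypothetical sketch, not from the gem): build the remapped attribute hashes in memory first, then persist all cells with a single `insert_all` call (available since Rails 6). `row_id_map` / `column_id_map` are assumed plain hashes of old id => cloned id, e.g. derived from `operation.mapper`.

```ruby
# Hypothetical helper: remap each cell's FKs through id maps so the
# clones point at the cloned rows/columns instead of the originals.
def remapped_cell_attributes(cells, row_id_map, column_id_map)
  cells.map do |cell|
    {
      value: cell[:value],
      row_id: row_id_map.fetch(cell[:row_id]),
      column_id: column_id_map.fetch(cell[:column_id])
    }
  end
end

# Usage sketch (inside the same transaction as cloned_table.save!):
# attrs = remapped_cell_attributes(cells, row_id_map, column_id_map)
# Cell.insert_all(attrs) # one SQL statement instead of one save! per cell
```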

@nicbet do you think Option#2 works for you?

Thanks for the detailed write-up!

a DB design where tables reference each other in a circle does not look quite right; it could be a sign that something is off in the design

It's sometimes unavoidable, especially when you are working with variable tabular data where the number and/or types of columns/rows are not known a priori, for example when row and column data are created by a user at runtime: any instance of a data grid / data table.

However, I'd argue that the relationship model is not circular: a Cell model has two foreign keys (belongs_to), one to a Row model and one to a Column model, and each of the Row and Column models references upwards to a Table model. The relationship arrows point purely upwards rather than around in a circle, and data integrity is enforced through the foreign keys.

Algorithmically, the data model in such cases is described by a Directed Acyclic Graph (DAG). I believe that's where the point of contention arises: a Cloner assumes that the data model is described by a Tree and will accidentally duplicate records during DFS-traversal for any data model that is not a tree.

From a high level perspective, it lacks the concept of a "memory" of visited records to prevent such duplication.

Conceptually an example approach to cloning a data model described by a DAG could be:

  • Consider the data model as a graph spanned by the model to be cloned and all transitively associated models. In that graph, records are nodes and edges are created by the has_many relationships.
  • Initialize an empty hash map to keep track of the visited records and their clones.
  • Perform a depth-first-search traversal of that graph.
  • For each visited node, create a clone of the node and remember it in the hash map.
  • For each associated record (outgoing edge) of the visited node, check if the associated record (destination node) is already in the hash map.
  • If it is, associate the corresponding clone to the clone of the visited node. If it is not, perform a DFS traversal on the destination node and repeat the previous steps.
  • Once the DFS traversal is complete, return the clone of the starting node.
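The steps above can be sketched in plain Ruby (hypothetical `Node` stand-ins, not Clowne's actual implementation): a memoized DFS where the `visited` hash plays the role of the "memory" of already-cloned records.

```ruby
# Hypothetical graph node standing in for an AR record; `associations`
# are the outgoing has_many edges.
Node = Struct.new(:name, :associations) do
  def shallow_clone
    Node.new(name, []) # copy the record itself, not its edges
  end
end

# Memoized DFS: `visited` maps each source node to its clone, keyed by
# object identity so equal-looking records stay distinct.
def clone_graph(node, visited = {}.compare_by_identity)
  return visited[node] if visited.key?(node)

  clone = node.shallow_clone
  visited[node] = clone
  node.associations.each do |child|
    clone.associations << clone_graph(child, visited)
  end
  clone
end

# Diamond-shaped graph: table -> row and column, both -> the same cell.
cell   = Node.new("cell", [])
row    = Node.new("row", [cell])
column = Node.new("column", [cell])
table  = Node.new("table", [row, column])

copy = clone_graph(table)
row_copy, column_copy = copy.associations
row_copy.associations[0].equal?(column_copy.associations[0]) # => true: cell cloned once
```

Because the shared cell is found in `visited` on the second visit, both cloned parents end up pointing at the same cloned cell, which is exactly the behavior missing from a tree-shaped traversal.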

Option#2: Clone cells separately and set correct FKs using a prepared mapper

Option 2 introduces this missing memory into the operation through the mapper, and is a working solution, although it requires the developer to understand Clowne's internals.

From a DX perspective it would be amazing if Clowne could internally support DAG-type data models out of the box - but that may be a feature for another day.

If Option 2 is what's officially documented and prescribed - that works for me 👍

do you think Option#2 works for you?

I've actually just implemented and tested Option 2 above in a real-world application with two observations:

  1. the block inserting all Cells from the mapper should probably be wrapped in a transaction
  2. performance-wise for real-world tables (a few hundred "Cells") the clone operation probably needs to move into a background job

Thanks again for taking the time to prepare a solution and such a detailed answer.

It's sometimes unavoidable, especially when you are working with variable tabular data where the number and/or types of columns/rows are not known a priori, for example when row and column data are created by a user at runtime: any instance of a data grid / data table.

I see your point: Agreed

Conceptually an example approach to cloning a data model described by a DAG could be:

Consider the data model as a graph spanned by the model to be cloned and all transitively associated models. In that graph, records are nodes and edges are created by the has_many relationships.
Initialize an empty hash map to keep track of the visited records and their clones.
Perform a depth-first-search traversal of that graph.
For each visited node, create a clone of the node and remember it in the hash map.
For each associated record (outgoing edge) of the visited node, check if the associated record (destination node) is already in the hash map.
If it is, associate the corresponding clone to the clone of the visited node. If it is not, perform a DFS traversal on the destination node and repeat the previous steps.
Once the DFS traversal is complete, return the clone of the starting node.

Thanks for sharing your ideas; this is quite an elegant solution; I'm thinking about how it can be added without major changes to the code base.
Basically, the idea is to use the mapper during the cloning process and reuse an already-cloned relation instead of always dup-ing it, because the mapper is exactly the hash map you mentioned above.

At the same time, I prefer explicit behaviour. What do you think about introducing a new DSL?

restore_association <name>, not_found_strategy: [:fail, :skip, :clone], (default :skip)

where

fail - raise an exception if the relation has not already been cloned
skip - put a null object if the relation has not already been cloned
clone - clone the relation if it has not already been cloned

Not sure if we need it for anything except belongs_to.
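For illustration, the CellCloner from this thread might then read something like this (a purely hypothetical sketch of the proposed DSL, not part of Clowne today):

```ruby
class CellCloner < Clowne::Cloner
  adapter :active_record

  # Hypothetical: reuse the already-cloned parents from the mapper
  # instead of cloning them again; fail loudly if they are missing.
  restore_association :row, not_found_strategy: :fail
  restore_association :column, not_found_strategy: :fail
end
```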