mglaman/drupal-typed-data-by-example

Explain how Typed Data can satisfy a Data Sync story


Stop me if you've heard this before.

I have products in my Drupal site, and products in another (maybe not Drupal) site. I want to run a process and sync the product data. Over time, I've learned that you can't just fire and forget that process. There is a strong need for a series of reports that show data that isn't properly synced, and for rigorous reporting of any issues with the sync so we can fix whatever is keeping the data from syncing well. So that's the use case.

When creating data sync reports it's important to answer the following questions:

  • Is there any data in datasource A that is not in datasource B?
  • Is there any data in datasource B that is not in datasource A?
  • When I do an operation (create/update/disable/delete), I need to record detailed error messages / validation errors (from either side/source) if anything goes wrong so I can figure out how to correct it.

and maybe I could optimize performance by doing these checks during the process:

  • Create/update logic: does my product in datasource A exist in datasource B?
  • Skip-update logic: does my product in datasource A have any updates for the product in datasource B?

There are many programmatic approaches to solving the above questions. It would be nice to see a solution that used Typed Data. I am imagining a solution that takes data from each datasource and converts it into a common data type (see the sketch after this list) so that:

  • direct comparisons can be easier
  • the size of the intermediate state could be much smaller than a fully hydrated node
  • there is ultimately less code because of inherited getters/setters
  • it's maybe easier to write code that handles validation of individual properties.
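
Here's a minimal sketch of what I'm imagining, using core's MapDataDefinition. The product fields (sku, title, price, status) are made up for illustration:

```php
<?php

use Drupal\Core\TypedData\DataDefinition;
use Drupal\Core\TypedData\MapDataDefinition;

// Hypothetical common shape for a product, regardless of which datasource
// it came from. Field names are made up for illustration.
$product_definition = MapDataDefinition::create()
  ->setPropertyDefinition('sku', DataDefinition::create('string'))
  ->setPropertyDefinition('title', DataDefinition::create('string'))
  ->setPropertyDefinition('price', DataDefinition::create('float'))
  ->setPropertyDefinition('status', DataDefinition::create('boolean'));

$typed_data_manager = \Drupal::typedDataManager();

// The same definition can wrap a row from either datasource.
$remote_product = $typed_data_manager->create($product_definition, [
  'sku' => 'ABC-123',
  'title' => 'Widget',
  'price' => 9.99,
  'status' => TRUE,
]);
$local_product = $typed_data_manager->create($product_definition, [
  'sku' => 'ABC-123',
  'title' => 'Widget (updated)',
  'price' => 9.99,
  'status' => TRUE,
]);

// Direct comparisons become array comparisons on the same shape.
$differs = $remote_product->toArray() !== $local_product->toArray();
```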

If you've got something like this already thought through, I'd say run with that. If not, I'm eager to help write some documentation on how to do this...as soon as I figure it out.

I'll spend some time to give a proper reply, but at a quick skim, here's a thought/question.

Your concern is about sourcing data and ensuring its consistency.

The Typed Data API would be used at the granular level of each object being processed, not at the level at which the data is sourced and then stored.

Like I said, I'll read again for a more in-depth reply.

I think I understand what you're saying, and I think that's fine.

When considering data-warehouse-based data analysis and reporting best practices, there is value in merely "transporting" data from a remote datasource to a local datasource. If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem was with the original data, then you can fix the problem at the remote end (like how the data is being captured). But if the problem really is how you are improperly transforming the data, then you need to fix it on your end.

All that said, I think Typed Data can help by converting the remote and the local data into a common intermediate state that retains all the data available, plus any calculated or extra fields needed to help make reporting decisions. To fit the specific reports we want, the data will be transformed again into the associative array we can give to a #table or #tableselect element. For performance, we could cache that intermediate state so that multiple reports can use the same data without needing to redo API calls or expensive node loads.
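
For the caching bit, a rough sketch of what I mean; the cache ID, the mymodule_build_intermediate_rows() helper, and $product_definition are all placeholders for whatever the real process uses:

```php
<?php

// Hypothetical cache ID for the compiled intermediate state. Caching the
// plain-array form (toArray()) keeps the cached payload small and easy to
// serialize; typed data rows get rebuilt from the shared definition on read.
$cid = 'mymodule:product_sync:intermediate';

if ($cache = \Drupal::cache()->get($cid)) {
  $rows = $cache->data;
}
else {
  // Placeholder for the expensive work: remote API calls plus local node
  // loads, normalized into plain arrays.
  $rows = mymodule_build_intermediate_rows();
  \Drupal::cache()->set($cid, $rows, \Drupal::time()->getRequestTime() + 3600);
}

// Any report can now re-create typed data rows without hitting the remote
// API or loading nodes again. $product_definition is the shared definition.
foreach ($rows as $values) {
  $datarow = \Drupal::typedDataManager()->create($product_definition, $values);
  // ... hand $datarow to the specific report.
}
```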

I recently built just such a process for a project, and I'm just now starting to think about how to modify it to include Typed Data. With everything else I have going on, I think I might have a week early in the year to explore, then I'll have to move on to other tasks.

> If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem was with the original data, then you can fix the problem at the remote end (like how the data is being captured). But if the problem really is how you are improperly transforming the data, then you need to fix it on your end.

:) This is something I'm working on but it's not public, yet.

You should make the Transform step part of your integration with the Typed Data API. Create a data definition that maps your source expectations and your destination expectations. That way you can validate the source value against your data definition and see whether it's what you expect. Ditto for the destination.
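
A rough sketch of that idea, assuming core's MapDataDefinition and some made-up product fields and constraints (the destination side would get its own definition the same way):

```php
<?php

use Drupal\Core\TypedData\DataDefinition;
use Drupal\Core\TypedData\MapDataDefinition;

// Hypothetical "source expectations": what a row from the remote feed should
// look like before we try to transform it.
$source_definition = MapDataDefinition::create()
  ->setPropertyDefinition('sku', DataDefinition::create('string')
    ->setRequired(TRUE)
    ->addConstraint('Length', ['max' => 64]))
  ->setPropertyDefinition('price', DataDefinition::create('float')
    ->addConstraint('Range', ['min' => 0]));

// $raw_source_values stands in for one decoded row from the remote feed.
$source_row = \Drupal::typedDataManager()->create($source_definition, $raw_source_values);

// Validate the source value against our expectations and record every
// violation so the sync report can say exactly what was wrong, and where.
$errors = [];
foreach ($source_row->validate() as $violation) {
  $errors[] = $violation->getPropertyPath() . ': ' . $violation->getMessage();
}
```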

That's one problem with Migrate right now: validation and error handling.

Source your data however you need. Convert it from JSON, CSV, YAML, etc. to an array. Then write a schema for what the object looks like in each "row". Perform validations on each object/row. Then you can find errors more easily, and maybe skip processing of that one row and continue forward in your processing pipeline.
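
Roughly, that loop could look like the sketch below; $json_payload and $row_definition are placeholders for the real feed and the schema described above:

```php
<?php

$typed_data_manager = \Drupal::typedDataManager();

// $json_payload is the raw response body; the 'products' key is made up.
$decoded = json_decode($json_payload, TRUE);

$valid_rows = [];
$report_errors = [];

foreach ($decoded['products'] as $delta => $values) {
  $row = $typed_data_manager->create($row_definition, $values);
  $violations = $row->validate();

  if ($violations->count() > 0) {
    // Record what failed so the report can surface it, then skip this row
    // and keep the pipeline moving.
    foreach ($violations as $violation) {
      $report_errors[$delta][] = $violation->getPropertyPath() . ': ' . $violation->getMessage();
    }
    continue;
  }
  $valid_rows[] = $row;
}
```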

OK, it sounds like my thinking is going down the right path then. I wish I could like, upload a diagram of the flow to help explain what I've made so far.

  • I have the part that extracts JSON from the remote and nodes from the local Drupal site.
  • I am converting both into a PHP object I call Datarow. I have my getters and setters there. I handle cases where only local or remote data is available. It enables me to properly group data, so I can spit the data out into chunks for different reports. And the object can convert the compiled data into an array so I can use it for #table or #tableselect elements.

Next Steps:

  • Use the examples you have here to add a schema for the Datarow object.
  • Use Typed Data for each of the properties, ensuring the proper data type is used for each.
  • Use the Datarow object for all local data (instead of caching whole nodes), and try to cut out any need to go back to the node for more data (see the sketch after the goals below).

Expected Goals:

  • Reduced code due to cutting out all the getters and setters.
  • Better avenues for handling validation instead of so many custom written simple validation methods.
  • Get more comfortable with the Typed Data API so I can use it for more complicated problems in the future.
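
To illustrate the last hop from the next steps above (Datarow into #table/#tableselect), here's a rough sketch, assuming each Datarow boils down to a Map typed data object with made-up properties like sku, title, and source_name:

```php
<?php

// $datarows is assumed to be a list of Map typed data objects built from the
// shared Datarow definition.
$header = [
  'sku' => t('SKU'),
  'title' => t('Title'),
  'source_name' => t('Source'),
];

$options = [];
foreach ($datarows as $datarow) {
  $values = $datarow->toArray();
  $options[$values['sku']] = [
    'sku' => $values['sku'],
    'title' => $values['title'],
    'source_name' => $values['source_name'],
  ];
}

$form['products'] = [
  '#type' => 'tableselect',
  '#header' => $header,
  '#options' => $options,
  '#empty' => t('Nothing to report.'),
];
```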

Getting back to the original feature request: I think I can help by building some code to explain the use case.

Key bits along the way would include:

  1. Pull product data into a Typed Data enhanced object (I suggest we call it Datarow and not Dimension, otherwise we might confuse the data warehouse nerds).
  2. Show code that pulls data from a remote endpoint into the same object.
  3. Show examples of comparison logic and the helpful bookkeeping properties we would want to include (source_name, destination_report, and others); a rough sketch follows below.
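
For item 3, a sketch of the comparison/bookkeeping step, assuming both sides have been normalized into Map typed data keyed by SKU and that the shared definition includes bookkeeping properties like destination_report:

```php
<?php

// $remote_rows and $local_rows are assumed to be arrays of Map typed data
// objects keyed by SKU, all built from the same shared definition, which
// includes the bookkeeping property destination_report.
foreach ($remote_rows as $sku => $remote_row) {
  if (!isset($local_rows[$sku])) {
    // Data in datasource A that is not in datasource B.
    $remote_row->set('destination_report', 'missing_locally');
    continue;
  }

  // Compare only the synced fields, ignoring bookkeeping properties. Strict
  // comparison assumes both sides were cast to the same primitive types.
  foreach (['title', 'price', 'status'] as $field) {
    if ($remote_row->get($field)->getValue() !== $local_rows[$sku]->get($field)->getValue()) {
      $remote_row->set('destination_report', 'needs_update');
      break;
    }
  }
}

// Whatever is left in $local_rows without a remote counterpart answers the
// "data in B that is not in A" question.
$orphaned_locally = array_diff_key($local_rows, $remote_rows);
```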

Being able to show a use case such as this may be helpful to the migrate developers. It might spark an idea for them to include support for custom reporting on migration runs.

Do you have a "Map" example sounds like that needs to be handled with special attention. Sounds like it's close the array situation I'm looking for.