
A tool for diffing data stores

Apache License 2.0

gsdiff

gsdiff is a diff tool that works across data stores through a global scope.

Suppose an organization has a data lake that comprises various kinds of data stores (e.g. databases, data warehouses, files, and directory services) and possibly cloud data lakes. Each of these data stores has its own format, so the organization faces a common scenario: the shared data need to be kept in one centralized location for regulatory compliance reasons.

Take these records from two data stores as an example (for simplicity, a data munging process has first been performed to transform them into JSON-LD, as shown below):

{
  "@context":
  {
    "@id": "http://xmlns.com/foaf/spec/Organization",
    "@type": "@id",
    "member": [
      {
        "@id": "http://xmlns.com/foaf/spec/Person",
        "@type": "@id",
        "name": "Gary Sit",
        "mbox": "mailto:garyxsit@gmail.com",
        "phone": "tel:+852.23456789",
        "interest": "http://en.wikipedia.org/wiki/Semantic_Web"
      }
    ]
  }
}
{
  "@context":
  {
    "@id": "http://schema.org/Organization",
    "@type": "@id",
    "member": [
      {
        "@id": "http://schema.org/Person",
        "@type": "@id",
        "name": "Gary Sit",
        "email": "mailto:garyxsit@gmail.com",
        "telephone": [
                       "+852 12345678",
                       "+1-617-253-1000"
                     ],
        "address":
        {
          "@type": "PostalAddress",
          "addressLocality": "Hong Kong",
          "addressRegion": "HK"
        }
      }
    ]
  }
}

These two data records differ in the following ways:

  1. The FOAF one has an extra property interest

  2. The schema.org one has an extra property address

  3. The property names differ (mbox in FOAF vs. email in schema.org, and phone in FOAF vs. telephone in schema.org), and so do their value formats (the tel: URI scheme in FOAF vs. formatted strings in schema.org)

  4. The schema.org one includes an additional phone number

  5. One of the records may be out of date (e.g. an employee in the organization switched to a different phone number in another country)

This tool addresses the above problem by comparing data records between data sources and reporting data discrepancies, using a familiar SQL syntax together with a similarity measure:

SELECT * FROM origin
WHERE get_similarity(Oracle.metadata, Cassandra.metadata) >= 0.95

This similarity measure uses statistical models (in the domain of Data Mining ∩ Machine Learning ∩ Deep Learning ∩ Data Science) to compare two inputs and generates a report in YAML when they are highly similar. Here is the sample output:

version: v2
discrepancies:
  variations:
    - property:
        item1:
          name:  "@context.@id" # equals item2.name
          value: http://xmlns.com/foaf/spec/Organization
        item2:
          name:  "@context.@id" # equals item1.name
          value: http://schema.org/Organization
    - property:
        item1:
          name:  "@context.member[:1].@id" # equals item2.name
          value: http://xmlns.com/foaf/spec/Person
        item2:
          name:  "@context.member[:1].@id" # equals item1.name
          value: http://schema.org/Person
    - property:
        item1:
          name:  "@context.member[:1].mbox"
          value: mailto:data.masked@gmail.com # equals item2.value
        item2:
          name:  "@context.member[:1].email"
          value: mailto:data.masked@gmail.com # equals item1.value
    - property:
        item1:
          name:  "@context.member[:1].phone"
          value: tel:+*5*.*3*5*7*9 # data are masked
        item2:
          name:  "@context.member[:1].telephone"
          value: [
                   +*5* *2*4*6*8,  # data are masked
                   +1-*1*-*5*-*0*0 # data are masked
                 ]
  additions:
    - property:
        item:
          name:  "@context.member[:1].interest"
          value: http://en.wikipedia.org/wiki/Semantic_Web
    - property:
        item:
          name:  "@context.member[:1].address"
          value: {
                   "@type": PostalAddress,
                   addressLocality: Hong Kong,
                   addressRegion: HK
                 }

The values of the telephone and address properties are essentially JSON, and thus they can be embedded in the YAML output.
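
For intuition only, here is a minimal sketch, in Python, of one way such a comparison could be approximated; it is not the tool's implementation (which relies on the statistical models above), and the names flatten, jaccard, and get_similarity are merely illustrative. It flattens each record into path/value pairs and averages the overlap of the paths with the overlap of the leaf values, so that records which rename a property but keep its value (e.g. mbox vs. email) still score reasonably.

from typing import Any, Dict

def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    # Turn nested dicts/lists into {"@context.member[0].name": value} pairs.
    flat: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = node
    return flat

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def get_similarity(record1: Dict[str, Any], record2: Dict[str, Any]) -> float:
    # Crude stand-in for the statistical models: average the overlap of
    # property paths with the overlap of leaf values.
    flat1, flat2 = flatten(record1), flatten(record2)
    path_overlap = jaccard(set(flat1), set(flat2))
    value_overlap = jaccard({str(v) for v in flat1.values()},
                            {str(v) for v in flat2.values()})
    return (path_overlap + value_overlap) / 2

In the tool itself, a score at or above the threshold given in the SQL input (0.95 above) selects the pair of records for the YAML report.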

Using this tool lets users focus on business activities (from data wrangling and data masking to encryption processes) instead of manually searching for common data models across the organization's data stores.

Currently it supports these categories:

  • RDBMS (MySQL, PostgreSQL, Oracle, Microsoft SQL Server)

  • NoSQL (Cassandra, MongoDB, Elasticsearch)

  • Data Warehouse (Apache Hive)

  • Tabular data formats (.csv, .tsv) and Excel formats (.xls, .xlsx)

  • Directory Service (Active Directory, Azure AD DS[1])

[1] Note: The implementation uses the RESTful reporting APIs of Azure AD that Microsoft makes available to retrieve data in Azure AD DS, and it is intended for technical reporting and research purposes only.

The source code is available upon request.

Quick Q&A Excerpts

The FOAF namespace is "http://xmlns.com/foaf/0.1/".

That’s true. The JSON-LD @id keyword also accepts an IRI that includes that namespace, although at first glance it may sound as if it must use the URL associated with the namespace.

Data warehouses don’t belong to the data store category, and they’re still competing against data lakes.

It depends on the context. Please read the data warehouse description from IBM:
The term data store has several meanings. In its narrow sense it can refer to an Operational Data Store instance, which sits inside a data warehouse; in its broad sense it is a category of logical data storage representations backed by data storage devices, and in that sense a data warehouse falls into the data store category.
Sometimes this might be interpreted as "the data lake has a data warehouse", but that does not imply a transitive relation in this case. It is like the situation where Alice has a child named Bob and there is a child named Carol: it does not mean Carol is Alice's child.
As for data lakes, the confusion probably arises because many reports compare the two terms (the data warehouse vs. the data lake), which creates a misunderstanding that they are mutually exclusive. Those comparisons do not imply that only one of them can be used. In practice, most modern cloud architectures place a data lake as the data ingestion layer in front of a data warehouse for buffering, because of the nature of using secondary storage in a data warehouse for archival purposes.
To sum up, Wikipedia doesn’t represent everything; don’t judge before knowing the known unknowns and the unknown unknowns.

Syntax errors in your SQL.

The code snippet is a string containing SQL syntactic keywords (without binding semantics yet); it is passed from an actual parameter to a formal parameter and parsed by the tool for processing. It is relatively simpler than one of the corresponding SQL representations:

SELECT * FROM origin
WHERE get_similarity(Oracle.metadata, Cassandra.metadata) >= 0.95

SELECT * FROM origin
FULL OUTER JOIN Oracle.metadata ON origin.id = Oracle.metadata.id
FULL OUTER JOIN Cassandra.metadata ON Oracle.metadata.id = Cassandra.metadata.id
WHERE get_similarity(Oracle.metadata, Cassandra.metadata) >= 0.95

Ok.

The idea is to reuse SQL keywords as the input and then use graph processing libraries for traversing metadata, so it is natural to use this combination to handle large amounts of data for comparison at the same time.
Update: A primitive transpiler has since been written to translate the code based on type inference mechanisms, so the multi-table join SQL now works as well.
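
As a rough illustration of treating the SQL-like string as data to be parsed rather than executed, the sketch below extracts the two operands and the threshold before they would be handed to the comparison step. The regular expression and the variable names are assumptions for illustration, not the tool's actual parser or transpiler.

import re

QUERY = """SELECT * FROM origin
WHERE get_similarity(Oracle.metadata, Cassandra.metadata) >= 0.95"""

# Hypothetical pattern: function name, two dotted operands, and a numeric threshold.
PATTERN = re.compile(
    r"get_similarity\(\s*([\w.]+)\s*,\s*([\w.]+)\s*\)\s*>=\s*([\d.]+)",
    re.IGNORECASE,
)

match = PATTERN.search(QUERY)
if match:
    left_source, right_source = match.group(1), match.group(2)
    threshold = float(match.group(3))
    print(left_source, right_source, threshold)
    # -> Oracle.metadata Cassandra.metadata 0.95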

SQL cannot “elegantly” handle traversing a graph.

As of 2017, that was true, because SQL semantics, viewed in terms of its functions, did not have a sufficient vocabulary to express all graph processing capabilities. That does not mean traversal operations cannot be built on the SQL layer. When the grammar is considered without its context, the connection between SQL syntax and graph semantics, again in terms of functions, becomes closer and can be mapped correspondingly. Several products are therefore already available in the market, although they tend to introduce their own keywords as SQL extensions eventually. After all, we are already in the third decade of the 21st century.

You mentioned there are differences between syntax and semantics in the input. In which cases would they be the same, without considering the context?

That is a metaphor, and we all know their literal meanings are different. However, there is at least one kind of case in which they appear the same. One is that most existing databases report semantic errors as syntax errors in their error messages, which could be interpreted as semantics being equal to syntax. Another is that when searching the web for synonyms of syntactic, the word semantic is listed as one of the synonyms. They are comparatively similar in these situations, since the actual meaning of synonymous is close to similar, but neither equal nor identical.

What you said is a simile rather than a metaphor.

It is a metaphor, since the phrase as long as does not literally mean "nearly the same length as", as it would in a simile. It is true that it can be read another way, since phrases are frequently given a meaning different from their literal one, and thus it has been rephrased and should be clear now.

Do you know as long as has 3 words?

I know. However, the earlier discussion was about whether the sentence is a simile or a metaphor, which means an assumption about its definition had already been made before this question was asked. Let us look into the details:
There are mainly two definitions of a simile on the web, and these created the misunderstanding: one requires as, and the other requires as ... as.
If as was the reference point before asking, then it was interpreted literally. However, that is only one of the sufficient conditions, and there are other necessary conditions for this case. That is, containing as is one of the conditions for classing a sentence as a simile, provided all of the necessary conditions are satisfied. The key to deciding whether something is a simile or a metaphor is its purpose: a simile compares and conveys similarities between items under the assumption that they are similar, whereas the other case compares and conveys whether they are similar under the assumption that they should be different. Therefore, concluding that it is a simile merely by finding an as is not enough.
If as long as was the reference point before asking this question, then there is a contradiction, because an assumption had previously been made that it should be interpreted as a phrase, and the question was asked after making that assumption. It means the assumption changed from interpreting sentences by phrases to interpreting them by words at that moment. Therefore, the assumption made earlier was switched when the question above was asked.
Therefore, neither of these cases can conclude that it is a simile, because looking for as or as ... as satisfies only one of the sufficient conditions. That is also one of the reasons metaphors can only be identified by considering their purpose at the same time: not all metaphors take the form of a subject with a predicate that starts with is (or one of its tensed forms, taken literally) and connects to a noun or an adjective. That pattern is merely one of its forms, just as a multi-table join in SQL can be written in other representations that never mention the JOIN keyword.
In spite of this, it is true that this could create confusion. Therefore, it has now been updated so that we can focus on technical communication.

The report doesn’t output discrepancies for @id.

There would be collision issues with the property name if it included @id and kept using the previous scheme. Therefore, it has been updated to use a JSONPath-based expression as the string instead.

How do you handle data sources without data headers?

The tool applies Data Mining to the content and structure of a data record, since it reads the data in matrix form anyway. It then uses the aforementioned similarity measure to advise on and classify the data into a type based on an a posteriori knowledge base (e.g. column headers combined into a "personal information" type), which makes post-processing efficient; a minimal sketch of this content-based classification follows the list and note below.
Update: It has been refactored to:

  • Enable transparent mappings of XML and JSON for common scenarios in the industry

  • Catalog metadata per data-source instance in a centralized metastore using a pub/sub mechanism, so the metadata can be updated dynamically when triggered by the SQL input, which can be run in a CLI or in the query editor of a web UI

  • Support schema-on-read handling with configurable schema versioning properties[2], including type aliases for structures and namespace mapping functions, to discover insertions of data anomalies from the knowledge base for follow-up

  • Implement a mechanism similar to the SELECT statement syntax of federated queries in Amazon Athena & Redshift Spectrum to compare internal data with excerpts of WWW data and to diff one web page against another

[2] Note: When schema versioning properties are enabled together with type aliases for structures and namespace mapping functions, existing queries are retained and useful information can be searched in the knowledge base without specifying the namespace, which shapes the initial step of building queries for knowledge discovery.
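
Returning to the original question about headerless sources, here is a minimal sketch of content-based classification, under the assumption that simple patterns are enough for a demonstration; the patterns, labels, and function names are hypothetical, and the tool itself relies on the similarity measure and its knowledge base rather than hard-coded rules.

import re
from typing import List

COLUMN_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
    "uri":   re.compile(r"^https?://"),
}

def classify_column(cells: List[str]) -> str:
    # Label a column by the first pattern that every cell matches.
    for label, pattern in COLUMN_PATTERNS.items():
        if cells and all(pattern.match(cell) for cell in cells):
            return label
    return "text"

rows = [
    ["Gary Sit", "garyxsit@gmail.com", "+852 12345678"],
    ["Alice",    "alice@example.org",  "+1-617-253-1000"],
]
columns = list(zip(*rows))  # read the data in matrix form, column by column
types = [classify_column(list(column)) for column in columns]
print(types)  # -> ['text', 'email', 'phone']
if {"email", "phone"} <= set(types):
    print("record type: personal information")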

Pub/sub is for the separation of concerns among distributed systems, but you said those data are stored in a centralized metastore. Don’t they conflict with each other?

That was referring to the publish-subscribe pattern, to be precise. Basically, it is a messaging pattern that involves a message queue component and uses the same set of information (e.g. topic, content, or both) for transferring messages. In this implementation, the data-source information is used as the topic name. The configuration information is stored in the metastore, while the data to be processed are stored separately from it; the message queue component in between reads the configuration information in the metastore dynamically, the data are updated when the tool starts, and they are mapped to XML and JSON when a query is issued.

If you are asking whether it uses a distributed or a centralized model, it is both, since these are merely properties describing characteristics of the system. The distributed property means the same set of software items runs in more than one environment, whereas the centralized property means that set of software items is managed in one place. They are not mutually exclusive, because they refer to the software having particular properties over different sets of features; there would only be a conflict if they implied different functions for the same set of features. In this case there is no need to choose either distributed or centralized, because the tool is designed for both distributed and centralized needs in order to handle a real-world use case.

Whether the data are placed in a centralized location does not depend on the condition above: there are systems that do not store the same set of data inside every environment, depending on the data partitioning schemes used across locations, yet those environments can still be managed in one place without conflict. The tool itself already covers this case and is designed for centralizing common data models from the ground up.
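
A highly simplified, in-memory sketch of that arrangement follows: the topic is the data-source name, the metastore holds only configuration, and subscribers refresh metadata when a message for their topic is published (for example, when a SQL input referencing that source is issued). Every class, function, and configuration key here is an assumption made for illustration.

from collections import defaultdict
from typing import Callable, Dict, List

# Centralized metastore: configuration only, not the data to be processed.
metastore: Dict[str, Dict[str, str]] = {
    "Oracle":    {"format": "xml",  "endpoint": "oracle://example"},
    "Cassandra": {"format": "json", "endpoint": "cassandra://example"},
}

class MessageBus:
    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self.subscribers[topic]:
            handler(message)

def refresh_metadata(message: dict) -> None:
    source = message["source"]
    config = metastore[source]  # read configuration dynamically from the metastore
    print(f"refreshing {source} metadata via {config['endpoint']} ({config['format']})")

bus = MessageBus()
for source in metastore:
    bus.subscribe(source, refresh_metadata)  # one topic per data source

bus.publish("Oracle", {"source": "Oracle"})  # triggered by a SQL input naming Oracle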

Supporting schema versioning means the query needs to be changed as well.

It isn’t necessary to change the query: similar to the way MongoDB works, the query retrieves data using the version of the schema that was current when the query was created, so existing results are preserved. However, it has been updated so that this can be controlled by users, for the flexibility to cover use cases that change several data models at the same time. When enabled, it also provides the benefit of querying related information in the knowledge base and exploring the known unknown data afterwards, since Data Mining specializes in using similarity measures for post-processing.
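
The version-resolution behaviour described above can be sketched as follows, under the assumption that each query records the schema version that was current when it was created and keeps using it unless the user pins another one; SCHEMAS, resolve_schema, and pinned_version are hypothetical names.

from typing import Optional

SCHEMAS = {
    1: {"member": ["name", "mbox", "phone"]},
    2: {"member": ["name", "email", "telephone", "address"]},
}
CURRENT_VERSION = 2

def resolve_schema(created_with: int, pinned_version: Optional[int] = None) -> dict:
    # Default: keep the version the query was created with, preserving existing
    # results; an explicit pin lets users move several data models at once.
    return SCHEMAS[pinned_version if pinned_version is not None else created_with]

print(resolve_schema(created_with=1)["member"])
# -> ['name', 'mbox', 'phone']
print(resolve_schema(created_with=1, pinned_version=CURRENT_VERSION)["member"])
# -> ['name', 'email', 'telephone', 'address']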

Since this supports schema versioning, do you have plans to support data versioning features as well? Just like the AS OF queries that make Oracle dominant in the market.

It’s relatively intuitive in a graph implementation compared with data management solutions in the market: simply organize the data with a time axis as a dimension[3] for each data item, so the list of historical data can easily be retrieved without using flashback queries or hierarchical queries on time columns for the same purpose.

[3] Note: This is still in the area of science discussions instead of science fiction topics.
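
As a sketch of the time-axis idea (with hypothetical names, not the tool's storage layer): every write appends a (timestamp, value) pair per data item, so both the full history and "as of" lookups become ordinary scans rather than flashback or hierarchical queries on time columns.

import bisect
from typing import Any, Dict, List, Tuple

history: Dict[str, List[Tuple[str, Any]]] = {}

def put(key: str, timestamp: str, value: Any) -> None:
    history.setdefault(key, []).append((timestamp, value))
    history[key].sort()  # keep each item's versions ordered along the time axis

def as_of(key: str, timestamp: str) -> Any:
    # Return the latest value written at or before the given timestamp.
    versions = history.get(key, [])
    index = bisect.bisect_right([t for t, _ in versions], timestamp)
    return versions[index - 1][1] if index else None

put("member.phone", "2017-01-01", "tel:+852.23456789")
put("member.phone", "2020-06-15", "+852 12345678")
print(as_of("member.phone", "2018-12-31"))  # -> tel:+852.23456789
print(history["member.phone"])              # the item's full history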

You mentioned Apache Hive, which is a data warehouse, and it has already been adopted by most public cloud service providers. Doesn’t what you are doing here sound like reinventing the wheel?

It is a tool for solving common problems in the area of data processing, and thus it is normal for there to be plenty of overlap.
It was originally planned only as a tool to search for common data models. However, some of its features are now quite different from Apache Hive:

  • Group and classify interrelated data types when new data are inserted

  • Update metadata on demand from the data source specified in the metastore as a centralized data catalog

  • Support generic queries by simplifying query conditions for XML and JSON comparisons

Using existing SQL implementations in the market, users would need several conversions (e.g. formats, data types, and metadata handling) when performing the data wrangling process to extract useful information for comparisons:

SELECT * FROM origin
FULL OUTER JOIN Oracle.metadata ON origin.id = Oracle.metadata.id
FULL OUTER JOIN Cassandra.metadata ON Oracle.metadata.id = Cassandra.metadata.id
WHERE origin.record_item_id = Oracle.metadata.xml.record_item_id
AND Oracle.metadata.xml.record_item_id = Cassandra.metadata.json.record_item_id
/*
   Note: SQL function signatures used below vary depending on the implementation.
   The get_xml_node_value function retrieves the XML node value from the array
   iterator one at a time, the extract function can be an implementation using
   XPath or XML Query since providers rarely conform to the industry standard in
   this area, and the JSON-LD metadata naming convention is converted from @* to
   at_* in the data extraction stage to handle reserved keywords and then converted
   back to @* for an easy-to-read report
*/
AND (
  (
    origin.member_item_id = ANY (
      SELECT get_xml_node_value(extract(ora1_mtdt.xml, "/at_context/member//at_id"))
      FROM Oracle.metadata ora1_mtdt
      WHERE origin.member_id IN (extract(ora1_mtdt.xml, "/at_context//member"))
    )
    AND
    origin.member_item_id <> ANY (
      SELECT json_extract(cs1_mtdt.json, "$.at_context.member[*].at_id")
      FROM Cassandra.metadata cs1_mtdt
      WHERE origin.member_id IN (json_extract(cs1_mtdt.json, "$.at_context.member[*]"))
    )
  ) OR (
    origin.member_item_id <> ANY (
      SELECT get_xml_node_value(extract(ora2_mtdt.xml, "/at_context/member//at_id"))
      FROM Oracle.metadata ora2_mtdt
      WHERE origin.member_id IN (extract(ora2_mtdt.xml, "/at_context//member"))
    )
    AND
    origin.member_item_id = ANY (
      SELECT json_extract(cs2_mtdt.json, "$.at_context.member[*].at_id")
      FROM Cassandra.metadata cs2_mtdt
      WHERE origin.member_id IN (json_extract(cs2_mtdt.json, "$.at_context.member[*]"))
    )
  )
)
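
The comment in the query above mentions renaming JSON-LD keywords from @* to at_* during extraction and back again for the report. Assuming that conversion is a plain key rewrite (a simplification; keys that legitimately begin with at_ would need extra care), a minimal sketch looks like this:

from typing import Any

def rename_keys(node: Any, old: str = "@", new: str = "at_") -> Any:
    # Recursively rewrite dictionary keys that start with `old` to start with `new`.
    if isinstance(node, dict):
        return {
            (new + key[len(old):] if key.startswith(old) else key): rename_keys(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_keys(value, old, new) for value in node]
    return node

record = {"@context": {"@id": "http://schema.org/Person", "member": []}}
extracted = rename_keys(record)                         # @context -> at_context for extraction
restored  = rename_keys(extracted, old="at_", new="@")  # back to @* for the report
print(extracted)
print(restored)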

This kind of problem can now be handled in the following form, benefiting from the dynamic mappings for XML and JSON:

SELECT * FROM origin
WHERE Oracle.metadata.xml.member.@id != Cassandra.metadata.json.member.@id

Shouldn’t there be a get_json_node_value function for JSON query conditions also? And what about data type conversions?

It is not necessary, since the condition retrieves results when an Oracle.metadata item is not equal to any of the Cassandra.metadata items, evaluated one at a time rather than over every combination of Oracle.metadata items and Cassandra.metadata items. As for data type conversions, that depends on the return types of the related SQL functions in different implementations, which is why they were written as a note earlier; it works as expected when those multivalued functions return an array of any type.

Is @context missing in your simplified query?

Their semantics are different, although the results appear the same in this case, since the data records used earlier do not cover cases where the structure being searched is located at a different depth of the hierarchy. It can be written this way without @context partly because of the traversal properties of the internal graph implementation.
The code snippet was written to retrieve the result by traversing data records using the member.@id structure at any depth, because one member.@id structure isn’t necessarily located at the same depth as another. If the code is changed to:

SELECT * FROM origin
WHERE Oracle.metadata.xml.@context.member.@id != Cassandra.metadata.json.@context.member.@id

Note:

This would traverse data records using the @context.member.@id structure, and since @context is the root element, it would not be compared with any of the member.@id items unless @context.member.@id itself is a recursive data structure. The expression used in the earlier code snippet works in a way similar to traversal operations with XPath or XML Query for XML (e.g. //member) and JSONPath for JSON (e.g. $..member[*]). Therefore, the semantics of query conditions need to be generalized in the context of the structure rather than the row in order to perform depth searches on data records for data model comparisons.
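
To make that depth-search behaviour concrete, here is a small illustrative sketch (the function name is hypothetical) of a recursive descent that collects member.@id values at any depth, roughly in the spirit of //member for XPath or $..member[*] for JSONPath, contrasted with a lookup rooted at @context:

from typing import Any, Iterator

def descend_member_ids(node: Any) -> Iterator[str]:
    # Depth-first search: wherever a "member" list appears, yield each item's @id.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "member" and isinstance(value, list):
                for item in value:
                    if isinstance(item, dict) and "@id" in item:
                        yield item["@id"]
            yield from descend_member_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from descend_member_ids(item)

record = {"@context": {"@id": "http://schema.org/Organization",
                       "member": [{"@id": "http://schema.org/Person"}]}}

print(list(descend_member_ids(record)))        # matches member.@id at any depth
print(record["@context"]["member"][0]["@id"])  # lookup rooted at @context.member.@id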