FamilySearch/gedcomx

Adjust Gedcom X to work with RDBMS

Closed this issue ยท 5 comments

Hi,

Gedcom X, as an object supposed to be stored in XML or json, is object-oriented by nature.
This leads to an issue when trying to reflect the data into a Relational DataBase Management System DB, such as MSQL, MySql etc.

Perhaps, ANY type of data, field or complex-type, should have an ID field, to allow for efficient retrieval and storage to and from an RDBMS.

Besides, I tried to look up what IDs should be and what and how Identifiers should be used, whether there are any rules regarding this, and found this subject a bit unsolid.
Please clarify that, and the subject of my issue as well.

Thanks!

There is often a tendency to want to "flatten" GedcomX data so that it can be modeled easily in relational database tables or pipe-delimited flat files. However, genealogical data is inherently complex, multi-valued and hierarchical in nature, so there is no good general-purpose way to map flat fields to and from GedcomX, unless you have very strong constraints on what the structure of your data is allowed to be.

A GedcomX person, for example, can have any number of names, each with multiple name forms (e.g., to support Korean hangul, Korean hanja, and Romanized forms of the same name), each of which can have a full text and multiple parts (each with a part type, such as two given name parts, a surname, and a suffix).

If you happen to know that nobody in your database will have more than one name, more than one name form, and only one part of each type, you might be able to get away with flattening the name out.

In the databases we use to represent GedcomX data, we don't try to store it flat (i.e., as key-value pairs or rows and columns of data). Rather, we have used several ways of storing the structured data with the structure intact. One way this can be done is with BLOBs in relational databases. Another way is to use NoSQL databases such as Cassandra to store the entire data structure (e.g., in compressed JSON or XML). Sometimes a few data items are duplicated into columns to allow for some querying. We have typically used a separate search system for serious searches, in which case we do 'flatten' data back down in a lossy way, which is fine for non-canonical searching.

The other option, of course, is to attempt to create a relational database that models the entire structure of GedcomX. However, there are so many nested lists, each of which would require additional tables and joins, that it is unlikely that this would be an efficient way to operate. Furthermore, it is uncommon to need to query against all of that complexity anyway, which is why it has worked well to bundle things up into BLOBs or other structures, and only put the things into columns that specifically need to be queried.

--Randy Wilson
FamilySearch Information Architect

From: weitzhandler <notifications@github.commailto:notifications@github.com>
Reply-To: FamilySearch/gedcomx <reply@reply.github.commailto:reply@reply.github.com>
Date: Wednesday, January 7, 2015 at 12:43 AM
To: FamilySearch/gedcomx <gedcomx@noreply.github.commailto:gedcomx@noreply.github.com>
Subject: [gedcomx] IDs for each field or complex type (#278)

Hi,

I want to make a database that will reflect GX data.
However I'm not sure how to store all the affiliated data fields that have no ID, if I am to have a master-detail database.
I'm entirely new to genealogy programming and to GX (which looks like the most efficient of whatever I've checked, thanks GXers!).
I find the documentation on what IDs should be and what and how Identifiers should be users and whether there are any rules regarding this a bit unclear, I'd love to read about it.
Anyway, I wish ANY piece of info should have an optional ID field, so it can be stored efficiently in a Relational Database, such as any SQL etc., and retrieved easily.

I'd like to hear anything about it. Especially from OOP-oriented people, aiming to reflect the Object-Relational model to and from a RMDB.

โ€”
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FamilySearch_gedcomx_issues_278&d=AwMCaQ&c=z0adcvxXWKG6LAMN6dVEqQ&r=3fQCgLpzci4__ptOksEV9Y-kP8XVW-94Tz8EbfPispI&m=xUyQ4mVhfvnkzAjNKl4fqcLS7S-38MbC9beBGXW8y5o&s=QQVylfl5p-cUkgHvwrLU7bhXIt0v0Y1sw692T652OGg&e=.

NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

Hi rando, and thanks for your great response.
I haven't actually thought of this idea of splitting up the data, that could actually work just great.
Because I can store the less essential parts of the data as XML (which is natively searchable by the server!).
Great solution!

Another related question is, how do I optimize the date, which is something I would need to be able to lookup data with.

GedcomX has specifications for a formal date, and there is a FormalDate utility class for parsing and creating strings according to the specification.

However, I think you're asking "how do I standardize a date?", i.e., how to come up with a FormalDate from an original date string. A lot of people have done a bunch of work to parse genealogical dates (in various languages) in order to figure out what they mean. Dallan Quass open-sourced his place and name standardizers. I'm not sure if there's an open source date parser around to point you at.

From: weitzhandler <notifications@github.commailto:notifications@github.com>
Reply-To: FamilySearch/gedcomx <reply@reply.github.commailto:reply@reply.github.com>
Date: Wednesday, January 7, 2015 at 5:41 PM
To: FamilySearch/gedcomx <gedcomx@noreply.github.commailto:gedcomx@noreply.github.com>
Cc: Randy Wilson <wilsonr@familysearch.orgmailto:wilsonr@familysearch.org>
Subject: Re: [gedcomx] Adjust Gedcom X to work with RDBMS (#278)

Hi rando, and thanks for your great response.
I haven't actually thought of this idea of splitting up the data, that could actually work just great.
Because I can store the less essential parts of the data as XML (and it's searchable by the server!).
Great solution!

Another related question is, how do I optimize the date, which is something I would need to be able to lookup data with.

โ€”
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FamilySearch_gedcomx_issues_278-23issuecomment-2D69118717&d=AwMCaQ&c=z0adcvxXWKG6LAMN6dVEqQ&r=3fQCgLpzci4__ptOksEV9Y-kP8XVW-94Tz8EbfPispI&m=yEWyYWEQtMilRIxYO_ufbcFCGlrDt5N7G-elutU-CZQ&s=ITCsAP4SNE91vzFvjRpvAWR9LBoEdp5VSJMStGAHplk&e=.

NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

Why is this an open issue? It's not really a bug that can be closed/fixed - more of a user question, it has been answered and is four years old, so more activity is not very likely to happen ๐Ÿ˜„ Just noting that closing old issues might be worth a look.

I can recommend integrating StaleBot, btw.

Lemme close this issue then.