lindsay-stevens/openclinica_sqldatamart

handle multi-select list column names that aren't letters or numbers

Opened this issue · 3 comments

Use case
A study has the following kind of multi-select code list which is valid in OpenClinica.

  • -1: Missing
  • 1: Eggs
  • 2: Ham

Problem
When trying to build datamart (present in 2015.004), the character cleaning function dm_clean_name_string will remove the minus symbol and produce duplicate column names: [item_1, item_1, item_2]. This causes the database build to fail.

Limitations
Postgres, and most clients (SAS, Stata, Access) have object naming requirements that typically don't allow non alphanumeric characters.

Possible Solution
The relevant code is likely to be in dm_create_study_itemgroup_matviews.sql.

Implement detection of invalid code list characters and fall back to another column naming strategy for that portion of the identifier for the entire study. This would be similar to, but should not conflict with, the item name length to item oid fall back currently in place (this multi-select handling should happen first).

An alternative column naming strategy could be to use the code list order number, since it is:

  • a positive integer: contains only valid identifier characters.
  • present in the response options list table: for mapping back to the code list label.
  • unlikely to conflict with item name fall back: code lists tend to have < 100 items.

currently in progress on the branch dev201601

As well as the dev branch work, ideas relevant to this are written out in docs/plans/naming_strategy.md.

A simple solution recently occurred to me, which is probably very common and obvious: replace invalid identifier characters with their UTF-8 hex character code. This is pretty much what OpenClinica does for translations with non-Latin-1 characters.

So for example a hyphen "-" becomes the UTF-8 code "2d", as demonstrated by the following query. Latin-1 range characters translate to 2-character codes, and more elaborate character set codes take up to 5 characters.

SELECT 
  to_hex(ascii('-')) AS hex_hyphen,
  chr(('x' || lpad('2d', 8, '0'))::bit(32)::int) AS string_hyphen,
  to_hex(ascii('😊')) AS hex_smiling_face_with_smiling_eyes,
  chr(('x' || lpad('1F60A', 8, '0'))::bit(32)::int) AS string_smiling_face_with_smiling_eyes

This substitution could be indicated with the customary "u" prefix used in many languages, but also wrapped in underscores to help identify it, since the full customary prefix uses a backslash ("\u") which is an invalid identifier character. Using this method, the set of item names for the choices in the original post would be:

  • item__u2d_1
  • item_1
  • item_2

Pros:

  • Simpler to implement than a new name map system assigning some other kind of identifier,
  • Doesn't break things for existing users since anyone with this issue wouldn't have been able to build DataMart in the first place,
  • Could be handy to implement across DataMart to accommodate instances with i18n'd studies,

Cons:

  • UTF-8 codes are not necessarily meaningful to a consumer of data (could lookup in metadata still though),
  • Adds at least 5 new characters per invalid character, which may cause a name to become too long (need to add some kind of fallback name in the code).

Implemented the above hex code escape solution in 4a5f9eb.

Will close this issue & bump version if it seems to have fixed the problem following feedback from: