Choosing another schema for chado (for testing)
Closed this issue · 10 comments
This was a great suggestion by Valentin Guignon in the Tripal Core Meeting on 2nd June 2020, which I also think could be useful, especially for preserving data in live implementations.
I am, however, wondering how we can go about adding this new functionality. In particular, if we are going to have a different schema for a test version of Chado, how would we link a specific content entity to the test schema? (This might be simple, but I can't tell due to my lack of experience with D8 and T4 at the moment.)
To clarify, it is not so complicated, but it requires some tricks. In fact, there are several possible approaches.
The first one would be the same as in previous Tripal versions: just provide hooks that can alter the Chado schema name in every query.
The second one would be to merge the features of the Tripal Multi-Chado module into Tripal 4 core. I personally would be in favor of that approach, but I understand it can be a bit too complicated for most (new) Tripal developers who are not very familiar with PostgreSQL subtleties (temporary views overriding tables, for instance).
A comment on why including this feature in Tripal would be a good thing. Handling several Chado schemas in one database allows:
- sharing the same configuration across several datasets;
- having one or more private and public datasets on one site; access restrictions can be easily and safely managed through Drupal (no need to write an admin interface);
- unit/functional testing on a blank Chado database (database loading tests, data update tests, ...); without this feature, if you ran tests on an existing Chado schema with data, even using transactions, several things would remain altered after the tests, such as sequences (i.e. index incrementation);
- having Chado sandboxes on your system.
The last one is a requirement for me. On MGIS, we allow banana collection curators to update data about their collections on the "live" site. In fact, I use Multi-Chado to generate an instant copy of the live Chado schema, and they make their modifications on that copy. They can use MGIS on that copy, browse data, and see the changes on the "live copy" site, while the real "live" Chado schema other users see is kept unchanged. Later, after validation, the admin can apply the changes to the live Chado instance so everybody can see the updated data.
And another, separate comment (to ease quoting) about technical details on how to implement multiple Chado instance support.
In Chado-related queries, Chado tables can either be prefixed with the Chado schema name, or left unprefixed while the PostgreSQL "search_path" is changed so that the Chado schema is the first place PostgreSQL looks for tables. So either of the following can be done:
SELECT * FROM chado.feature;
or
SET search_path TO chado,public;
SELECT * FROM feature;
In Tripal, the API is usually used to query the database, and developers won't write SQL queries. So, having a hook that can alter the default schema name in the Tripal API is a solution. Everywhere the Chado schema name is needed, the function tripal_get_schema_name($schema_name) would be used. $schema_name can be either "drupal" (and/or "default" and/or ""?) or "chado", and that function provides a hook to alter the returned value. A hook implementation could, for instance, check a session variable to see which Chado schema to enable.
But sometimes people also write their own SQL queries, and those queries must also be altered in order to select the appropriate Chado schema. That's why there is also a hook to alter Chado SQL queries. With a regular expression, the query can be altered to replace chado.table_name with another Chado schema name.
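As an illustrative sketch (in Python rather than PHP, and not Tripal's actual implementation), such a rewriting step could look like this:

```python
import re

def rewrite_chado_queries(sql, new_schema):
    """Rewrite schema-qualified Chado references so they target another
    schema (e.g. a sandbox copy). The word boundary keeps identifiers
    such as "mychado.feature" or "chado_copy.feature" untouched."""
    return re.sub(r"\bchado\.", new_schema + ".", sql)
```

A real implementation would also need to cope with quoted identifiers and string literals, which a plain substitution ignores.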
At this step, we have a way to make sure every Chado query will hit the right schema. However, as far as I know (did it change in Tripal 4?), to let Drupal see Chado data as entities, we use a "linkage" table that associates a Drupal Chado entity with a Chado table row identifier. Having several schemas means that we can have several Drupal Chado entities pointing to the same Chado table with the same row identifier.
To solve that, either Tripal can natively include the source schema name in the linkage table and ignore (?) entries that do not belong to the current Chado schema, or it does not include that column and an external module like Multi-Chado does the trick. The trick used by Multi-Chado is to "override" the linkage table with a temporary view that has the same columns as the linkage table but queries another table that has the additional column with the Chado schema name. PostgreSQL temporary views can "override" a table as long as the SQL query does not qualify the table with its schema name (in fact, the very first place PostgreSQL searches is the current session's temporary schema, and only then does it use the search_path). Furthermore, temporary views only live for the duration of a PostgreSQL session and are gone once the session is over, so different PostgreSQL connections for different users can have different temporary views overriding the same table at the same time. The temporary view just needs to be created as early as possible, before any query is made on the table (hook_init()).
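To make the mechanics concrete, here is a sketch (in Python, outside any Drupal context) of the kind of statement such a module would issue at session start; the table and column names (entity_id, record_id, schema_name) are illustrative, not the real Tripal ones:

```python
def linkage_override_sql(linkage_table, backing_table, chado_schema):
    """Build the statement that shadows the linkage table for the current
    PostgreSQL session only: the session's temporary schema is searched
    before anything on search_path, so unqualified queries against
    `linkage_table` hit the view instead of the real table."""
    return (
        "CREATE TEMPORARY VIEW " + linkage_table + " AS "
        "SELECT entity_id, record_id "
        "FROM " + backing_table + " "
        "WHERE schema_name = '" + chado_schema + "';"
    )
```

Each connection runs this once with its own chado_schema value, giving per-session views over the same shared backing table.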
Therefore, my recommendation would be to provide a function that returns the current Chado schema name and provides a hook to alter its return value. For future improvements, that function should take a schema name as a parameter so that we could be able, one day, to query several Chado instances at the same time, or even other types of schema (if there is ever an alternative to the Chado schema).
function tripal_get_schema_name($schema_name)
$schema_name: if empty, "default", "drupal", or "public", it will return the "public" schema name. If "chado", it will return the "chado" schema name. For other values, it will return the name provided as-is. The returned value can be altered by hooks.
Should this function also have an "array $context" parameter for future uses that would be provided to hook implementations? I don't see how it could be used, so I would say no.
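The resolution rules above can be sketched as follows (a Python pseudo-implementation; the real function would be PHP, and `alter_hooks` is a hypothetical stand-in for Drupal's hook system):

```python
def tripal_get_schema_name(schema_name, alter_hooks=()):
    """Resolve a logical schema name to a real PostgreSQL schema name.
    `alter_hooks` stands in for Drupal's hook/alter system: each callable
    may return a replacement name (e.g. a per-session sandbox schema)
    or None to leave the name unchanged."""
    if schema_name in ("", "default", "drupal", "public"):
        resolved = "public"
    elif schema_name == "chado":
        resolved = "chado"
    else:
        resolved = schema_name  # other values pass through unchanged
    for hook in alter_hooks:
        resolved = hook(resolved) or resolved
    return resolved
```

A hook could, for example, map "chado" to a user's sandbox copy based on a session variable.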
Every Tripal core query should use that function when the Chado schema is needed (especially when altering search_path). Tripal extensions should be aware of that function, and developers must use it when they write their own SQL queries.
On the Drupal side, Drupal entities linked to Chado data should include the Chado schema name. Entity permissions can be managed to forbid access to entities that are not in the current Chado schema, as a default behavior that could be overridden by Tripal extensions.
[edit] It might also be nice to have a "Chado Connection" entity type used to manage connections to Chado. I don't know if Drupal 8/9 allows modifying the database connection, but if it is possible (like it was with Drupal 7), it would be great, because it enables a higher level of security: schema access can also be restricted at the PostgreSQL user level. It means that if you have a private Chado schema, you could restrict its access to a special user different from the Drupal one, so even if the Drupal database account is compromised, the private Chado schema could remain protected (depending on the implementation of the password management and how PostgreSQL user accounts are configured). See the Tripal MC Chado Connection source code and Drupal connection altering (tripal_mc_connection_init) for details.
...and I also forgot one important thing: with the current approach, other schemas such as frange, genetic_code and so on are shared across Chado schemas. It might be a problem for some people, but I don't know who.
So far, that's what I propose. If I have further thoughts, I'll post them here.
More things:
- Chado schema instantiation: the Chado SQL script used to instantiate Chado should not contain the chado schema name, or it should be possible to alter the schema name easily before the script is run (currently Multi-Chado does it by altering the raw SQL query, which is not really nice).
- It would be nice to provide a function that can clone a Chado schema. I have made such a function for MGIS, which runs a set of SQL queries using the Drupal Batch API. Here is the SQL code:
CREATE OR REPLACE FUNCTION mgis_clone_schema(
source_schema text,
dest_schema text,
include_recs boolean
)
RETURNS void AS
$BODY$
-- This function will clone all sequences, tables, data, views & functions from
-- any existing schema to a new one.
-- If you don't need to clone data, just set the include_recs parameter to FALSE.
-- NOTE: it reads sequence attributes directly from the sequence relation and
-- uses pg_proc.proisagg, so it targets PostgreSQL 9.x (PostgreSQL 10+ moved
-- these to pg_sequences and pg_proc.prokind).
-- SAMPLE CALL:
-- SELECT mgis_clone_schema('public', 'new_schema', TRUE);
DECLARE
src_oid oid;
tbl_oid oid;
func_oid oid;
dbobject text;
buffer text;
srctbl text;
default_ text;
column_ text;
qry text;
dest_qry text;
v_def text;
seqval bigint;
sq_last_value bigint;
sq_max_value bigint;
sq_start_value bigint;
sq_increment_by bigint;
sq_min_value bigint;
sq_cache_value bigint;
sq_log_cnt bigint;
sq_is_called boolean;
sq_is_cycled boolean;
sq_cycled char(10);
BEGIN
-- Check that source_schema exists.
SELECT oid
INTO src_oid
FROM pg_namespace
WHERE nspname = source_schema;
IF NOT FOUND THEN
RAISE NOTICE 'source schema % does not exist!', source_schema;
RETURN;
END IF;
-- Check that dest_schema does not yet exist.
PERFORM TRUE
FROM pg_namespace
WHERE nspname = dest_schema;
IF FOUND THEN
RAISE NOTICE 'dest schema % already exists!', dest_schema;
RETURN;
END IF;
-- Create a new schema.
EXECUTE 'CREATE SCHEMA ' || dest_schema || ';';
-- Create sequences.
FOR dbobject IN
SELECT sequence_name::text
FROM information_schema.sequences
WHERE sequence_schema = source_schema
LOOP
EXECUTE
'CREATE SEQUENCE '
|| dest_schema
|| '.'
|| quote_ident(dbobject)
|| ';'
;
srctbl := source_schema || '.' || quote_ident(dbobject);
EXECUTE
'SELECT
last_value,
max_value,
start_value,
increment_by,
min_value,
cache_value,
log_cnt,
is_cycled,
is_called
FROM '
|| source_schema
|| '.'
|| quote_ident(dbobject)
|| ';'
INTO
sq_last_value,
sq_max_value,
sq_start_value,
sq_increment_by,
sq_min_value,
sq_cache_value,
sq_log_cnt,
sq_is_cycled,
sq_is_called
;
IF sq_is_cycled THEN
sq_cycled := ' CYCLE';
ELSE
sq_cycled := ' NO CYCLE';
END IF;
EXECUTE
'ALTER SEQUENCE '
|| dest_schema
|| '.'
|| quote_ident(dbobject)
|| ' INCREMENT BY '
|| sq_increment_by
|| ' MINVALUE '
|| sq_min_value
|| ' MAXVALUE '
|| sq_max_value
|| ' START WITH '
|| sq_start_value
|| ' RESTART '
|| sq_min_value
|| ' CACHE '
|| sq_cache_value
|| sq_cycled
|| ' ;'
;
buffer := dest_schema || '.' || quote_ident(dbobject);
IF include_recs THEN
EXECUTE
'SELECT setval( '''
|| buffer
|| ''', '
|| sq_last_value
|| ', '
|| sq_is_called
|| ');'
;
ELSE
EXECUTE
'SELECT setval( '''
|| buffer
|| ''', '
|| sq_start_value
|| ', '
|| sq_is_called
|| ');'
;
END IF;
END LOOP;
-- Create tables.
FOR dbobject IN
SELECT TABLE_NAME::text
FROM information_schema.tables
WHERE
table_schema = source_schema
AND table_type = 'BASE TABLE'
LOOP
buffer := dest_schema || '.' || quote_ident(dbobject);
EXECUTE
'CREATE TABLE '
|| buffer
|| ' (LIKE '
|| source_schema
|| '.'
|| quote_ident(dbobject)
|| ' INCLUDING ALL)'
;
IF include_recs THEN
-- Insert records from source table.
EXECUTE
'INSERT INTO '
|| buffer
|| ' SELECT * FROM '
|| source_schema
|| '.'
|| quote_ident(dbobject)
|| ';'
;
END IF;
END LOOP;
-- Add table constraints.
FOR dbobject IN
SELECT TABLE_NAME::text
FROM information_schema.tables
WHERE
table_schema = source_schema
AND table_type = 'BASE TABLE'
LOOP
buffer := dest_schema || '.' || quote_ident(dbobject);
-- Treat non-schema-prefixed sequences.
FOR column_, default_ IN
SELECT
column_name::text,
REPLACE(column_default::text, 'nextval(''', 'nextval(''' || dest_schema || '.')
FROM information_schema.COLUMNS
WHERE
table_schema = dest_schema
AND TABLE_NAME = dbobject
AND column_default LIKE
'nextval(%::regclass)'
AND column_default NOT LIKE
'nextval(''' || source_schema || '.%::regclass)'
LOOP
EXECUTE
'ALTER TABLE '
|| buffer
|| ' ALTER COLUMN '
|| column_
|| ' SET DEFAULT '
|| default_
;
END LOOP;
-- Treat schema-prefixed sequences.
FOR column_, default_ IN
SELECT
column_name::text,
REPLACE(column_default::text, source_schema || '.', dest_schema || '.')
FROM information_schema.COLUMNS
WHERE
table_schema = dest_schema
AND TABLE_NAME = dbobject
AND column_default LIKE
'nextval(''' || source_schema || '.%::regclass)'
LOOP
EXECUTE
'ALTER TABLE '
|| buffer
|| ' ALTER COLUMN '
|| column_
|| ' SET DEFAULT '
|| default_
;
END LOOP;
END LOOP;
-- Add FK constraints.
FOR qry IN
SELECT
'ALTER TABLE '
|| dest_schema
|| '.'
|| quote_ident(rn.relname)
|| ' ADD CONSTRAINT '
|| quote_ident(ct.conname)
|| ' '
|| regexp_replace(
pg_get_constraintdef(ct.oid),
'REFERENCES (' || source_schema || '\.)?',
'REFERENCES ' || dest_schema || '.'
)
|| ';'
FROM pg_constraint ct
JOIN pg_class rn ON rn.oid = ct.conrelid
WHERE
connamespace = src_oid
AND rn.relkind = 'r'
AND ct.contype = 'f'
LOOP
EXECUTE qry;
END LOOP;
-- Create views.
FOR dbobject IN
SELECT table_name::text
FROM information_schema.views
WHERE table_schema = source_schema
LOOP
buffer := dest_schema || '.' || quote_ident(dbobject);
-- Fetch the view definition separately (the loop variable is a scalar).
SELECT view_definition
INTO v_def
FROM information_schema.views
WHERE
table_schema = source_schema
AND table_name = dbobject
;
EXECUTE
'CREATE OR REPLACE VIEW '
|| buffer
|| ' AS '
|| v_def
|| ';'
;
END LOOP;
-- Create functions.
FOR func_oid IN
SELECT oid
FROM pg_proc
WHERE
pronamespace = src_oid
AND proisagg = FALSE
LOOP
SELECT pg_get_functiondef(func_oid) INTO qry;
SELECT replace(qry, source_schema, dest_schema) INTO dest_qry;
EXECUTE dest_qry;
END LOOP;
-- Create aggregate functions.
FOR func_oid IN
SELECT oid
FROM pg_proc
WHERE
pronamespace = src_oid
AND proisagg = TRUE
LOOP
SELECT
'CREATE AGGREGATE '
|| dest_schema
|| '.'
|| p.proname
|| '('
|| format_type(a.aggtranstype, NULL)
|| ') (sfunc = '
|| a.aggtransfn
|| ', stype = '
|| format_type(a.aggtranstype, NULL)
|| CASE
WHEN op.oprname IS NULL THEN ''
ELSE ', sortop = ' || op.oprname
END
|| CASE
WHEN a.agginitval IS NULL THEN ''
ELSE ', initcond = ''' || a.agginitval || ''''
END
|| ')'
INTO dest_qry
FROM pg_proc p
JOIN pg_aggregate a ON a.aggfnoid = p.oid
LEFT JOIN pg_operator op ON op.oid = a.aggsortop
WHERE p.oid = func_oid;
EXECUTE dest_qry;
END LOOP;
RETURN;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
And here is the piece of PHP code in the batch operation, called by the Drupal Batch API, that updates the progress bar:
// Create a new Chado sandbox.
$context['message'] = t('Creating sandbox database. This may take a couple of minutes. Initializing Chado schema.');
if (!isset($context['sandbox']['clone_sandbox_ps'])) {
// Get current database size.
$query = '
SELECT pg_catalog.pg_database_size(d.datname) AS "size"
FROM pg_catalog.pg_database d
WHERE d.datname = current_database();
';
$result = db_query($query);
$db_size = $result->fetchAssoc()['size'];
$context['sandbox']['db_size'] = $db_size;
// Get Chado size.
$query = "
SELECT
SUM(
pg_total_relation_size(
quote_ident(schemaname)
|| '.'
|| quote_ident(tablename)
)
)::BIGINT AS \"size\"
FROM pg_tables
WHERE schemaname = 'chado';
";
$result = db_query($query);
$chado_size = $result->fetchAssoc()['size'];
$context['sandbox']['chado_size'] = $chado_size;
// Clone Chado.
$sql_query = "SELECT mgis_clone_schema('chado', 'chado_" . $coll_code_lower . "', TRUE);";
$cmd =
MGIS_DRUSH_PATH
. ' -r '
. DRUPAL_ROOT
. ' -l '
. $base_url
. ' sqlq '
. escapeshellarg($sql_query);
exec(sprintf('/usr/bin/nohup %s >%s 2>&1 & echo $!', $cmd, drupal_realpath($output_file)), $pid_arr);
$context['sandbox']['clone_sandbox_ps'] = $pid_arr[0];
// A value below 1 enables the callback to be called again for next lines.
$context['finished'] = 0.21;
}
else {
// Check if process is done.
$pid = $context['sandbox']['clone_sandbox_ps'];
// Compute progress.
$query = '
SELECT pg_catalog.pg_database_size(d.datname) AS "size"
FROM pg_catalog.pg_database d
WHERE d.datname = current_database();
';
$result = db_query($query);
$db_size = $result->fetchAssoc()['size'];
$progress = ($db_size - $context['sandbox']['db_size']) / $context['sandbox']['chado_size'];
if (0 >= $progress) {
$progress = 0.02;
}
else if (1 < $progress) {
$progress = 1;
}
// A value below 1 enables the callback to be called again for next lines.
$context['finished'] = 0.20 + (0.60 * $progress);
if (!shell_exec(sprintf("ps h -o pid %d", $pid))) {
// Done. Go to next step.
$context['message'] = t('Creating sandbox database. This may take a couple of minutes. Initializing Chado schema. Done.');
$context['sandbox']['dbinit'] = 3;
// A value below 1 enables the callback to be called again for next lines.
$context['finished'] = 0.79;
}
}
Thank you so much for taking the time to provide all this information @guignonv! I will keep it in mind while working on the Tripal Chado upgrade and see how many of these features we can incorporate, starting with the Chado installer, including the ability to specify the schema name!
For now, I'm focusing my thoughts on:
- Ability to install Chado under a different schema name.
- Upgrading the current hook system for chado queries, perhaps supporting user-associated chado schema names.
- Chado schema name in entity linking tables, with associated Drupal permissions (including an alter hook).
- And of course, associated automated testing.
With backwards compatibility through smart defaults in mind.
Would that combination of features help you towards your goals @guignonv?
For the future, I'm really interested in using a separate chado for testing purposes. I'm also personally interested in sandbox instances so that I can stage unpublished data on KnowPulse.
Sounds good to me! :-)
Completed:
✅ Chado can be installed in a differently named schema, with multiple Chado instances supported by the installer.
✅ The Chado Query API handles multiple Chado installations by allowing the Chado schema name to be passed in and propagated through all commands.
✅ All testing is done in a schema named testchado, and automated testing environments do not even have a schema named chado! This ensures everything works with different schema names :-)
✅ All currently upgraded APIs work with the schema name not being chado.
✅ Upgraded the current hook system for chado queries.
In Progress for first release:
- Tripal Vocabularies will be linked to chado in a flexible manner.
- Tripal Fields will be linked to chado in a flexible manner. (Entities are not linked to chado to allow fields to pull from different schema as needed)
So far, in the current code (at the time of writing), multiple Chado schemas are supported by the Chado API, which is a good thing. I've added in PR #119 the PostgreSQL functions to clone schemas.
I think this discussion started from the idea of using a different Chado schema for testing, while in the end the discussion went a bit wider, to using different Chado schemas for different purposes.
I'd like to distinguish the "test Chado schemas" and "multiple Chado schemas" concerns.
The first one should be discussed in this topic and the other one in a new topic, #120.
So, to get back to the test part, what I see so far would be to generate a test schema on which tests are run. But -point 1- we will need to be able to run tests on different versions of Chado, especially when we want to test a version upgrade procedure. Also, -point 2- we may need to allow tests to be run concurrently (continuous integration with multiple developers) on the same database. Another thing is that -point 3- tests take long to run, and we don't want to add more time by creating a new schema from scratch every time we need to run a test: we need a fast method to generate a test schema. Finally, -point 4- we may want to allow users to run tests on their data, or at least a subset of their data they can share when they want to report issues.
What I propose, to take all these points into account, is to have a function in the ChadoSchema API that would be able to generate a test schema on demand by cloning a template. When a test is run, it would call that function, which will return the name of a brand-new test schema cloned (point 2) from a given template (point 1). The cloning procedure is faster than executing a series of SQL files (point 3), and it would allow users (point 4) to customize templates. If a template is missing, the function should generate one from the SQL files (just once). We could have one template per supported Chado version (point 1).
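The proposed flow could be sketched like this (Python, with injected callables standing in for the actual ChadoSchema/database operations; all names here are hypothetical):

```python
def get_test_schema(version, schema_exists, build_from_sql_files,
                    clone_schema, random_suffix):
    """Return the name of a fresh, isolated test schema for the given
    Chado version. The four callables stand in for real ChadoSchema /
    database operations. The per-version template is built from the
    Chado SQL files only once (point 3); every test then gets its own
    clone, so tests can run concurrently (point 2)."""
    template = "_chado_template_" + str(version)
    if not schema_exists(template):
        build_from_sql_files(template)  # slow path, runs once per version
    test_schema = "_chado_test_" + random_suffix()
    clone_schema(template, test_schema)  # fast path, runs for every test
    return test_schema
```

Users could point the function at a custom template cloned from their own data (point 4) instead of the stock per-version one.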
We will want to prevent users from using schema names reserved for tests. This will be/is achieved by ChadoSchema::isInvalidSchemaName(). Since this function will also be called during tests, and we don't want it to prevent tests from using their reserved schema names, we must add a "test" flag that allows the use of such schema names during tests. That would be the purpose of ChadoSchema::testMode(TRUE), which should be run at the beginning of each test on Chado.
I think every test schema should be generated with a random name (just like Drupal test databases are... well, I think it's a table prefix in fact, but it's the same concept). It should start with "_chado_test" so test schemas can be easily identified (e.g. "_chado_test_gp4tu57un3oe"). Template schemas should start with "_chado_template" (e.g. "_chado_template_13").
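A minimal sketch of such a name generator (Python; the prefix and suffix length follow the convention proposed above, the helper name is hypothetical):

```python
import secrets
import string

TEST_PREFIX = "_chado_test_"

def make_test_schema_name(length=12):
    """Generate a random, easily identifiable test schema name such as
    '_chado_test_gp4tu57un3oe'. Using a cryptographic RNG makes
    collisions between concurrent test runs practically impossible."""
    alphabet = string.ascii_lowercase + string.digits
    return TEST_PREFIX + "".join(secrets.choice(alphabet) for _ in range(length))
```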
Currently, I see 2 other points that would need discussion:
- how to clean up old test schemas when tests are done? If a test crashed, it may not have cleaned up its schema. Should we add a schema expiration date somewhere that could help a cleanup cron know which schemas can be removed? Should the expiration date be specified when calling the test schema creation function?
- it would be nice to select which schema to use as a template when running tests. Of course, some tests may require a specific version, so they would explicitly specify a template, but what if we would like to run tests on real data (either our own data, or from a dump provided by someone who reported an issue, when we would like to see what's going on with their data)? Test datasets are often small to run faster, but sometimes the behavior is different on larger datasets.
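The expiration-based cleanup idea from the first point above could be sketched as follows (Python; a hypothetical helper the cleanup cron would use, assuming expiration dates are recorded somewhere):

```python
from datetime import datetime, timezone

def expired_test_schemas(schemas, now=None):
    """Given a mapping of schema name -> expiration datetime, return the
    test schemas a cleanup cron may safely drop. Crashed tests never get
    to clean up after themselves, so any expired '_chado_test_*' schema
    is fair game; templates and regular schemas are left alone."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, expires in schemas.items()
        if name.startswith("_chado_test_") and expires <= now
    )
```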
I'm closing this as @guignonv implemented it all in his submission of Tripal DBX (originally BioDB). We already have multiple chado support and integration of test versions of chado for automated tests. We can also select which schema to test on and prepopulate it as we would with Tripal prepare.