turbomam/biosample-xmldb-sqldb

Integrate more data, like a SQL representation of MIxS and the EnvO semantic SQL database

In psql, as an administrative user:

create database envo;

Obtain or build the EnvO semantic-sql SQLite file with something like:

curl --output envo.db.gz https://s3.amazonaws.com/bbop-sqlite/envo.db.gz
gunzip envo.db.gz

Back at the OS shell, load it with source and target connection strings like these:

pgloader sqlite://path/to/envo.db pgsql://postgres@localhost:15432/envo
2024-02-28T15:44:16.016000Z LOG pgloader version "3.6.3~devel"
2024-02-28T15:44:17.104019Z LOG Migrating from #<SQLITE-CONNECTION sqlite:///home/mark/gitrepos/semantic-sql/envo.db {10072E61F3}>
2024-02-28T15:44:17.104019Z LOG Migrating into #<PGSQL-CONNECTION pgsql://postgres@localhost:15432/envo {10074B08D3}>
2024-02-28T15:45:56.765711Z LOG report summary reset
                     table name     errors       rows      bytes      total time
-------------------------------  ---------  ---------  ---------  --------------
                          fetch          0          0                     0.000s
                fetch meta data          0         29                     0.056s
                 Create Schemas          0          0                     0.164s
               Create SQL Types          0          0                     0.348s
                  Create tables          0         36                    18.928s
                 Set Table OIDs          0         18                     0.192s
-------------------------------  ---------  ---------  ---------  --------------
               term_association          0          0                     1.392s
      has_oio_synonym_statement          0          0                     1.392s
     anonymous_class_expression          0          0                     2.784s
anonymous_individual_expression          0          0                     2.776s
              owl_complex_axiom          0          0                     4.184s
             rdf_list_statement          0          0                     4.184s
       relation_graph_construct          0          0                     6.136s
                  entailed_edge          0     422947    15.9 MB        1m0.169s
           anonymous_expression          0          0                     0.272s
  anonymous_property_expression          0          0                     1.640s
                owl_restriction          0          0                     2.968s
                         prefix          0       1135    46.9 kB          9.628s
    rdf_level_summary_statistic          0          0                    59.249s
                 subgraph_query          0          0                   1m6.653s
                  repair_action          0          0                   1m8.089s
                lexical_problem          0          0                   1m8.817s
                        problem          0          0                    52.697s
                     statements          0     108160     8.5 MB        1m1.865s
-------------------------------  ---------  ---------  ---------  --------------
        COPY Threads Completion          0          4                  1m10.253s
                 Create Indexes          0         11                    53.501s
         Index Build Completion          0         11                     1.504s
                Reset Sequences          0          0                     1.868s
                   Primary Keys          0          6                     2.972s
            Create Foreign Keys          0          0                     0.000s
                Create Triggers          0          0                     0.328s
               Install Comments          0          0                     0.000s
-------------------------------  ---------  ---------  ---------  --------------
              Total import time          ✓     532242    24.4 MB       2m10.426s
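
Once loaded, the import can be sanity-checked from psql. A minimal sketch, assuming the standard semantic-sql layout, in which rdfs:label assertions live in the statements table with the label text in the value column:

-- Illustrative query; the predicate CURIE and column names follow
-- the usual semantic-sql conventions
SELECT subject, value
FROM statements
WHERE predicate = 'rdfs:label'
  AND value ILIKE '%soil%'
LIMIT 10;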

From ChatGPT

In PostgreSQL, you can't directly query tables from two different databases in a single SQL statement. However, you have a few options to achieve this:

  1. Foreign Data Wrapper (FDW): PostgreSQL supports Foreign Data Wrappers, which allow you to access data from remote databases as if they were local tables. You can set up an FDW to connect to the remote database and then query tables from both databases in a single SQL statement (see the postgres_fdw sketch after this list).

  2. Joining Data in Application Layer: Retrieve data from both databases separately in your application and then join the data programmatically.

  3. Using dblink Extension: PostgreSQL has an extension called dblink that allows you to execute queries on a remote database. You can use this extension to query tables from another database within the same PostgreSQL instance.
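
For option 1, a minimal postgres_fdw sketch; the server name envo_server is an illustrative assumption, while the host, port, and envo database come from the pgloader step above:

-- Install the foreign data wrapper and point it at the envo database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER envo_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '15432', dbname 'envo');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER envo_server
    OPTIONS (user 'postgres', password '<SECRET>');

-- Import the remote tables into a local schema, then join them
-- against local tables like any other relation
CREATE SCHEMA IF NOT EXISTS envo;
IMPORT FOREIGN SCHEMA public FROM SERVER envo_server INTO envo;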

Here's an example using the dblink extension:

-- First, install the dblink extension if not already installed
CREATE EXTENSION IF NOT EXISTS dblink;

-- Query data from the remote database
SELECT *
FROM dblink('dbname=remote_db_name user=remote_db_user password=remote_db_password',
            'SELECT * FROM remote_table') AS remote_table_data(id INT, name TEXT);

-- Query data from the local database
SELECT *
FROM local_table;

-- Join data from both databases
SELECT *
FROM local_table
JOIN (
    SELECT *
    FROM dblink('dbname=remote_db_name user=remote_db_user password=remote_db_password',
                'SELECT * FROM remote_table') AS remote_table_data(id INT, name TEXT)
) AS remote_table_data ON local_table.id = remote_table_data.id;

Replace 'remote_db_name', 'remote_db_user', and 'remote_db_password' with the appropriate credentials for the remote database. Also, replace 'remote_table' and 'local_table' with the names of the tables you want to query.

In PostgreSQL, both databases and schemas are used for organizing and isolating objects, such as tables, views, functions, etc. Understanding when to use one over the other depends on the specific requirements of your application and the level of isolation you need. Here are some considerations:

  1. Isolation Level:

    • Database: Each database in PostgreSQL is completely isolated from other databases. This means data, users, permissions, and configurations are entirely separate. Use separate databases when you need complete isolation between sets of objects, such as when dealing with multiple applications or different stages of development (e.g., production, staging, development).
    • Schema: Schemas provide logical namespaces within a single database. Objects within the same schema can interact with each other easily, sharing access permissions and configurations. Use schemas when you want to partition objects within a single database, such as organizing tables based on functionality or security requirements.
  2. Resource Management:

    • Database: Each database has its own on-disk storage, and maintenance tasks such as vacuuming and statistics collection are tracked per database. Isolating databases can help in resource management, especially in scenarios where different datasets have different resource requirements or need to be managed independently.
    • Schema: Schemas within a single database share the same resource pool. If your application has heavy resource requirements, it might be beneficial to isolate tables into separate databases for better resource management.
  3. Security and Access Control:

    • Database: Access control and security settings are applied at the database level. If you need different sets of users or permissions for different sets of data, separating them into different databases can provide better control.
    • Schema: Permissions can be set at the schema level, allowing fine-grained control over access to specific sets of objects within a single database. Use schemas when you need to control access to different parts of your data within the same database.
  4. Backup and Restore:

    • Database: Backup and restore operations are performed at the database level. Isolating critical data into separate databases can make backup and restore operations more manageable.
    • Schema: Since schemas are part of a single database, backup and restore operations encompass the entire database, including all schemas and objects within it.

In summary, use separate databases when you need complete isolation between sets of objects, such as different applications or environments. Use schemas within a single database when you want to logically partition objects while still sharing resources and configurations.
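
As a concrete illustration of the schema option, EnvO could live in its own schema inside the biosample database, with read access granted per schema. The schema and table names here are illustrative:

-- Hypothetical layout: one schema per data source within one database
CREATE SCHEMA envo;
CREATE TABLE envo.terms (id text PRIMARY KEY, label text);

-- Permissions can then be scoped to the schema rather than the database
GRANT USAGE ON SCHEMA envo TO biosample_guest;
GRANT SELECT ON ALL TABLES IN SCHEMA envo TO biosample_guest;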

Then drop the tables that loaded with zero rows:

DROP TABLE public.anonymous_class_expression;
DROP TABLE public.anonymous_expression;
DROP TABLE public.anonymous_individual_expression;
DROP TABLE public.anonymous_property_expression;
DROP TABLE public.has_oio_synonym_statement;
DROP TABLE public.lexical_problem;
DROP TABLE public.owl_complex_axiom;
DROP TABLE public.owl_restriction;
DROP TABLE public.problem;
DROP TABLE public.rdf_level_summary_statistic;
DROP TABLE public.rdf_list_statement;
DROP TABLE public.relation_graph_construct;
DROP TABLE public.repair_action;
DROP TABLE public.subgraph_query;
DROP TABLE public.term_association;
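
With the empty tables gone, queries can join the remaining tables directly. A minimal sketch, assuming the standard semantic-sql columns and using ENVO:00001998 (soil) as an example root:

-- Walk entailed subClassOf edges under soil and attach labels
SELECT e.subject, s.value AS label
FROM entailed_edge e
JOIN statements s
  ON s.subject = e.subject
 AND s.predicate = 'rdfs:label'
WHERE e.predicate = 'rdfs:subClassOf'
  AND e.object = 'ENVO:00001998'
LIMIT 20;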

MIxS

In the MIxS repo

Start with this, so that structured_patterns are converted into concrete regex patterns:

poetry run gen-linkml \
    --materialize-patterns \
    --no-materialize-attributes \
    --format yaml \
    --output mixs-patterns-materialized.yaml src/mixs/schema/mixs.yaml

The following takes a while to run, and its output will need to be filtered:

date && time poetry run linkml2schemasheets-template  \
    --source-path mixs-patterns-materialized.yaml \
    --output-path mixs-schemasheets-concise.tsv \
    --debug-report-path  mixs-schemasheets-concise-report.txt \
    --log-file mixs-schemasheets-concise-log.txt \
    --report-style concise

real 1m14.981s
user 1m14.681s
sys 0m0.296s

The other --report-style option is exhaustive.

The --debug-report-path output is more succinct than the --log-file output.

Then filter and insert like this:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://postgres:<SECRET>@localhost:15432/ncbi_biosamples_feb26')
destination_table_name = 'mixs_global_slots'

df = pd.read_csv('mixs-schemasheets-concise.tsv', sep='\t')

# Drop the first (non-data) row
df = df[1:]

# Remove rows where the first column value starts with ">"
df = df[~df.iloc[:, 0].astype(str).str.startswith(">")]

# Keep rows with a value for "slot" but no value for "class"
df = df[df['slot'].notnull() & df['class'].isnull()]

# Remove rows where "domain" equals "MixsCompliantData"
df = df[df['domain'] != 'MixsCompliantData']

# Remove columns that are all null
df = df.dropna(axis=1, how='all')

df.to_csv('mixs-schemasheets-concise-global-slots.tsv', sep='\t', index=False)

# Create a table with the same columns as the DataFrame
df.to_sql(destination_table_name, engine, if_exists='replace', index=False)

# Dispose of the engine's connection pool; to_sql commits its own transaction
engine.dispose()
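
After the load, the new table can be spot-checked from psql; the slot column name comes from the schemasheets header row:

-- Confirm the MIxS global slots landed in the biosample database
SELECT slot
FROM mixs_global_slots
LIMIT 10;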

As user postgres:

revoke all privileges on all tables in schema public from biosample;

revoke all privileges on all tables in schema public from biosample_guest;

grant select on all tables in schema public to biosample;

grant select on all tables in schema public to biosample_guest;

grant select on all tables in schema public to biosample_owner;

alter user biosample_owner with superuser;
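
To confirm the grants took effect, a check against the information_schema (the role names come from the statements above):

SELECT grantee, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee IN ('biosample', 'biosample_guest', 'biosample_owner')
ORDER BY grantee, table_name
LIMIT 20;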