Integrate more data, like a SQL representation of MIxS and the EnvO semantic SQL database
In psql, as an administrative user:
create database envo;
Obtain or build the EnvO semantic-sql SQLite file with something like:
curl --output envo.db.gz https://s3.amazonaws.com/bbop-sqlite/envo.db.gz
gunzip envo.db.gz
Back at the OS shell, load it with connection strings like this:
pgloader sqlite://path/to/envo.db pgsql://postgres@localhost:15432/ncbi_biosamples_feb26
2024-02-28T15:44:16.016000Z LOG pgloader version "3.6.3~devel"
2024-02-28T15:44:17.104019Z LOG Migrating from #<SQLITE-CONNECTION sqlite:///home/mark/gitrepos/semantic-sql/envo.db {10072E61F3}>
2024-02-28T15:44:17.104019Z LOG Migrating into #<PGSQL-CONNECTION pgsql://postgres@localhost:15432/envo {10074B08D3}>
2024-02-28T15:45:56.765711Z LOG report summary reset
                     table name     errors       rows      bytes      total time
-------------------------------  ---------  ---------  ---------  --------------
                          fetch          0          0                     0.000s
                fetch meta data          0         29                     0.056s
                 Create Schemas          0          0                     0.164s
               Create SQL Types          0          0                     0.348s
                  Create tables          0         36                    18.928s
                 Set Table OIDs          0         18                     0.192s
-------------------------------  ---------  ---------  ---------  --------------
               term_association          0          0                     1.392s
      has_oio_synonym_statement          0          0                     1.392s
     anonymous_class_expression          0          0                     2.784s
anonymous_individual_expression          0          0                     2.776s
              owl_complex_axiom          0          0                     4.184s
             rdf_list_statement          0          0                     4.184s
       relation_graph_construct          0          0                     6.136s
                  entailed_edge          0     422947    15.9 MB        1m0.169s
           anonymous_expression          0          0                     0.272s
  anonymous_property_expression          0          0                     1.640s
                owl_restriction          0          0                     2.968s
                         prefix          0       1135    46.9 kB         9.628s
    rdf_level_summary_statistic          0          0                    59.249s
                 subgraph_query          0          0                   1m6.653s
                  repair_action          0          0                   1m8.089s
                lexical_problem          0          0                   1m8.817s
                        problem          0          0                    52.697s
                     statements          0     108160     8.5 MB        1m1.865s
-------------------------------  ---------  ---------  ---------  --------------
        COPY Threads Completion          0          4                  1m10.253s
                 Create Indexes          0         11                    53.501s
         Index Build Completion          0         11                     1.504s
                Reset Sequences          0          0                     1.868s
                   Primary Keys          0          6                     2.972s
            Create Foreign Keys          0          0                     0.000s
                Create Triggers          0          0                     0.328s
               Install Comments          0          0                     0.000s
-------------------------------  ---------  ---------  ---------  --------------
              Total import time          ✓     532242    24.4 MB       2m10.426s
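A quick sanity check in psql after the load; the row counts should match the pgloader report above:
-- sanity check against the report above
select count(*) from statements;     -- expect 108160
select count(*) from entailed_edge;  -- expect 422947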
From ChatGPT
In PostgreSQL, you can't directly query tables from two different databases in a single SQL statement. However, you have a few options to achieve this:
- Foreign Data Wrapper (FDW): PostgreSQL supports Foreign Data Wrappers, which allow you to access data from remote databases as if they were local tables. You can set up FDWs to connect to the remote databases and then query the tables from both databases in a single SQL statement. A postgres_fdw sketch follows this list.
- Joining Data in the Application Layer: Retrieve data from both databases separately in your application and then join the data programmatically.
- Using the dblink Extension: PostgreSQL has an extension called dblink that allows you to execute queries on a remote database. You can use this extension to query tables from another database within the same PostgreSQL instance.
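For the FDW option, here is a minimal postgres_fdw sketch. The server name envo_server, the local schema envo_fdw, and the credentials are placeholders; the envo database and port 15432 are the ones used in the pgloader step above.
-- minimal postgres_fdw sketch; envo_server, envo_fdw, and credentials are placeholders
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER envo_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '15432', dbname 'envo');
CREATE USER MAPPING FOR CURRENT_USER
  SERVER envo_server
  OPTIONS (user 'postgres', password '<SECRET>');
-- expose the remote public schema locally under a separate schema
CREATE SCHEMA IF NOT EXISTS envo_fdw;
IMPORT FOREIGN SCHEMA public FROM SERVER envo_server INTO envo_fdw;
-- the foreign tables can now be joined with local tables, e.g.
SELECT count(*) FROM envo_fdw.statements;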
Here's an example using the dblink extension:
-- First, install the dblink extension if not already installed
CREATE EXTENSION IF NOT EXISTS dblink;

-- Query data from the remote database
SELECT *
FROM dblink('dbname=remote_db_name user=remote_db_user password=remote_db_password',
            'SELECT * FROM remote_table') AS remote_table_data(id INT, name TEXT);

-- Query data from the local database
SELECT *
FROM local_table;

-- Join data from both databases
SELECT *
FROM local_table
JOIN (
    SELECT *
    FROM dblink('dbname=remote_db_name user=remote_db_user password=remote_db_password',
                'SELECT * FROM remote_table') AS remote_table_data(id INT, name TEXT)
) AS remote_table_data ON local_table.id = remote_table_data.id;
Replace 'remote_db_name', 'remote_db_user', and 'remote_db_password' with the appropriate credentials for the remote database. Also, replace 'remote_table' and 'local_table' with the names of the tables you want to query.
In PostgreSQL, both databases and schemas are used for organizing and isolating objects, such as tables, views, functions, etc. Understanding when to use one over the other depends on the specific requirements of your application and the level of isolation you need. Here are some considerations:
- Isolation Level:
  - Database: Each database in PostgreSQL is completely isolated from other databases. This means data, users, permissions, and configurations are entirely separate. Use separate databases when you need complete isolation between sets of objects, such as when dealing with multiple applications or different stages of development (e.g., production, staging, development).
  - Schema: Schemas provide logical namespaces within a single database. Objects within the same schema can interact with each other easily, sharing access permissions and configurations. Use schemas when you want to partition objects within a single database, such as organizing tables based on functionality or security requirements.
- Resource Management:
  - Database: PostgreSQL manages resources such as memory and disk space separately for each database. Isolating databases can help in resource management, especially in scenarios where different databases have different resource requirements or need to be managed independently.
  - Schema: Schemas within a single database share the same resource pool. If your application has heavy resource requirements, it might be beneficial to isolate tables into separate databases for better resource management.
- Security and Access Control:
  - Database: Access control and security settings are applied at the database level. If you need different sets of users or permissions for different sets of data, separating them into different databases can provide better control.
  - Schema: Permissions can be set at the schema level, allowing fine-grained control over access to specific sets of objects within a single database. Use schemas when you need to control access to different parts of your data within the same database.
- Backup and Restore:
  - Database: Backup and restore operations are performed at the database level. Isolating critical data into separate databases can make backup and restore operations more manageable.
  - Schema: Since schemas are part of a single database, backup and restore operations encompass the entire database, including all schemas and objects within it.
In summary, use separate databases when you need complete isolation between sets of objects, such as different applications or environments. Use schemas within a single database when you want to logically partition objects while still sharing resources and configurations.
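Following that advice, one hedged option for this integration is to keep the EnvO tables in their own schema inside ncbi_biosamples_feb26 instead of a separate envo database; the schema name envo below is a placeholder, not something done above:
-- hypothetical sketch: move the pgloader-imported EnvO tables out of public into their own schema
create schema if not exists envo;
alter table public.statements set schema envo;
alter table public.entailed_edge set schema envo;
-- both data sources are then addressable in a single statement, without dblink or an FDW
select * from envo.statements limit 5;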
Drop the tables that were imported empty (zero rows in the pgloader report above):
DROP TABLE public.anonymous_class_expression;
DROP TABLE public.anonymous_expression;
DROP TABLE public.anonymous_individual_expression;
DROP TABLE public.anonymous_property_expression;
DROP TABLE public.has_oio_synonym_statement;
DROP TABLE public.lexical_problem;
DROP TABLE public.owl_complex_axiom;
DROP TABLE public.owl_restriction;
DROP TABLE public.problem;
DROP TABLE public.rdf_level_summary_statistic;
DROP TABLE public.rdf_list_statement;
DROP TABLE public.relation_graph_construct;
DROP TABLE public.repair_action;
DROP TABLE public.subgraph_query;
DROP TABLE public.term_association;
MIxS
In the MIxS repo
Start with this so that structured_pattern values are converted into pattern values:
poetry run gen-linkml \
--materialize-patterns \
--no-materialize-attributes \
--format yaml \
--output mixs-patterns-materialized.yaml src/mixs/schema/mixs.yaml
The following takes a while to run, and its output will need to be filtered:
date && time poetry run linkml2schemasheets-template \
--source-path mixs-patterns-materialized.yaml \
--output-path mixs-schemasheets-concise.tsv \
--debug-report-path mixs-schemasheets-concise-report.txt \
--log-file mixs-schemasheets-concise-log.txt \
--report-style concise
real 1m14.981s
user 1m14.681s
sys 0m0.296s
The other --report-style is exhaustive. The debug-report-path is more succinct than the log-file.
Then filter and insert like this:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(
    'postgresql://postgres:<SECRET>@localhost:15432/ncbi_biosamples_feb26')
destination_table_name = 'mixs_global_slots'
df = pd.read_csv('mixs-schemasheets-concise.tsv', sep='\t')
# Drop the first row
df = df[1:]
# Remove rows where the first column value starts with ">"
df = df[~df.iloc[:, 0].astype(str).str.startswith(">")]
# Keep rows with a value for "slot" but no value for "class"
df = df[df['slot'].notnull() & df['class'].isnull()]
# Remove rows where "domain" equals "MixsCompliantData"
df = df[df['domain'] != 'MixsCompliantData']
# Remove columns that are all null
df = df.dropna(axis=1, how='all')
df.to_csv('mixs-schemasheets-concise-global-slots.tsv', sep='\t', index=False)
# Create a table with the same columns as the DataFrame
df.to_sql(destination_table_name, engine, if_exists='replace', index=False)
# Close the connection pool (to_sql has already written and committed the data)
engine.dispose()
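A quick check that the load worked; the table name comes from destination_table_name above:
-- confirm the filtered MIxS slots landed in Postgres
select count(*) from mixs_global_slots;
select * from mixs_global_slots limit 5;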
As the postgres user:
revoke all privileges on all tables in schema public from biosample;

revoke all privileges on all tables in schema public from biosample_guest;

grant select on all tables in schema public to biosample;

grant select on all tables in schema public to biosample_guest;

grant select on all tables in schema public to biosample_owner;

alter user biosample_owner with SUPERUSER;
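Optionally, verify the resulting grants with a read-only query against information_schema:
select grantee, table_name, privilege_type
from information_schema.role_table_grants
where grantee in ('biosample', 'biosample_guest', 'biosample_owner')
order by grantee, table_name;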