Wisser/Jailer

Add schema generation for extracted database

markovendelin opened this issue · 13 comments

Is your feature request related to a problem? Please describe.

I am evaluating Jailer as a tool to extract part of a database for publishing it. In research, when we collect the data in the lab databases as a part of experimental and analysis routine, it would be helpful if I can extract part of a data together with the schema of the related tables (ideally views and functions as well). Jailer looks to be perfect fit. I can define which table data to extract, associated extraction model, and generate SQL statements. Unfortunately, that generated SQL does not include schema creation.

Describe the solution you'd like

When starting Export tool:

  • include an option to generate schema creation statements
    • as suboptions, suggest to include related views, functions
  • allow to save schema creation SQL statements as a separate file for flexibility

Describe alternatives you've considered

Alternative would be to generate schema dump by database utilities and manually remove all unrelated tables. Sounds like an error-rich solution, though.

Additional context

In principle, Jailer could be used as a part of the publishing model for scientific databases. The generated SQL statements could be published as they are or used to generate extracted database and publish that.

Would it really make sense to extract a partial data model without the unrelated tables?
The data model would be incomplete. Wouldn't then the applications that work with the model have problems?

Excellent questions and it would depend on the context. For database application testing, probably not. For data extraction from the database used in scientific lab for publishing, it makes great sense.

In the lab, we are using the same database for linking data coming from different experiments. The database is used by several smaller applications that work only on the part of it. So, if I want to publish datasets describing one particular study, I would like to extract the data that is relevant to that study only. This can be done by Jailer through definition of the model and the relevance judged by the researcher. As a part of extraction, I don't need to define database schema with all possible experiments going on as it is irrelevant in the current publication. It would be preferred to extract only the relevant schema as well.

Please let me know if I missed something in your questions.

I see, that makes sense. Thanks for the insight.
I'll see how this could be realized. I'm just afraid it might be difficult to do this in a DBMS-independent way. Maybe the feature would only be available for some selected DBMSes. Which one would be relevant in this context?

I agree, that could be difficult to do in database-independent way. We are using PostgreSQL. Now thinking of it, maybe I would just need the list of exported tables as I can use pg_dump and specify tables of interest as it's argument as well as dump schema only. That way Jailer would only give the list of tables used in the export and it will be user who would extract schema accordingly.

This information is contained in the generated SQL script and could theoretically be extracted from it:

$ cat demo-scott.sql
-- generated by Jailer 9.5.5, Tue Sep 15 11:48:36 CEST 2020 from RalfW@W46810

-- Extraction Model:  EMPLOYEE where T.NAME='SCOTT'
-- Source DBMS:       H2
-- ...

-- Exported Rows:     13
--    DEPARTMENT                        2 
--    EMPLOYEE                          3 
--    PROJECT                           2 
--    PROJECT_PARTICIPATION             2 
--    ROLE                              2 
--    SALARYGRADE                       2 

...


$ grep '^--    ' demo-scott.sql | sed 's/--\s*//g' | sed 's/\s*[0-9]*//g'
DEPARTMENT
EMPLOYEE
PROJECT
PROJECT_PARTICIPATION
ROLE
SALARYGRADE

However, all tables from which no rows were exported (e.g. because the table in the source-database happens to be empty), but which are still relevant would be missing. Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

vsgfe commented

Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

This list would be useful for me as well. I generate reports about our tables and databases to help manage them. A list of the tables that are transferred between databases using Jailer would be very nice. I could not find an easy way to generate that list.

Perhaps a cli-command would be useful, which would return a list of all tables that are potentially (transitively) related to the subject table?

It would be useful, indeed. Ideally, it should stop associations according to the extraction model. Thus, if we have tables

A -> B -> C

and user disconnected B->C, this would be respected and that CLI tool will not print table C either. Or have such behavior as an option that can be switched on and off.

In the next release there will be the CLI tool "print-closure":

$ jailer.sh
usage:
...
  jailer print-closure <extraction-model> [<separator>] [-datamodel VAL]
    prints a list of all tables that are directly or transitively associated with a subject table,
    taking into account the restrictions on the associations (the so-called "Closure")
    <separator>: optional separator between table names in the output
...

$ jailer.sh print-closure extractionmodel\Demo-Scott.jm
BONUS
DEPARTMENT
EMPLOYEE
SALARYGRADE

$ jailer.sh print-closure extractionmodel\Demo-Scott.jm ", "
BONUS, DEPARTMENT, EMPLOYEE, SALARYGRADE

If you want to test this in advance, you can unzip the file in the attachment and replace the file "jailer.jar" with it.

jailer.zip

vsgfe commented

If you want to test this in advance, you can unzip the file in the attachment and replace the file "jailer.jar" with it.

jailer.zip

I did a quick test with our largest model (118 tables) and it works.
Thank you for this new feature!

Excellent, worked for me as well - exactly as expected. Please feel free to close the issue and thank you very much for your help!

Available in release 9.5.6.

Hi @Wisser

I am trying to use the CLI tool but I get an error:

2022-03-23 11:44:47,788 [main] ERROR  - './extractionmodel/LT_Canada.jm' does not exist
java.io.FileNotFoundException: './extractionmodel/LT_Canada.jm' does not exist
	at net.sf.jailer.extractionmodel.ExtractionModel.loadDatamodelFolder(ExtractionModel.java:522)
	at net.sf.jailer.Jailer.updateDataModelFolder(Jailer.java:383)
	at net.sf.jailer.Jailer.jailerMain(Jailer.java:274)
	at net.sf.jailer.Jailer.main(Jailer.java:149)
Error: java.io.FileNotFoundException: './extractionmodel/LT_Canada.jm' does not exist

Arguments:  0: {print-closure},  1: {./extractionmodel/LT_Canada.jm}

2022-03-23 11:44:47,797 [main] ERROR  - working directory is /opt/jailer-database-tools/lib/app

The model file definitely exist and is in $HOME/.jailer/extractionmodel

I have installed Jailer from the Arch Linux User Repository. It is installed in /opt. The gui works fine but I'm having trouble with the command line tools. Looks like it is using the wrong working directory. Any idea?

Thanks!

Hi @rbeucher

The script jailer.sh changes the working directory to /opt/jailer-database-tools/lib/app, so the path to the extraction model must be absolute.