dracor-org/dracor-api

Specify a manifest format

cmil commented

In order to populate a dracor-api instance with a defined set of documents and corpora we want to design a file format that describes the contents of the DraCor instance. This format should support the following use cases:

  • loading one or more corpora from their GitHub repos
  • loading multiple corpora from their GitHub repos into a new corpus
  • creating a corpus from individual TEI documents available on GitHub

As we discussed earlier, I tested this on several occasions and implemented it in the client:

Some examples (sorry, copied from various sources):

For the original "VeBiDraCor" that was used in the "Small World Paper" I used this notebook https://github.com/dracor-org/vebidracor/blob/main/vebidracor-workflow.ipynb

corpora_to_include = [
    { 
        "corpusname": "als",
        "repository": "https://github.com/dracor-org/alsdracor",
        "commit" : "c87ea41aac9412e4bd84a28e9c7632c53904f77c"
    },
    { 
        "corpusname": "bash",
        "repository" : "https://github.com/dracor-org/bashdracor",
        "commit" : "c16b58ef3726a63c431bb9575b682c165c9c0cbd"
    },
    { 
        "corpusname": "cal",
        "repository": "https://github.com/dracor-org/caldracor",
        "commit": "6cb804d415051d5f18bc4841fa1ce4343a7f0ab5"
    },
    { 
        "corpusname": "fre",
        "repository": "https://github.com/dracor-org/fredracor",
        "commit": "65e93f6ff632b367cdc7e16e3e390956856c4b98",
        "exclude" : ["fre000038", "fre000057", "fre000065", "fre000099"]
    },
    { 
        "corpusname": "ger",
        "repository": "https://github.com/dracor-org/gerdracor",
        "commit": "9135bd4598f54133f23df6edfc983b79f1616fb5",
        "exclude" : ["ger000480"]
    },
    { 
        "corpusname": "greek",
        "repository": "https://github.com/dracor-org/greekdracor",
        "commit": "7397aafa1927c3e0a0720bf3c00bf367ab679f26"
    },
    { 
        "corpusname": "hun",
        "repository": "https://github.com/dracor-org/hundracor",
        "commit": "57e64454a73ffd984ff5fcc1c9b7bc16f3a169f2"
    },
    { 
        "corpusname": "ita",
        "repository": "https://github.com/dracor-org/itadracor",
        "commit": "10c84b416d25a6cbfbb195b9f82f136e482a7093"
    },
    { 
        "corpusname": "rom",
        "repository": "https://github.com/dracor-org/romdracor",
        "commit": "20644eb44f59649721310c3a6d1fd1fe505653d5"
    },
    { 
        "corpusname": "rus",
        "repository": "https://github.com/dracor-org/rusdracor",
        "commit": "6d5b1e5549731538a48684a456006384da206e9a"
    },
    { 
        "corpusname": "shake",
        "repository": "https://github.com/ingoboerner/shakedracor",
        "commit" : "3a420de7d253a505d1d3b8225e6bb6659577d82f"
    },
    { 
        "corpusname": "span",
        "repository": "https://github.com/dracor-org/spandracor",
        "commit": "184ebf975ad9cd674ff37cab44a181fa7ed8d85f"
    },
    { 
        "corpusname": "swe",
        "repository": "https://github.com/dracor-org/swedracor",
        "commit": "0e73db9315c9c8ed64abff7d2053f84e76fcf7ec"
    },
    { 
        "corpusname": "tat",
        "repository": "https://github.com/dracor-org/tatdracor",
        "commit": "5c71364f39f6533baa3a2e04217fd39e0898c851"
    }
    ]
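
To make the step from this notebook list to the later manifest format concrete, here is a minimal sketch of a converter. It is not the client's code: the function name `to_manifest_sources`, the choice of the repository name as source key, and the `exclude_type` default of "id" are my assumptions (the manifest examples in this issue only ever show the type slug).

```python
from urllib.parse import urlparse


def to_manifest_sources(entries, exclude_type="id"):
    """Convert notebook-style entries into a manifest-style 'sources' mapping."""
    sources = {}
    for entry in entries:
        url = entry["repository"]
        # Assumption: the source key is the last path segment of the repository URL.
        key = urlparse(url).path.rstrip("/").split("/")[-1]
        source = {"type": "repository", "url": url, "commit": entry["commit"]}
        if entry.get("exclude"):
            # Assumption: 'exclude_type' is hypothetical; only 'slug' appears in the examples.
            source["exclude"] = {"type": exclude_type, "ids": list(entry["exclude"])}
        sources[key] = source
    return sources


example = [
    {
        "corpusname": "ger",
        "repository": "https://github.com/dracor-org/gerdracor",
        "commit": "9135bd4598f54133f23df6edfc983b79f1616fb5",
        "exclude": ["ger000480"],
    },
    {
        "corpusname": "tat",
        "repository": "https://github.com/dracor-org/tatdracor",
        "commit": "5c71364f39f6533baa3a2e04217fd39e0898c851",
    },
]
sources = to_manifest_sources(example)
print(sorted(sources))
```

Keeping the commit next to the URL is the part that matters for reproducibility; everything else in the mapping is descriptive.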

I reworked that to the manifest format currently used in the stable-dracor-client:
https://github.com/ingoboerner/stable-dracor/blob/df31c4e6b42d0e8c6ba294efe4d26aa473719ab2/notebooks/02_intro.ipynb
(Ctrl+F for manifest; there is also some documentation in the notebook)

Documentation of the system and its components in the manifest
When setting up a local DraCor infrastructure with the stable-dracor-client, the system tries to 'document' itself: the client can generate a data structure, the Manifest, that contains information on the system's components and the composition of all corpora loaded. The objective of the manifest is to describe a local DraCor system so fully that, relying on the manifest alone, the system can be re-created at some later stage. In the following section only the system and the services parts of the manifest are explained. The corpora part will be introduced at a later stage, when corpora have been added to the system.

{'version': 'v1',
 'system': {'id': '7f4f9ec9-40b2-4b92-8f33-5ef83714a12b',
  'name': 'my-stable-dracor',
  'description': 'DraCor system created with the introduction notebook to showcase the features of the stable-dracor-client.',
  'timestamp': '2023-11-23T13:25:31.445797'},
 'services': {'api': {'container': '8c9975f92468',
   'image': 'dracor/dracor-api:v0.90.1-local',
   'version': '0.90.1-2-g19a3f46-dirty',
   'existdb': '6.0.1'},
  'frontend': {'container': 'ac2e4e6d8a73',
   'image': 'dracor/dracor-frontend:v1.6.0-dirty'},
  'metrics': {'container': '3d8cc36cdf62',
   'image': 'dracor/dracor-metrics:v1.2.0'},
  'triplestore': {'container': '35802186a396',
   'image': 'dracor/dracor-fuseki:v1.0.0'}},
 'corpora': {}}

The field version defines the version of the manifest specification, which, in the current state of development, is v1. The field system contains the metadata provided when initializing a new instance (see section Attaching Metadata ...). Additionally, there is a field timestamp that contains the date and time at which the system was described, i.e. the point in time when the manifest was generated by calling the method. The field services contains information on the individual system components; at the least, it allows identifying the Docker image (image) from which the container was created. In the following cell we request the manifest and query for the image of the api service [...]
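
For readers of this issue who do not have the notebook at hand, such a lookup could look roughly like the sketch below. It works on a plain dict shaped like the manifest above rather than on the client itself; the helper name `service_image` is hypothetical.

```python
# A trimmed copy of the manifest shown above (container ids omitted).
manifest = {
    "version": "v1",
    "services": {
        "api": {"image": "dracor/dracor-api:v0.90.1-local", "existdb": "6.0.1"},
        "frontend": {"image": "dracor/dracor-frontend:v1.6.0-dirty"},
    },
    "corpora": {},
}


def service_image(manifest, service):
    """Return the Docker image recorded for a service, or None if it is absent."""
    return manifest.get("services", {}).get(service, {}).get("image")


print(service_image(manifest, "api"))
```

Returning None for an unknown service, instead of raising, is a design choice of this sketch and not something the notebook prescribes.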

The manifest documents the constitution of the added corpora. As explained in the section on the manifest as documentation of the system components, the manifest can be output with the method get_manifest. Loaded corpora are documented in the field corpora. If you followed the notebook to this point, the infrastructure contains three corpora with the names tat, dutch, and kar.

{'tat': {'corpusname': 'tat',
  'timestamp': '2023-11-23T13:25:32.435087',
  'sources': {'tat': {'type': 'api',
    'corpusname': 'tat',
    'url': 'https://dracor.org/api/corpora/tat',
    'timestamp': '2023-11-23T13:25:32.435093',
    'num_of_plays': 3}},
  'num_of_plays': 3},
 'dutch': {'corpusname': 'dutch',
  'timestamp': '2023-11-23T13:25:37.631544',
  'sources': {'dutch': {'type': 'api',
    'corpusname': 'dutch',
    'url': 'http://staging.dracor.org/api/corpora/dutch',
    'timestamp': '2023-11-23T13:25:37.631550',
    'num_of_plays': 1}},
  'num_of_plays': 1},
 'kar': {'corpusname': 'kar',
  'timestamp': '2023-11-23T13:25:41.428662',
  'sources': {'bash': {'type': 'api',
    'corpusname': 'bash',
    'url': 'https://dracor.org/api/corpora/bash',
    'timestamp': '2023-11-23T13:25:41.428668',
    'exclude': {'type': 'slug', 'ids': ['khudayberdin-aq-bilettar']},
    'num_of_plays': 2}},
  'num_of_plays': 2}}

The second corpus (dutch) was copied from the DraCor staging instance at http://staging.dracor.org/, as is documented in the respective part of the manifest:

{'dutch': {'type': 'api',
  'corpusname': 'dutch',
  'url': 'http://staging.dracor.org/api/corpora/dutch',
  'timestamp': '2023-11-23T13:25:37.631550',
  'num_of_plays': 1}}

The field timestamp contains the date and time when the corpus was copied; the value of the field num_of_plays is the number of plays that were copied from the source corpus. For the third corpus that was added, the manifest also contains information about the excluded plays. The field exclude records that the plays with the listed identifiers (ids; the type of the identifiers is slug, meaning the "playname" consisting of author and title) were not copied from the source corpus with the identifier bash at the URL https://dracor.org/api/corpora/bash:

{'corpusname': 'kar',
 'timestamp': '2023-11-23T13:25:41.428662',
 'sources': {'bash': {'type': 'api',
   'corpusname': 'bash',
   'url': 'https://dracor.org/api/corpora/bash',
   'timestamp': '2023-11-23T13:25:41.428668',
   'exclude': {'type': 'slug', 'ids': ['khudayberdin-aq-bilettar']},
   'num_of_plays': 2}},
 'num_of_plays': 2}
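
A consumer of the manifest would need to read these exclusions back per source. The following is a minimal sketch of that, assuming the corpus entry is available as a plain dict; `excluded_plays` is a hypothetical helper, not part of the client.

```python
def excluded_plays(corpus):
    """Map each source name to the identifier type and ids excluded from it."""
    result = {}
    for name, source in corpus.get("sources", {}).items():
        exclude = source.get("exclude")
        if exclude:
            result[name] = {
                "type": exclude.get("type"),
                "ids": list(exclude.get("ids", [])),
            }
    return result


# The 'kar' corpus entry from the manifest above (timestamps omitted).
kar = {
    "corpusname": "kar",
    "sources": {
        "bash": {
            "type": "api",
            "corpusname": "bash",
            "url": "https://dracor.org/api/corpora/bash",
            "exclude": {"type": "slug", "ids": ["khudayberdin-aq-bilettar"]},
            "num_of_plays": 2,
        }
    },
    "num_of_plays": 2,
}
print(excluded_plays(kar))
```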

Bear in mind that the corpora published on the DraCor platform (production and staging) are so-called "living corpora". This means that plays are still being added to some of them and that the encoding can change. Although the manifest records when a corpus was copied and how many plays were available at that point in time, in most cases it will not be possible to re-create the exact same composition of this corpus at some later point in time. Note that when the copy mechanism is used, the manifest alone is not a sufficient source to reproduce the contents of the system. If reproducibility is the goal, the following method of adding data should be used.

When we output the manifest, we see that the type of the source is repository (when copying, it was api; see the previous section) and that the URL of the repository is included as url. In addition to a timestamp that contains the date and time at which the process was initiated, the manifest contains the field commit. When the method is called as in the previous cell, the client fetches the data represented by the most recent commit. A commit represents the state of the data at a given point in time. This means that if we know the commit (and the repository still exists, of course), we can retrieve the data in precisely the state it was in when it was committed.

{'corpusname': 'span',
 'timestamp': '2023-11-23T13:25:48.287503',
 'sources': {'span': {'type': 'repository',
   'corpusname': 'span',
   'url': 'https://github.com/dracor-org/spandracor',
   'commit': '184ebf975ad9cd674ff37cab44a181fa7ed8d85f',
   'timestamp': '2023-11-23T13:25:48.287508',
   'num_of_plays': 25}},
 'num_of_plays': 25}
{'corpusname': 'shake',
 'timestamp': '2023-11-23T13:29:52.227215',
 'sources': {'shake': {'type': 'repository',
   'corpusname': 'shake',
   'url': 'https://github.com/ingoboerner/shakedracor',
   'commit': '3a420de7d253a505d1d3b8225e6bb6659577d82f',
   'timestamp': '2023-11-23T13:29:52.227220',
   'num_of_plays': 37}},
 'num_of_plays': 37}
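
Since only sources of type repository with a commit can be re-created exactly, a manifest format could be checked for sources that are not pinned in this way. Below is a minimal sketch of such a check over the corpora part of a manifest; the function name `unpinned_sources` and the notion of "pinned" are my own shorthand, not terms from the client.

```python
def unpinned_sources(corpora):
    """List (corpus, source) pairs that cannot be re-created from the manifest alone."""
    unpinned = []
    for corpus_name, corpus in corpora.items():
        for source_name, source in corpus.get("sources", {}).items():
            # A source counts as pinned only if it is a repository with a commit hash.
            pinned = source.get("type") == "repository" and bool(source.get("commit"))
            if not pinned:
                unpinned.append((corpus_name, source_name))
    return unpinned


corpora = {
    "tat": {
        "sources": {
            "tat": {"type": "api", "url": "https://dracor.org/api/corpora/tat"}
        }
    },
    "span": {
        "sources": {
            "span": {
                "type": "repository",
                "url": "https://github.com/dracor-org/spandracor",
                "commit": "184ebf975ad9cd674ff37cab44a181fa7ed8d85f",
            }
        }
    },
}
print(unpinned_sources(corpora))
```

Here the tat corpus is reported because it was copied from the API, while span is pinned to a commit.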

You could also look at the manifest of the capek drama corpus here: https://github.com/ingoboerner/stable-dracor/blob/capek/capek.ipynb

{'version': 'v1',
 'system': {'id': '7c3b5b0c-fcf9-4d7e-8f78-81bb3324e70f',
  'name': 'capek',
  'description': 'DraCor system with a corpus of czech plays written by the brothers Josef and Karel Čapek derived from CzDraCor.',
  'timestamp': '2023-06-30T17:57:59.527087'},
 'services': {'api': {'container': '8903093276e8',
   'image': 'dracor/stable-dracor:capek_v1',
   'version': '0.90.1-2-g19a3f46-dirty',
   'existdb': '6.0.1'},
  'frontend': {'container': 'e560033aa154',
   'image': 'dracor/dracor-frontend:v1.6.0-dirty'},
  'metrics': {'container': 'eabe714d3b42',
   'image': 'dracor/dracor-metrics:v1.2.0'},
  'triplestore': {'container': '6ed842bca11f',
   'image': 'dracor/dracor-fuseki:v1.0.0'}},
 'corpora': {'capek': {'corpusname': 'capek',
   'timestamp': '2023-06-30T17:57:32.317048',
   'sources': {'czedracor': {'type': 'repository',
     'url': 'https://github.com/dracor-org/czedracor',
     'commit': '18cc5be4009ad80ce0ff2123dde77158b344b4d6',
     'timestamp': '2023-06-30T17:57:32.317057',
     'num_of_plays': 11}},
   'num_of_plays': 11}}}

An example of the manifest of VeBiDraCor created with the "client":

{'system': {'description': 'DraCor system containing VeBiDraCor – a very big drama corpus with plays from several DraCor corpora',
  'id': '0e10877d-aa33-4a84-b8d8-43d961d0c40e',
  'name': 'vebidracor',
  'timestamp': '2023-07-04T17:24:46.462276'},
 'services': {'api': {'base-image': 'dracor/stable-dracor:vebidracor_v4',
   'existdb': '6.0.1',
   'image': 'dracor/stable-dracor:vebidracor_v4_arm64',
   'version': '0.90.1-2-g19a3f46-dirty'},
  'frontend': {'image': 'dracor/dracor-frontend:v1.6.0-dirty'},
  'metrics': {'image': 'dracor/dracor-metrics:v1.2.0'},
  'triplestore': {'image': 'dracor/dracor-fuseki:v1.0.0'}},
 'corpora': {'vebi': {'corpusname': 'vebi',
   'num_of_plays': '2979',
   'sources': {'alsdracor': {'commit': 'c87ea41aac9412e4bd84a28e9c7632c53904f77c',
     'num-of-plays': '25',
     'timestamp': '2023-07-04T14:04:43.354536',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/alsdracor'},
    'bashdracor': {'commit': 'c16b58ef3726a63c431bb9575b682c165c9c0cbd',
     'num-of-plays': '3',
     'timestamp': '2023-07-04T14:06:14.750666',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/bashdracor'},
    'caldracor': {'commit': '6cb804d415051d5f18bc4841fa1ce4343a7f0ab5',
     'num-of-plays': '205',
     'timestamp': '2023-07-04T14:13:51.889793',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/caldracor'},
    'greekdracor': {'commit': '7397aafa1927c3e0a0720bf3c00bf367ab679f26',
     'num-of-plays': '39',
     'timestamp': '2023-07-04T14:25:36.700070',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/greekdracor'},
    'hundracor': {'commit': '57e64454a73ffd984ff5fcc1c9b7bc16f3a169f2',
     'num-of-plays': '41',
     'timestamp': '2023-07-04T14:28:23.166521',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/hundracor'},
    'itadracor': {'commit': '10c84b416d25a6cbfbb195b9f82f136e482a7093',
     'num-of-plays': '139',
     'timestamp': '2023-07-04T14:33:11.775562',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/itadracor'},
    'romdracor': {'commit': '20644eb44f59649721310c3a6d1fd1fe505653d5',
     'num-of-plays': '36',
     'timestamp': '2023-07-04T14:59:19.492818',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/romdracor'},
    'rusdracor': {'commit': '6d5b1e5549731538a48684a456006384da206e9a',
     'num-of-plays': '212',
     'timestamp': '2023-07-04T15:01:36.587524',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/rusdracor'},
    'spandracor': {'commit': '184ebf975ad9cd674ff37cab44a181fa7ed8d85f',
     'num-of-plays': '25',
     'timestamp': '2023-07-04T15:37:03.986080',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/spandracor'},
    'swedracor': {'commit': '0e73db9315c9c8ed64abff7d2053f84e76fcf7ec',
     'num-of-plays': '68',
     'timestamp': '2023-07-04T15:45:14.108260',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/swedracor'},
    'tatdracor': {'commit': '5c71364f39f6533baa3a2e04217fd39e0898c851',
     'num-of-plays': '3',
     'timestamp': '2023-07-04T15:49:01.415072',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/tatdracor'},
    'gerdracor': {'commit': '9135bd4598f54133f23df6edfc983b79f1616fb5',
     'exclude': {'ids': ['kraus-die-letzten-tage-der-menschheit'],
      'type': 'slug'},
     'num-of-plays': '590',
     'timestamp': '2023-07-04T15:51:07.081096',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/gerdracor'},
    'fredracor': {'commit': '65e93f6ff632b367cdc7e16e3e390956856c4b98',
     'exclude': {'ids': ['anonyme-vende',
       'arnould-heroine-americaine',
       'audinot-dorothee',
       'becque-mere'],
      'type': 'slug'},
     'num-of-plays': '1556',
     'timestamp': '2023-07-04T16:18:15.267918',
     'type': 'repository',
     'url': 'https://github.com/dracor-org/fredracor'},
    'shakedracor': {'commit': '3a420de7d253a505d1d3b8225e6bb6659577d82f',
     'num-of-plays': '37',
     'timestamp': '2023-07-04T17:07:15.212941',
     'type': 'repository',
     'url': 'https://github.com/ingoboerner/shakedracor'}}}},
 'version': 'v1'}
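
Note that this manifest differs from the notebook examples above: the sources use the key num-of-plays with string values, whereas the other manifests use num_of_plays with integers, and the services carry no container field. A specification would have to settle on one spelling. As a minimal sketch, assuming we settle on num_of_plays with integer values, older manifests could be normalized like this (`normalize_counts` is a hypothetical helper):

```python
def normalize_counts(node):
    """Recursively rename 'num-of-plays' to 'num_of_plays' and cast the value to int."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key in ("num-of-plays", "num_of_plays"):
                out["num_of_plays"] = int(value)
            else:
                out[key] = normalize_counts(value)
        return out
    if isinstance(node, list):
        return [normalize_counts(item) for item in node]
    return node


# A fragment of the 'vebi' corpus entry shown above.
vebi = {
    "corpusname": "vebi",
    "num_of_plays": "2979",
    "sources": {
        "alsdracor": {"num-of-plays": "25", "type": "repository"},
        "bashdracor": {"num-of-plays": "3", "type": "repository"},
    },
}
print(normalize_counts(vebi))
```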

A corpus created from multiple sources:

{'version': 'v1',
 'system': {'id': '2e14f86b-bd8a-4d00-afe4-081b41fdabd1',
  'name': 'my-stable-dracor',
  'timestamp': '2023-07-04T08:47:15.228637'},
 'services': {'api': {'container': '37fbb4d318d1',
   'image': 'dracor/dracor-api:v0.90.1-local',
   'version': '0.90.1-2-g19a3f46-dirty',
   'existdb': '6.0.1'},
  'frontend': {'container': '9517f341646f',
   'image': 'dracor/dracor-frontend:v1.6.0-dirty'},
  'metrics': {'container': 'f1808c0c0750',
   'image': 'dracor/dracor-metrics:v1.2.0'},
  'triplestore': {'container': 'af1800463d8f',
   'image': 'dracor/dracor-fuseki:v1.0.0'}},
 'corpora': {'multi': {'corpusname': 'multi',
   'timestamp': '2023-07-04T08:44:17.789436',
   'num_of_plays': 29,
   'sources': {'tatdracor': {'type': 'repository',
     'commit': '5c71364f39f6533baa3a2e04217fd39e0898c851',
     'url': 'https://github.com/dracor-org/tatdracor',
     'timestamp': '2023-07-04T08:44:31.516675',
     'num_of_plays': 3},
    'bashdracor': {'type': 'repository',
     'commit': 'c16b58ef3726a63c431bb9575b682c165c9c0cbd',
     'url': 'https://github.com/dracor-org/bashdracor',
     'timestamp': '2023-07-04T08:44:54.011271',
     'exclude': {'type': 'slug',
      'ids': ['karim-tashlama-utty', 'khudayberdin-aq-bilettar']},
     'num_of_plays': 1},
    'spandracor': {'type': 'repository',
     'commit': '184ebf975ad9cd674ff37cab44a181fa7ed8d85f',
     'url': 'https://github.com/dracor-org/spandracor',
     'timestamp': '2023-07-04T08:45:21.693915',
     'num_of_plays': 25}}}}}
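
In this example the corpus total (29) equals the sum of the per-source counts (3 + 1 + 25). If the format keeps both numbers, a loader could verify that they agree. A minimal sketch, with `count_mismatch` as a hypothetical helper name:

```python
def count_mismatch(corpus):
    """Return (declared, summed) if the totals disagree, otherwise None."""
    declared = int(corpus["num_of_plays"])
    summed = sum(int(source["num_of_plays"]) for source in corpus["sources"].values())
    return None if declared == summed else (declared, summed)


# The 'multi' corpus entry from the manifest above, reduced to the counts.
multi = {
    "corpusname": "multi",
    "num_of_plays": 29,
    "sources": {
        "tatdracor": {"num_of_plays": 3},
        "bashdracor": {"num_of_plays": 1},
        "spandracor": {"num_of_plays": 25},
    },
}
print(count_mismatch(multi))
```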

This example is taken from: https://github.com/ingoboerner/stable-dracor/blob/develop/test_client.ipynb

Serialized as Docker image labels:

{'com.docker.compose.config-hash': 'f3e7dbfe4bebdf9f284e9ee661814e9b4623d214ea7d49854b4a651aa8e98c7d',
 'com.docker.compose.container-number': '1',
 'com.docker.compose.depends_on': 'fuseki:service_started:false,metrics:service_started:false',
 'com.docker.compose.image': 'sha256:171df59ae0ab650356c45feeabe2a65b63b77c3e9bd7bf362926bfdd78e931f8',
 'com.docker.compose.oneoff': 'False',
 'com.docker.compose.project': 'my-stable-dracor',
 'com.docker.compose.project.config_files': '-',
 'com.docker.compose.project.working_dir': '/Users/ingoboerner/Projekte/dracor/stable-dracor',
 'com.docker.compose.service': 'api',
 'com.docker.compose.version': '2.17.3',
 'org.dracor.stable-dracor.corpora': 'multi',
 'org.dracor.stable-dracor.corpora.multi.corpusname': 'multi',
 'org.dracor.stable-dracor.corpora.multi.num-of-plays': '29',
 'org.dracor.stable-dracor.corpora.multi.sources': 'tatdracor,bashdracor,spandracor',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.commit': 'c16b58ef3726a63c431bb9575b682c165c9c0cbd',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.exclude.ids': 'karim-tashlama-utty,khudayberdin-aq-bilettar',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.exclude.type': 'slug',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.num-of-plays': '1',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.timestamp': '2023-07-04T08:44:54.011271',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.type': 'repository',
 'org.dracor.stable-dracor.corpora.multi.sources.bashdracor.url': 'https://github.com/dracor-org/bashdracor',
 'org.dracor.stable-dracor.corpora.multi.sources.spandracor.commit': '184ebf975ad9cd674ff37cab44a181fa7ed8d85f',
 'org.dracor.stable-dracor.corpora.multi.sources.spandracor.num-of-plays': '25',
 'org.dracor.stable-dracor.corpora.multi.sources.spandracor.timestamp': '2023-07-04T08:45:21.693915',
 'org.dracor.stable-dracor.corpora.multi.sources.spandracor.type': 'repository',
 'org.dracor.stable-dracor.corpora.multi.sources.spandracor.url': 'https://github.com/dracor-org/spandracor',
 'org.dracor.stable-dracor.corpora.multi.sources.tatdracor.commit': '5c71364f39f6533baa3a2e04217fd39e0898c851',
 'org.dracor.stable-dracor.corpora.multi.sources.tatdracor.num-of-plays': '3',
 'org.dracor.stable-dracor.corpora.multi.sources.tatdracor.timestamp': '2023-07-04T08:44:31.516675',
 'org.dracor.stable-dracor.corpora.multi.sources.tatdracor.type': 'repository',
 'org.dracor.stable-dracor.corpora.multi.sources.tatdracor.url': 'https://github.com/dracor-org/tatdracor',
 'org.dracor.stable-dracor.corpora.multi.timestamp': '2023-07-04T08:44:17.789436',
 'org.dracor.stable-dracor.services': 'api,frontend,metrics,triplestore',
 'org.dracor.stable-dracor.services.api.base-image': 'dracor/dracor-api:v0.90.1-local',
 'org.dracor.stable-dracor.services.api.existdb': '6.0.1',
 'org.dracor.stable-dracor.services.api.image': 'dracor/stable-dracor:my_multi_sources_corpus',
 'org.dracor.stable-dracor.services.api.version': '0.90.1-2-g19a3f46-dirty',
 'org.dracor.stable-dracor.services.frontend.image': 'dracor/dracor-frontend:v1.6.0-dirty',
 'org.dracor.stable-dracor.services.metrics.image': 'dracor/dracor-metrics:v1.2.0',
 'org.dracor.stable-dracor.services.triplestore.image': 'dracor/dracor-fuseki:v1.0.0',
 'org.dracor.stable-dracor.system.id': '2e14f86b-bd8a-4d00-afe4-081b41fdabd1',
 'org.dracor.stable-dracor.system.name': 'my-stable-dracor',
 'org.dracor.stable-dracor.system.timestamp': '2023-07-04T08:47:49.637773',
 'org.dracor.stable-dracor.version': 'v1'}
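
In these labels the nested manifest is flattened into dotted keys under the prefix org.dracor.stable-dracor, and the labels corpora, sources and services additionally hold a comma-separated list of their child keys. The following is a minimal sketch of reading such labels back into a nested dict. It is my own reconstruction from the example, not the client's code, and it assumes that corpus and source names contain no dots and that ids contain no commas.

```python
PREFIX = "org.dracor.stable-dracor."


def labels_to_manifest(labels):
    """Rebuild a nested dict from flat 'org.dracor.stable-dracor.*' labels."""
    items = {k[len(PREFIX):]: v for k, v in labels.items() if k.startswith(PREFIX)}
    manifest = {}
    for key, value in items.items():
        # Skip index labels (e.g. 'corpora'), whose value only lists child keys.
        if any(other.startswith(key + ".") for other in items):
            continue
        *parents, leaf = key.split(".")
        node = manifest
        for part in parents:
            node = node.setdefault(part, {})
        # Assumption: only 'ids' holds a comma-separated list among the leaf values.
        node[leaf] = value.split(",") if leaf == "ids" else value
    return manifest


labels = {
    "com.docker.compose.service": "api",
    "org.dracor.stable-dracor.corpora": "multi",
    "org.dracor.stable-dracor.corpora.multi.sources": "bashdracor",
    "org.dracor.stable-dracor.corpora.multi.sources.bashdracor.exclude.ids": "karim-tashlama-utty,khudayberdin-aq-bilettar",
    "org.dracor.stable-dracor.corpora.multi.sources.bashdracor.exclude.type": "slug",
    "org.dracor.stable-dracor.corpora.multi.sources.bashdracor.num-of-plays": "1",
    "org.dracor.stable-dracor.version": "v1",
}
print(labels_to_manifest(labels))
```

All label values are strings, so counts such as num-of-plays come back as strings and would still need to be cast.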