pkiraly/qa-catalogue

Add docker volume for scripts under ./catalogues

Phu2 opened this issue · 9 comments

Phu2 commented

Currently you cannot use the provided scripts (or some own scripts) under the ./catalogues path with Docker.

In docker-compose.yml add

    volumes:
      - ./catalogues:/opt/qa-catalogue/catalogues

Usage eg.:

docker exec -it metadata-qa-marc ./catalogues/test.sh all

@Phu2 I think there are two distinct problem here:

  1. the Docker instance (and the release.zip) does not contain the catalogue script
  2. how to run a custom script

I would like to handle them distinctly. The catalogues/ directory should be part of the zipped version and thus the docker container. But I do not think that the users who would like to use docker should have the full source code available in the host. So I suggest the following:

  1. instead putting your own script to catalogues, put it into some other directory, such as custom-script -- otherwise if the local catalogues contain only your script it would overwrite the already existing directory within the container
  2. mount this local directory with volume
  3. use it as
docker exec -it metadata-qa-marc ./custom-script/test.sh all

Do you agree with this plan?

I'd prefer not using the ./catalogue scripts for your own catalogues because most of these scripts could better be replaced by config files and ./qa-catalogue should be the main invocation point. Having multiple ways to invoce the application is confusing. At the moment you can

  1. call qa-catalogue with options and arguments
  2. call qa-catalogue with environment variables
  3. create a script that bypasses qa-catalogue but calls ./common-script

Calling common-script should be discouraged at at as this is an internal script that may change in the future.

By now only method 1 works with Docker. For method 2 calling docker exec with --env-file should work. We could extend docker/qa-catalogue to pass environment variables this way. Eventually the env-file can act as config file. Another method is to create a script that calls docker/qa-catalogue with your configuration.

@nichtich I agree with the principle to have a single point of entry. Does it imply that the scripts in catalogues should be rewriten to use qa-catalogue? I am sure we should have scripts, because these records the catalogue specific settings. If I understand correctly the env-file should be on the host, not on the container.

Take this example:

docker exec -it metadata-qa-marc --env-file catalogues/gent.sh ./qa-catalogue all

It will not work, if we have only the docker container, but not the source code.

Another problem I just recognize: docker compose up -d does not need any downloaded files (and this is one of the reason to use docker), however ./docker/qa-catalogue needs at least this file to be downloaded. However we do not need any more file, and the docker/ part require to create docker dir, then download the script there - I think this step is unnecessary. It should be clear in the README, such as

# download docker wrapper script
curl https://github.com/pkiraly/qa-catalogue/releases/download/v0.8.0/docker-qa-catalogue \
  --output qa-catalogue

# run the analyses
./qa-catalogue \
  --params "--marcVersion GENT --alephseq" \
  --mask "rug01.export" \
  --catalogue gent \
  all

supposing the ./docker/qa-catalogue file is part of the release file list as docker-qa-catalogue.

My conclusions

  1. we should have "canonical" catalogue specific scripts
  2. we also should provide a way to create custom catalogue specific scripts
  3. these scripts should work both inside and outside docker
  4. I agree to not use common-script directly
  5. we should keep in mind two use cases:
    a. the user has the QA catalogue as it is available in the release zip file
    b. the user uses only docker and if necessary a minimal number of other files (such as the docker wrapper script)
  1. we should have "canonical" catalogue specific scripts

I see them more as examples how to call qa-catalogue for known data sources.

  1. we also should provide a way to create custom catalogue specific scripts

We should not provide a way. People can call ./qa-catalogue or ./docker/qa-catalogue however they like. Sure it can be called via another bash script but also via Perl, Jenkins... The recommended way should be to directly call it.

  1. to 5.

Agreed!

One solution would be:

  1. remove . ./setdir.sh and . ./common-script lines from the catalogues/*.sh scripts
  2. modify qa-catalogue script:
    1. add a new parameter: -f, --config-file
    2. if this parameter is set and the referenced file is existing: run it (before displaying the variables)
  3. run it with
./qa-catalogue --config-file catalogues/gent.sh all

I tested and it works. Moreover it would work inside docker as well if the script is available inside (which can be achieved by applying the suggested volume parameter).

Phu2 commented

First i would like to give some context for this issue. As a first time user of QA catalogue i am trying to figure out how to use this tool. After scanning through the README i decided to give Docker a try. My expectation was that everything would work the same way with Docker. The Usage section doesn't differentiate between the two environments. In fact, the many ways to invoke the tool confuses me a lot. The scripts within the catalogs directory seemed to be a good start. To my surprise they are not available with Docker. That's why i created this issue. Sorry for the long read :)

./qa-catalogue as the single point of entry + config file sounds very good to me 👍

You could create a config directory which is mounted into the container by default. All catalogues/*.sh scripts can be replaced by config files, right?

I've renamed --config-file to --env-file because the same name is use in docker compose. It works with both ./qa-catalogue and ./docker/qa-catalogue. It is also possible to use the same config file to start the Docker container and to run qa-catalogue. I've also removed support of setdir.sh is ./qa-catalogue to have one preferred type of configuration files.

Fine for me.

The following environment variables are set in any of the catalogue files:

$ grep -oh '^[A-Z_]\+' catalogues/*.sh | sort | uniq -c
     19 MARC_DIR
     45 MASK
     45 NAME
      2 SCHEMA
      1 TYPE
     54 TYPE_PARAMS
      1 VERSION

Most files can directly be converted to an .env file by removal of calling setdir.sh and common-script.sh. Possible exceptions are dnb.sh, k10plus_pica.sh, k10plus_pica_grouped.sh. In addition every MARC_DIR=${BASE_INPUT_DIR}/xyz must be changed to INPUT_DIR=xyz.