tbrand/neph

neph.yaml redesign

notramo opened this issue · 34 comments

I think the current neph.yaml format is quite bad. It lacks features that are needed for a powerful build system, and has some limitations.
Issues with current neph.yaml format:

  • users can't have a job named include, because it is used by neph
  • the include command resolves paths relative to the process working directory, not relative to the directory that contains neph.yaml
  • commands are always executed by sh, although some developers may want to use elvish, oh, or zsh to build the project
  • commands are split on newlines, although a single command may span multiple lines (e.g. an sh if/case statement) or contain a multi-line string
  • parsing is not easy (the current implementation is very wrong: it has too many corner cases)

If you think my solution is a good idea, I will implement it. Please comment what you think about everything below. (The parser needs a rewrite even without these changes, but if it will be rewritten, implementing new features will be easier during the rewrite.)

My idea to solve this:

The neph.yaml file should contain two YAML documents: the first would be the config, the second would contain the jobs.

The config should contain:

  • Neph version. This could be used to inform people about syntax changes, or to convert files to the new format automatically.
  • Included files. This would do the same thing that the include keyword does currently. The files would be relative to the neph.yaml file.
  • Command interpreter. This should be an Array of String containing the interpreter and its arguments. The %c sequence would be replaced with the command to run. If the arguments don't include %c, Neph writes the command to the interpreter's STDIN. Examples:
    interpreter:
      - "elvish"
      - "-c"
      - "%c"

    # No %c given: the command is written to the interpreter's STDIN
    interpreter:
      - "zsh"
  • The name of the default job. This could be used to give meaningful names to the default job, instead of main.
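
A minimal sketch of the %c substitution described above, written in Python rather than Crystal only to keep it short. The function names (build_invocation, run_command) are hypothetical, and replacing %c anywhere inside an argument is an assumption; the proposal does not say whether %c has to be a whole argument.

```python
import subprocess
from typing import List, Optional, Tuple

PLACEHOLDER = "%c"  # substitution marker named in the proposal


def build_invocation(interpreter: List[str], command: str) -> Tuple[List[str], Optional[str]]:
    """Return (argv, stdin_text) for running one command.

    If any interpreter argument contains %c, it is replaced with the command
    and nothing is sent on STDIN. Otherwise argv is used unchanged and the
    command is returned as the text to write to the interpreter's STDIN.
    """
    if not interpreter:
        raise ValueError("interpreter must contain at least the program name")
    if any(PLACEHOLDER in arg for arg in interpreter):
        argv = [arg.replace(PLACEHOLDER, command) for arg in interpreter]
        return argv, None
    return list(interpreter), command


def run_command(interpreter: List[str], command: str) -> int:
    """Run one command through the interpreter and return its exit status."""
    argv, stdin_text = build_invocation(interpreter, command)
    completed = subprocess.run(argv, input=stdin_text, text=True)
    return completed.returncode
```

With the two example configs, build_invocation(["elvish", "-c", "%c"], "echo hi") puts the command into argv, while build_invocation(["zsh"], "echo hi") leaves argv alone and hands the command back as STDIN text.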

Changes to the document containing the jobs:

  • The command key (which is currently a multi line String) should be an Array of String. The interpreter is invoked for each element.
  • The value of the depends_on key should be an Array instead of the current String|Array
    (if the job has only one dependency, it isn't difficult to put a - before its name).
    If the output of a job has to be saved to an environment variable, it should be written as follows:
    # Current
    depends_on:
      - job: sample_job
        env: VARNAME
      - other_job
      - job: another_job
        env: OUTPUT
    
    # New
    depends_on:
      - sample_job: VARNAME
      - other_job
      - another_job: OUTPUT
  • A job should have a description key (String), which would contain a short description of the job. If an option to list available jobs (--list) is added to neph, this could be used to construct output similar to a help message.
  • A job should have a private key (Bool). If a job is private, it can't be run with neph [job_name], only from depends_on fields (inspired by the private keyword in Crystal). The purpose of this is to keep the job listing output clean, and to avoid accidentally launching the wrong job. The description and private properties should conflict, because private jobs won't show up in the output of --list.
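
To make the proposed job keys concrete, here is a small Python sketch that works on already-parsed data (plain lists and dicts). The helper names normalize_depends_on and listable_jobs are made up for illustration, and treating a one-key mapping as job: ENV_VAR is my reading of the "New" example above.

```python
from typing import Any, Dict, List, Optional, Tuple


def normalize_depends_on(entries: List[Any]) -> List[Tuple[str, Optional[str]]]:
    """Turn the proposed depends_on array into (job_name, env_var) pairs."""
    result: List[Tuple[str, Optional[str]]] = []
    for entry in entries:
        if isinstance(entry, str):
            # "- other_job": plain dependency, output is not captured
            result.append((entry, None))
        elif isinstance(entry, dict) and len(entry) == 1:
            # "- sample_job: VARNAME": output is saved to VARNAME
            (job, env), = entry.items()
            if not isinstance(job, str) or not isinstance(env, str):
                raise ValueError("depends_on entries must map a job name to a variable name")
            result.append((job, env))
        else:
            raise ValueError(f"invalid depends_on entry: {entry!r}")
    return result


def listable_jobs(jobs: Dict[str, Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (name, description) for every job a --list option would show."""
    listing: List[Tuple[str, str]] = []
    for name, job in jobs.items():
        private = job.get("private", False)
        if private and "description" in job:
            raise ValueError(f"job {name!r}: description and private conflict")
        if not private:
            listing.append((name, job.get("description", "")))
    return listing
```

Private jobs are simply left out of the listing, and a private job that also has a description is rejected, as suggested in the last bullet.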

Wouldn't it be interesting to think about other formats, like TOML for example? Maybe things would get simpler.

ref:
crystal-lang/shards#25
https://gohugohq.com/howto/toml-json-yaml-comparison/

neph.yaml files are mostly written by hand rather than generated by programs.
YAML is designed to be easily writeable/readable by humans.

I don't want to add JSON because it's hard to write by hand, and strings need to be escaped (which would cause very wrong results: the shell commands also need to be escaped, and some tools even evaluate escape sequences in command line arguments). It also doesn't support comments, and it is intended to be a data-interchange format, not a config format.

Maybe I will add TOML, and support both YAML and TOML.
The main problem is with the internal implementation and the structure of the config file, not the serialization language.

Sorry for the late reply.
OK, I mostly agree, but I have a question about this:

The neph.yaml file should contain two YAML documents: the first would be the config, the second would contain the jobs.

Will the configuration file be split? I think two separate configuration files would be bad from a UX point of view.
If you just mean that the configuration file contains two parts (and is still one file), that's OK.

In YAML, a --- on its own line splits the file into two separate documents (it is still one file):

# This part contains the config.
neph-version: "0.2"
include: "../other_file.yaml"
interpreter:
  - "elvish"
  - "-c"
  - "%c"
---
# This part contains the job descriptions.
main:
  depends_on: other_job
other_job:
  command:
    "true"

Multiple documents in a single file are parsed with YAML.parse_all, which returns an array of parsed YAML documents.
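
As a rough illustration of the two-part file, the Python sketch below splits the raw text on lines that consist only of ---. This is a deliberate simplification and not a YAML parser (it ignores a leading ---, the ... end marker, and --- inside block scalars); a real implementation would rely on the YAML library, as described above. The function names are hypothetical.

```python
from typing import List, Tuple


def split_documents(text: str) -> List[str]:
    """Split text on lines consisting only of '---' (simplified, not real YAML)."""
    documents: List[List[str]] = [[]]
    for line in text.splitlines():
        if line.rstrip() == "---":
            documents.append([])  # separator: start collecting the next document
        else:
            documents[-1].append(line)
    return ["\n".join(lines) for lines in documents]


def split_build_file(text: str) -> Tuple[str, str]:
    """Return (config_text, jobs_text) and reject any other document count."""
    docs = split_documents(text)
    if len(docs) != 2:
        raise ValueError(f"expected 2 documents (config, jobs), found {len(docs)}")
    return docs[0], docs[1]
```

Requiring exactly two documents is an assumption; the thread does not say what should happen when the config part is left out.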

However, TOML doesn't seem likely to support document separators, so we may have to store the TOML config in two separate files, or drop TOML support. See toml-lang/toml#511

I suggest not worrying about TOML support. YAML is more than adequate.

@notramo I am not convinced by this singular opinion at all. I love YAML because 1) it has comments, 2) it's hierarchical, 3) it supports multiple documents in one, 4) it supports custom object types 5) it supports inheritance. Just from a purely visual standpoint, there is nothing that comes close to YAML in terms of expressiveness.

And yes, if security/safety is your concern, then please use safe_load. However, in most cases it's a moot point: if an attacker is able to write to your configuration files, it doesn't really matter that they can execute arbitrary statements. They most likely already have shell access, and can therefore run whatever they want.

The only time this security issue would be a concern is if the YAML travels across the network, as a fetched configuration for example. In this case using safe_load is paramount. Better yet, use JSON as a network data format, and YAML as a local configuration format.

Thanks.

@kigster TOML also has comments, and is hierarchical. What do you mean by custom object types and inheritance?
The main problem with YAML from Neph's point of view is implicit typing.
I think the Crystal implementation doesn't evaluate anything in YAML documents, so that's not a problem. Even if it did, it shouldn't be a consideration for Neph, because an attacker could just as well write malicious commands into the job commands.
The config file structure proposed for YAML would be ugly and hard to write in TOML syntax, so I have designed a config structure for TOML that could even work around the fact that TOML currently doesn't support multiple documents in one file. However, I would like to see this feature accepted in TOML, and maybe I will then redesign the Neph TOML config to use it.

@notramo

  1. Custom Object Types

At least in Ruby, you can serialize pure ruby objects into YAML:

--- !ruby/object:C
a_object: &id001 !ruby/object:A
  number: 5
  string: hello world
b_object: !ruby/object:B
  a_object: *id001
  number: 7

  2. Inheritance

YAML supports inheriting elements and overriding inherited elements partially:

server_defaults: &server_defaults
  ip: 192.168.1.5
  port: 2000

user_defaults: &user_defaults
  name: root
  password: root

database: &default
  server:
    <<: *server_defaults
  db_name: test
  user: 
    <<: *user_defaults

foo_database:
  <<: *default
  server:
    <<: *server_defaults
    port: 2001
  db_name: foo
  user:
    <<: *user_defaults
    password: foo_root

@kigster Crystal supports YAML.mapping(), which is similar to custom object types, but Neph shouldn't use it, because this solution isn't able to generate usable Neph error messages if the document is wrong.

I'm keen on using YAML (in RoR) because of mappings. I don't know TOML, but I know that gitlab-ci makes heavy use of mappings (in its YAML config), so it's probably not such a bad idea.

The different structures of the serialization formats require entirely separate parser implementations.

YAML config will be implemented first, then when it is complete, I will start implementing TOML config.

I didn't write that TOML will replace YAML in Neph. I don't understand why anybody would argue against TOML. What's wrong with supporting multiple serialization formats in Neph?

This issue was opened to discuss new ideas for the config file redesign. If you can't give better reasons against multiple serialization formats than a personal preference for a particular format, please don't comment about it. I don't want to see comments like: I love serialization format X, and I don't care (or maybe don't even know) about Y, so please don't implement Y. Why not implement it if you don't care about it? Are you worried that the implementation of your favourite format won't get enough development time because of the other formats?

By the way, I recently found SDLang, which looks cool to me. It is similar in structure to XML, but has a human readable/writable syntax and some additional cool features. Maybe I will eventually add it along with YAML and TOML.

I think GNU Makefile is ❤️ because of simplicity (I'm however not keen on hard tabs).

For me, the main idea of neph is parallelism (spreading work across all cores thanks to Crystal).

YAML is cool, but I admit COULD be complex at a certain level.

Whatever formalism is chosen, what matters to me is simplicity and emitting (creating the configuration from another language). In this respect https://github.com/vstakhov/libucl could also be relevant, but a formalism supported by the core language SHOULD be used.

@notramo sorry if I misled you; of course you are free to implement whatever format you like and however many formats you like. I somehow read the above as you switching from YAML to TOML, but now I realize that’s not the case. I think both formats are valid and powerful.

@kigster No problem. :)
@waghanza By the way, Crystal currently doesn't have parallelism, but it has a very good non-blocking Process#wait, which makes it possible to launch multiple concurrent child processes (so these can utilize multiple cores) and wait for them in the same thread that is used for displaying the progress on the terminal.

I will announce in this thread when the branch is pushed to the repo.

@tbrand @kalicki @kigster @waghanza What do you think about removing the ability to include files?
I think it isn't very useful, and it introduces some problems if the included file can also have a config part.

  • The interpreter is defined in the config part of the document. What should happen if the included file doesn't specify the interpreter, but the main file does? Should the included file inherit it from the main file, or use the default interpreter (sh)?
  • It will be possible to define environment variables in the config part that apply to the whole file. Should this setting also apply to the included file? What should happen if the included file also specifies an environment? Use only the environment specified in the included file, or inherit the environment of the main file as well?
  • There will be a default_job key in the config part, which will define the main job (default: main). What should happen if the main file doesn't specify the default job, but the included file has this key? Set the default job to this? What if multiple files are included, and they all have this key?

Should we disable these things in included files (only having one part, which contains the jobs)?
I think, this feature could be completely removed. Can anyone provide a use case, where this feature would be useful? I would like to remove it, because if the software is not bloated with unneeded functions, the code will be cleaner, and learning the usage (how to write Neph config files) will be easier also.

In my point of view

  • interpreter SHOULD be unique -> defined in the main document
    • the default interpreter is the OS default, but for compatibility it is useful to be able to define another
    • defining a (non-default) interpreter introduces a dependency
    • having several interpreters leads to several languages -> is that needed for one project?
  • environment variables SHOULD be defined in one place
    • otherwise it could get messy (no conventions)
    • there could be variable overload
  • default job SHOULD have a single name, main is OK for me
    • it is confusing / unfriendly to have several words for the same thing

@waghanza So do you think the config part should be disabled in the included files?

What do you mean by variable overload?
What do you mean by defining variables in one place? E.g. only in the config part, or for each job? I think it isn't comfortable. The variables defined in the config part would apply to all of the jobs. The job specific variables would be defined for a single job.

I think the advantage of specifying the default job is that you can give meaningful name to this job also. (e.g. binary or package instead of main)
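
The two variable scopes mentioned above (config-wide and per-job) could be combined roughly as in this Python sketch. The precedence (job values override config values, which override the inherited process environment) is an assumption, not something the thread has decided, and job_environment is a hypothetical helper name.

```python
from typing import Dict, Mapping, Optional


def job_environment(
    process_env: Mapping[str, str],
    config_env: Optional[Mapping[str, str]] = None,
    job_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the environment for one job; later scopes override earlier ones."""
    env = dict(process_env)       # inherited from the process that runs neph
    env.update(config_env or {})  # config part: applies to every job
    env.update(job_env or {})     # job-specific: applies to this job only
    return env
```

For example, with MODE=release in the config part and CC=clang on a single job, that job sees both values, while every other job sees only MODE=release.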

@notramo

So do you think the config part should be disabled in the included files?

YES, for me config SHOULD be either empty (default OS capability) or defined, but in 1 place -> so as not to get confused by several configs (interpreter and so on)

However, separating jobs so as not to have one huge file COULD be OK (for me)

What do you mean by variable overload?

What happens if var A is defined in the main file and also defined in another config file? Which var should be used?

For me, neph is a make, but done in a modern way (e.g. it uses most of the computer's resources by default)
⚠️ I am exaggerating

neph variables SHOULD be global. If we want to define function variables ... why not create a specific app for this purpose?

What do you mean by creating a specific app?
Setting the environment variables can be done in the shell commands of a job. But it can lead to copy-paste programming if a job has e.g. 15 commands and each command needs these variables.

I mean that the job processing tool does not aim to replace a specific app 😜 (e.g. one coded in whatever language)

For me the main advantages here are:

  • (a sort of) parallelism
  • a modern configuration formalism

I think setting environment variables for a single job isn't replacing a specific app.

Maybe removing include isn't a good idea, but configuration will be disabled for included files.
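
A short Python sketch of what that could mean in practice: include paths are resolved against the directory of the including file (one of the issues listed at the top), and an included file is accepted only if it has no config part. The names resolve_include and jobs_from_included are hypothetical, and treating "more than one document" as "has a config part" is an assumption.

```python
import posixpath
from typing import Any, List


def resolve_include(main_file: str, include: str) -> str:
    """Resolve an include path relative to the directory of the main build file."""
    base_dir = posixpath.dirname(main_file)
    # normpath collapses ".." segments without touching the filesystem
    return posixpath.normpath(posixpath.join(base_dir, include))


def jobs_from_included(documents: List[Any], path: str) -> Any:
    """Accept an included file only if it consists of a single jobs document."""
    if len(documents) != 1:
        raise ValueError(f"{path}: included files may not contain a config part")
    return documents[0]
```

resolve_include("/proj/build/neph.yaml", "../other_file.yaml") gives /proj/other_file.yaml regardless of the directory neph was started from.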

I think setting environment variables for a single job isn't replacing a specific app

I think using a naming convention is a good way to define variables for each job:

  • JOB1_... = ....
  • JOB2_... = ....
  • JOB3_... = ....

to avoid a messy configuration 😛

Maybe removing include isn't a good idea, but configuration will be disabled for included files

Absolutely OK, configuration SHOULD be unique, but defining jobs (only) in several files COULD help in terms of readability

I think something like the Caddyfile syntax could be well organized, but would it have to be created from scratch?
E.g. https://caddyserver.com/docs/caddyfile

I think that, instead of adding both TOML and YAML, we could work out together which option gives us fewer problems and an easier implementation.

Each current problem scenario has to be isolated to see what a possible solution looks like in both formats; otherwise I suggest a dedicated format like the one Caddy Server uses.

I think SDLang would be the serialization language that is able to produce the most flexible config format, but there isn't a Crystal parser for it currently. I am now implementing the YAML config.

One other thing to discuss:
Should the sub jobs be repeated when they are required by multiple jobs?
Here is an example neph.yaml file (current format):

main:
  depends_on:
    - first
    - second

first:
  depends_on:
    - repeated_job

second:
  depends_on:
    - repeated_job

repeated_job:
  command: "echo line >> text.txt"

The text.txt file will contain two lines, because repeated_job runs once for first and once for second.
Should this behavior remain in the new version, or should a job that is required by multiple jobs be launched only once?
What about adding a repeatable option for jobs? With this solution one file can contain both repeated and non-repeated sub jobs (my preferred solution).
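
The difference between the two behaviors can be sketched in Python as a depth-first walk over depends_on. The repeatable key is the option proposed above; the traversal itself and the name execution_order are only an illustration, not how Neph schedules jobs.

```python
from typing import Dict, List, Set, Tuple


def execution_order(jobs: Dict[str, dict], target: str) -> List[str]:
    """Depth-first order in which jobs would run for `target`.

    A job runs after its dependencies. A job marked repeatable runs every
    time it is required; any other job runs at most once.
    """
    order: List[str] = []
    done: Set[str] = set()

    def visit(name: str, stack: Tuple[str, ...]) -> None:
        if name in stack:
            raise ValueError("dependency cycle: " + " -> ".join(stack + (name,)))
        job = jobs[name]
        if name in done and not job.get("repeatable", False):
            return  # already ran once and is not allowed to repeat
        for dep in job.get("depends_on", []):
            visit(dep, stack + (name,))
        order.append(name)
        done.add(name)

    visit(target, ())
    return order
```

For the example file, repeated_job appears once in the order by default and twice when it is marked repeatable: true, which corresponds to one or two lines in text.txt.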

@tbrand
I have pushed a commit to the config_redesign branch. It is the first step in rewriting the parser. Currently, only the parser part is done, it isn't able to run jobs.
This isn't the best way to implement the parser, but I was unfamiliar with parsers when I started rewriting this part of the program. It contains wrong solutions (but not as many as the previous version). However, it is robust (a lot more than the previous one), and outputs quite usable error messages in pretty much every situation where there is an error in the neph.yaml file (e.g. current: "There is an error in the main file: Job names have to be strings.", previous: "Can't cast Int64 to String").
I know it was a long time ago when I started rewriting, and the progress is quite slow, but I will spend more time with developing it.

Cool. I'll check.

Is it possible to open the PR with [WIP] or [DNM] tag?
I would like to see the diff.

I skimmed the commit and saw that you don't use YAML.mapping for the config.
Can we use YAML.mapping to make the parser robust?

You can see the diff without a PR. See the commit details. (Basically every file in src/ has been deleted and rewritten from scratch; other files aren't modified.)
We can't use YAML.mapping, because its error messages are unusable when writing a build file. A developer who builds their project with neph shouldn't have to know its internal implementation to be able to debug the build file. Mapping exceptions don't give users usable information; instead they give information about the implementation.
The current implementation is very robust. The problem is that, although it provides a lot more detail about build file syntax errors than the previous version, I want even more: it explains what the problem is, but provides very few details about the location of the error (e.g. that it is in the main file, in the definition of the job1 job). It is possible to debug the build file without knowing anything about the implementation, but it sometimes requires using grep, so it's not as user friendly as I want it to be. Another problem is that it's more difficult to hack on, and it is difficult to understand how it works. (I will write a hacking guide for the project after the rewritten version becomes ready to run jobs with basic features.) It uses YAML.parse for parsing the files, then does type checks on the parsed data structure, then constructs the jobs. The problem with this solution is that the data structure returned by YAML.parse doesn't contain information about the location of the parsed objects (filename, line, column). Maybe the parser should be integrated with YAML::Nodes, but I need to learn how the YAML parser constructs the objects, so that will be done later. The current implementation is easily extensible, and it is possible to rewrite the parser again later without modifying other files (job structure, interface, etc.).
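
The "parse first, then type-check the parsed structure" approach can be illustrated with a short Python sketch. It is not the code from the config_redesign branch: BuildFileError and validate_jobs are hypothetical names, and only the first message is taken from the example quoted earlier in the thread; the other messages are made up in the same style.

```python
from typing import Any, Dict


class BuildFileError(Exception):
    """Error aimed at the build file author, not at the Neph developer."""


def validate_jobs(parsed: Any, file_label: str = "main file") -> Dict[str, dict]:
    """Type-check an already-parsed jobs document and report errors with context."""
    prefix = f"There is an error in the {file_label}: "
    if not isinstance(parsed, dict):
        raise BuildFileError(prefix + "The jobs document has to be a mapping.")
    for name, body in parsed.items():
        if not isinstance(name, str):
            raise BuildFileError(prefix + "Job names have to be strings.")
        where = f"{prefix}in the definition of the {name} job: "
        if not isinstance(body, dict):
            raise BuildFileError(where + "a job has to be a mapping.")
        command = body.get("command", [])
        if not isinstance(command, list) or not all(isinstance(c, str) for c in command):
            raise BuildFileError(where + "command has to be an array of strings.")
    return parsed
```

Because the checks run on plain parsed values, the messages can name the file and the job, but not the line or column, which is the limitation described above.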

@tbrand It is now able to run jobs.
If a command fails, then the job is stopped immediately. All other jobs continue to run, but new jobs aren't started.
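
A simplified, sequential Python sketch of that failure policy. It only models "stop the job at the first failing command" and "don't start new jobs after a failure"; jobs that are already running in parallel are not modelled here. The function names and the run callback (which takes a command and returns its exit status) are assumptions.

```python
from typing import Callable, Dict, List, Tuple


def run_job(commands: List[str], run: Callable[[str], int]) -> bool:
    """Run commands in order and stop at the first non-zero exit status."""
    for command in commands:
        if run(command) != 0:
            return False  # remaining commands of this job are skipped
    return True


def run_all(jobs_in_order: List[Tuple[str, List[str]]], run: Callable[[str], int]) -> Dict[str, str]:
    """Run jobs one after another; after a failure, no further job is started."""
    status: Dict[str, str] = {}
    failed = False
    for name, commands in jobs_in_order:
        if failed:
            status[name] = "not started"
            continue
        ok = run_job(commands, run)
        status[name] = "ok" if ok else "failed"
        failed = failed or not ok
    return status
```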

Cool! 👍

@tbrand The next thing will be the CI output mode. After finishing that, it will be production ready, and it can be merged into master.

What do you think, what information should the CI mode monitor?

  • a job is started (and is waiting for sub jobs)
  • a job started its first command after waiting for sub jobs
  • a command is started/finished
  • a job is finished
  • should it have timestamps?

Maybe all of this should be printed? I think it is too much, however the slit pager is designed for these cases.

@notramo

That's awesome! 👍
I think we should print all of them in the first release. After that we can add a log level option to neph,
because removing logs according to the log level is much easier than adding new logs. (So it should be verbose in the first release.)

What do you think?
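
One possible shape for this, sketched in Python: every event from the list above is always generated, and a verbosity level only decides which ones are printed. The event names, the level numbers, and the line format are all made up for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical verbosity levels: an event is printed when its level <= verbosity.
EVENT_LEVELS = {
    "job_finished": 1,
    "job_waiting": 2,       # job started and is waiting for sub jobs
    "job_running": 2,       # job started its first command
    "command_started": 3,
    "command_finished": 3,
}


def format_event(kind: str, job: str, command: Optional[str] = None,
                 when: Optional[datetime] = None) -> str:
    """Format one CI log line with a UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    line = f"{when.isoformat(timespec='seconds')} {kind} job={job}"
    if command is not None:
        line += f" command={command!r}"
    return line


def should_print(kind: str, verbosity: int) -> bool:
    """Decide whether an event is shown at the given verbosity level."""
    return EVENT_LEVELS[kind] <= verbosity
```

A first release that prints everything would simply run with the highest verbosity; lowering the level later only changes should_print, not the places where events are emitted.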

@tbrand
OK. Maybe the starting/finishing of each command is a little bit too much info, but as you wrote, it should be easily configurable with a verbosity level option.
I will have less free time next week, but maybe I can finish it during this period.

Build file format documentation is needed for the release. What should I write first, wiki or example build file?

I will open a PR in the which_is_the_fastest repo with the updated neph.yaml when the new release is out. If you know of other public projects using Neph, please list them here, so I can update their build files too.