multiformats/go-multihash

Generated go constants for multicodecs

Closed this issue ยท 16 comments

Hi everyone,

I saw multiple multicodec constant definitions in e.g. multiformats/go-multihash/multihash.go#L38 and multiformats/go-cid/cid.go#L52. Furthermore the generation of the blake/skein mappings are not obvious.

This motivated me to play around and try to automate the generation based on the "canonical" multicodecs table. The outcome can be found here:

https://github.com/dennis-tra/go-multicodec

The repo contains the multiformats/multicodec repo as a submodule. Every night a GitHub-Action updates the HEAD of the submodule to the most recent master commit, runs the constants generator, commits possible changes and creates a pull request. It will update the same pull request if subsequent runs find different changes and it won't create a pull request if no changes were detected (GitHub-Action: peter-evans/create-pull-request@v3).

Moving the constants to another repo would obviously be a breaking change for others who depend on this go-multihash repo so I see there are strong forces against a proposal like this.

But anyways, I just wanted to leave this here and ask if this would be something that you could benefit from?

rvagg commented

this is something @mvdan was considering doing and maybe has thoughts on this implementation

mvdan commented

Indeed, I've had an implementation in mind for a few weeks :) I started implementing it over the weekend, and I plan to wrap it up and push it later this week. The summary is that the API would be smaller and there would overall be less code; see multiformats/multicodec#199.

Oh I missed that issue multiformats/multicodec#199 - but I think we had a similar thought process :) Regarding your conclusion:

So it seems like the general recommendation should be to just expose a series of named constants for each of the codes. For example, in Go, we could simply code generate: [ ... ]

This is exactly what I've done in dennis-tra/go-multicodec/const.go. The other two files dennis-tra/go-multicodec/codecs.go and dennis-tra/go-multicodec/tags.go are just there to show other possible things one could do with such a generator approach.

While I think codecs.go makes sense to keep, the tags.go file may be superfluous. That's also speculated in multiformats/multicodec#199:

Similarly, categorizing the names or codes is not a good idea. The table has a tag field, but it doesn't mean that each of the codes fits neatly into a single category which will remain stable over time

If the tags.go file wasn't generated, the API surface would consist of the list of constants and two maps string -> codec and codec -> string.

I'm curious how you approached this and looking forward to see your solution :) I'd appreciate if you left a note here or tagged me so it won't slip through.

Edit: I just removed the tags.go generation -> https://github.com/dennis-tra/go-multicodec/tree/no-tags.

mvdan commented

Would you be up for adapting your implementation to the leaner approach I was thinking, so we can reuse your work? The bigger changes would be:

  1. Remove the git submodule, since it's a bit overkill. An HTTPS fetch of the latest CSV is plenty.

  2. You can simply use stringer to stringify constants, no need to roll out your own implementation. Plus, stringer generates better code.

  3. The constants look like JAVA_NAMES. For example, TORRENT_FILE should just be TorrentFile. This will likely require teaching the generator what words are acronyms or initialisms, like CID or DAG, which should not be spelled like Cid or Dag.

  4. I would remove the string->Codec mapping for now, because it's unclear if it's truly needed. If a package wants to understand some multicodec flag or argument, they should probably implement them too, at which point writing a string comparison is trivial. And if we want to add this "reverse String method" in the future, we should just do a Set, to implement flag.Value.

  5. I don't think we should avoid generating deprecated constants. Go has had a standard way to mark declarations as deprecated, so we could just do that instead.

I also wonder if the code generator is a tad over-engineered. You have two directories and five files, between Go code and templates. After the changes above, I bet you could end up with a rather simple generator in under 100 lines of Go in a single file.

rvagg commented

like CID or DAG, which should not be spelled like Cid or Dag

This might not be necessary, there's a ton of prior art of just using upper camel case regardless of acronyms, look no further than https://godoc.org/github.com/ipfs/go-cid for lots of examples. Up to you both but I doubt people will mind if it's not strictly correct. If you end up trying to be correct you could be up for a lot of work figuring out if each new case should be capitalised or not -- are you going to go with BLAKE2b rather than what an upper camel might give you since that would be technically correct? You'll end up doing a lot of teaching.

mvdan commented

Maybe. Standard CamelCase would be better than JAVA_CASE anyway :)

Would you be up for adapting your implementation to the leaner approach I was thinking, so we can reuse your work? The bigger changes would be:

Sure :)

  1. Remove the git submodule, since it's a bit overkill. An HTTPS fetch of the latest CSV is plenty.

While I'm not the biggest fan of submodules either I see some benefits in using it. Imho there is some added value in knowing which commit in the multicodecs repo induced the change to the list of constants. Just using GET to fetch the table makes the generator result unreproducable for older commits. Having the submodule clearly links the dependency between these repos.

  1. You can simply use stringer to stringify constants, no need to roll out your own implementation. Plus, stringer generates better code.

Initially I thought this is good idea, but Stringer will generate the literal constant variable names. Depending on whether we'll use CamelCase or SCREAMING_SNAKE_CASE the String() method will return just that. My understanding was that the name column of the table should be the corresponding string value for each constant.

  1. The constants look like JAVA_NAMES. For example, TORRENT_FILE should just be TorrentFile. This will likely require teaching the generator what words are acronyms or initialisms, like CID or DAG, which should not be spelled like Cid or Dag.

At first I haven't had the Java like names but the CamelCase ones (see this commit). Afaik this is also the recommended/idiomatic way of naming constants in Go (found the source). That's why I went for this approach first. For the camel case conversion from the name values I used the iancoleman/strcase package. However I ran into exactly the problems you @mvdan and @rvagg mentioned. The generator would need to learn acronyms to satisfy the initialisms guideline.

Having the additional dependency of strcase (I actually just copied the toCamel method), a potentially always out of date list of acronyms (auto-update wouldn't necessarily work if a new acronym were introduced) and the multihash package also using JAVA_CASE were three arguments for using the Java like names.

However I just saw that the initialisms guideline also says:

Code generated by the protocol buffer compiler is exempt from this rule. Human-written code is held to a higher standard than machine-written code.

So I'd suggest doing it similarly to the go-cid package and reverting my aforementioned commit. This would result in "blindly" converting the name to camel case without considering acronyms/initialisms.

  1. I would remove the string->Codec mapping for now, because it's unclear if it's truly needed. If a package wants to understand some multicodec flag or argument, they should probably implement them too, at which point writing a string comparison is trivial. And if we want to add this "reverse String method" in the future, we should just do a Set, to implement flag.Value.

๐Ÿ‘

  1. I don't think we should avoid generating deprecated constants. Go has had a standard way to mark declarations as deprecated, so we could just do that instead.

The generator is already generating deprecated constants and prefixes the doc string with // Deprecated: - is this the standard way you thought of?

I also wonder if the code generator is a tad over-engineered. You have two directories and five files, between Go code and templates. After the changes above, I bet you could end up with a rather simple generator in under 100 lines of Go in a single file.

That will likely be true :) thanks for your input.


EDIT: I just updated the no-tags branch and now some constants got names like Skein1024616 and PoseidonBls12381A2Fc1Sc which is hardly readable compared to POSEIDON_BLS12_381_A2_FC1_SC and SKEIN1024_616.

mvdan commented

unreproducable for older commits. Having the submodule clearly links the dependency between these repos.

You can't have truly reproducible unless you simply vendor/copy the table into the git repo. A git clone can fail in the future.

If you want to know what commit was used in a past go generate call, you could always include a commit hash in a comment in the generated code, though I'm not sure in what specific case this would be useful.

My understanding was that the name column of the table should be the corresponding string value for each constant.

See stringer -linecomment.

This would result in "blindly" converting the name to camel case without considering acronyms/initialisms.

Sounds good to me.

The generator is already generating deprecated constants and prefixes the doc string

Ah, gotcha, that sounds good. I misread the generator source.

some constants got names like Skein1024616 and PoseidonBls12381A2Fc1Sc which is hardly readable compared to POSEIDON_BLS12_381_A2_FC1_SC and SKEIN1024_616.

Here's a suggested heuristic: leave underscores untouched, and any dash between two digits must turn into an underscore. That way, bls12_381 turns into Bls122_381, and skein1024-616 turns into Skein1024_616. It won't be perfect, but it will be automatic and hopefully not require special cases.

You can't have truly reproducible unless you simply vendor/copy the table into the git repo. A git clone can fail in the future.

If you want to know what commit was used in a past go generate call, you could always include a commit hash in a comment in the generated code, though I'm not sure in what specific case this would be useful.

I just updated the repository and removed the submodule.

See stringer -linecomment.

Great, I didn't know this was possible ๐Ÿ‘

Here's a suggested heuristic: leave underscores untouched, and any dash between two digits must turn into an underscore. That way, bls12_381 turns into Bls122_381, and skein1024-616 turns into Skein1024_616. It won't be perfect, but it will be automatic and hopefully not require special cases.

Just tried that and it LGTM ๐Ÿ‘


That's the current state:
https://github.com/dennis-tra/go-multicodec

Hi @mvdan and @rvagg, have you had a chance to look at the changes?

mvdan commented

Yes; it's just not at the top of my TODO list :)

Are you still okay with transferring this code over to https://github.com/multiformats/go-multicodec? My plan is to remove all existing code there, keep the LICENSE and some parts of the README, add the new implementation, and tag v0.2.0 when done. I'd like the new implementation to be a single squashed commit, just for the sake of keeping the history clean.

You should remain the author of that single commit, since you wrote the software anyway. It would be best if you could squash all your work in a single commit with a good commit message explaining all that you've done, so that I can then cherry-pick that.

The only other detail that comes to mind is having me listed as the maintainer of the repository/module, which I think is fair given that I'm employed to take that kind of responsibility :) You would of course be welcome to open issues and contribute PRs, and I would review them.

As for the implementation, here are some more nits to polish the code a little more:

  • Use gofmt -w codec.go instead of go fmt ./..., since it's simpler and does less work.
  • I still think it would be best to join all of gen/ into gen.go, with the template being a multiline raw string literal. You can add a // +build ignore to ensure it's not built as part of the package, and run it via go run gen.go.
  • The constant godocs should comply with the standard (start with the name and end with a period) and be human-readable. For example, instead of // multihash: raw binary, you could do // Identity is a codec with tag "multihash" and described by: raw binary.. The description part can be omitted if the csv entry has none.
  • I still see some constant names which could lose their underscores; for example, Bitcoin_Tx could be BitcoinTx, and Holochain_Key_V1 could be HolochainKeyV1.

This is all stuff I could do via a PR after we've moved the code to the main repository, though, so it's up to you if you want to do this yourself now or leave it to me to do later.

Yes; it's just not at the top of my TODO list :)

Sure thing, I had your fast replies from last week as a reference point ;)

Are you still okay with transferring this code over to https://github.com/multiformats/go-multicodec? My plan is to remove all existing code there, keep the LICENSE and some parts of the README, add the new implementation, and tag v0.2.0 when done. I'd like the new implementation to be a single squashed commit, just for the sake of keeping the history clean.

You should remain the author of that single commit, since you wrote the software anyway. It would be best if you could squash all your work in a single commit with a good commit message explaining all that you've done, so that I can then cherry-pick that.

The only other detail that comes to mind is having me listed as the maintainer of the repository/module, which I think is fair given that I'm employed to take that kind of responsibility :) You would of course be welcome to open issues and contribute PRs, and I would review them.

The work I've done here is no rocket science but still having my name somewhere would be nice. Your plan captures that :) ๐Ÿ‘

As for the implementation, here are some more nits to polish the code a little more:

  • Use gofmt -w codec.go instead of go fmt ./..., since it's simpler and does less work.
  • I still think it would be best to join all of gen/ into gen.go, with the template being a multiline raw string literal. You can add a // +build ignore to ensure it's not built as part of the package, and run it via go run gen.go.
  • The constant godocs should comply with the standard (start with the name and end with a period) and be human-readable. For example, instead of // multihash: raw binary, you could do // Identity is a codec with tag "multihash" and described by: raw binary.. The description part can be omitted if the csv entry has none.
  • I still see some constant names which could lose their underscores; for example, Bitcoin_Tx could be BitcoinTx, and Holochain_Key_V1 could be HolochainKeyV1.

This is all stuff I could do via a PR after we've moved the code to the main repository, though, so it's up to you if you want to do this yourself now or leave it to me to do later.

I've covered the first three points and left the last one to you, because of the challenges with having special cases mentioned earlier.


You can find the commit in the transfer branch or here: dennis-tra/go-multicodec@5b6d55a

mvdan commented

Thanks! Looks good. I think we can leave out .gitignore, because it's not really needed for a vanilla Go repo.

I'm also thinking that it would be best to leave the github action out of the first version, because I feel like it's a potential security concern - a third party action has access to write to our repo, and since we follow a tag/branch the upstream could run arbitrary code with those credentials. And it's really not all that necessary; the table should only update every few weeks, and the cost of updating is a simple go generate && git commit -m '...' && git push.

I'll get to reviving the repo this week, thanks for your help!

mvdan commented

My first attempt is at multiformats/go-multicodec#37, after unarchiving the repository and closing all previous issues and PRs. Let me know what you think. We should probably close this issue and continue over there.

mvdan commented

The above has been merged, thanks @dennis-tra! I'll continue with PRs over there.

Thanks @mvdan, I may have chimed, if I had seen the conversation there - great to see it merged :)