Setup monorepo with lerna, account library integration
holgerd77 opened this issue Β· 35 comments
Being in discussion for some time there is now an agreement to integrate the following libraries here by setting up a monorepo with lerna similar to the setup in the ethereumjs-config:
We'll start with integrating the account
library due to its small size, low direct usage level and proximity to the VM logic, integration of the other libraries should be handled as a separate issue.
This has to go along with an update on the ethereumjs-config library (libraries) to be able to distribute the respective configurations to a monorepo-style consuming repository. This effort shouldn't be underestimated and will likely be as much work as the direct integration here.
With the stable Istanbul being released, I think it's time to re-start this conversation. I had two specific discussion-points in mind, namely:
- Do we keep the library names as is, or rename and use the scoped naming, e.g.
@ethereumjs/account
- Do we use different or the same version for all modules in the monorepo?
Re using the same version: I think this has some advantages. To achieve it we have 2 options as I see it. If we rename libraries we can safely reset the version. If we don't rename, we can upgrade all libraries to the biggest version among them (e.g. if VM has v5.x and all other v2.x, we upgrade all to v6)
Yes, I also thought that it would be very much the time for the mono repo implementation. We can also very much do this step-by-step and on a first round e.g. keep everything "as is" so the VM integrating all the modules it currently does and just setup the outer mono-repo structure and maybe do a first bug fix release for testing the new structure (just a first thought, not sure if exactly this way makes sense, but generally: we can take on this one step after the other).
@s1na on your questions:
-
I would have very much the tendency to take on the occasion and switch to a scoped naming here, this would generally make very much sense for the whole
EthereumJS
org libraries (over time) -
I lack very much the "feeling from experience" here. I would assume the biggest disadvantage here would be that one is enforcing e.g. major versions on all mono repo-associated libraries, even if there is no or just a subtle change in the library itself. This would very much affect single users of the respective library, since they eventually won't get a bug fix which would be otherwise be distributed by a separate bug fix release. Now people would be enforced to manually do the major version upgrade. How would you respond to this criticism?
I would assume the biggest disadvantage here would be that one is enforcing e.g. major versions on all mono repo-associated libraries
This would definitely be the downside. The only approach by which we can avoid it is to both keep the names (i.e. no scoped renaming) and the versions. I guess it comes down to a pro/con discussion. To bring a pro for using the same version I'd refer to the discussion in ethereumjs/ethereumjs-block#74.
@alcuadrado curious to hear your thoughts as well.
I'd like to add one more topic for discussion: what should be the ideal test structure for the monorepo?
Here's a brief analysis of the current situation:
platform | # of jobs | latest job duration | |
---|---|---|---|
ethereumjs-vm | circleci | 11 | 13m 13s |
ethereumjs-blockchain | travis | 5 | 3m 6s |
ethereumjs-block | travis | 9 | 4m 12s |
ethereumjs-tx | travis | 9 | 3m 46s |
ethereumjs-account | travis | 5 | 3m 31s |
ethereumjs-common | travis | 4 | 1m 45s |
Both CI platforms that we use still don't have a tailored support for monorepos. I see some approaches:
1. Build all 43 jobs, for every commit
This is the most straightforward way to go, but it is also resource-heavy. It strictly depends on a robust parallelization plan on the CI side, otherwise, we'd have an unpleasant job queue to cope with.
- find out what's our current plan capacity.
2. Only trigger jobs for packages that had changes (π)
In this scenario it would detect which package got changes and run only their jobs. That can be accomplished with a single test runner that uses a combo of lerna diff
and CircleCI API to trigger a job.
The big no
for me is that the tests would still be as siloed as they are today, where you need to release one package to see it failing on a downstream test.
3. Run tests by affected package
We'd use their explicit dependency relation to determine which jobs need to run. I still don't know how elegant this solution would be on the CI side, but I'd be happy to explore this path more.
How to read the image:
- Changes to
vm
should run their own 11 jobs - Changes to
block
should runblock (9)
,blockchain (5)
andvm (11)
= 25 total
- Changes to both
blockchain
andaccounts
should run their own tests +vm
In any case, it seems primordial to migrate them over to circleci
.
Edit: Added ethereumjs-common
to the table and diagram
Great analysis @evertonfraga, and a good point!
I'd also go either with 1 or 3. But yeah as you mentioned as it stands 1 doesn't seem very feasible. I do think however it might be possible to reduce the runtime of some of the tests, making option 1 more feasible. E.g. #536 was a low-hanging fruit I found a while ago which improved runtime quite a bit.
I think it'd also make things a lot simpler if the projects were using a similar toolchain (and CI). This should be a good preparatory first step. ethereumjs-account
can already started to be integrated here though I think.
I like the idea of doing lerna in independent mode and discovered semantic-release.
Ryan brought this up and so I just started thinking about further library integration here (also inspired by the visualization done above by @evertonfraga): in past discussions we had some broader consensus to not include the merkle-patricia-library
here, I would also be skeptical about the ethereumjs-util
library. However I think the Common
and Testing
libraries might be relatively indisputable candidates (Update: Testing
shouldn't be included due to repo size, see comment here)?
Another topic: without further reflection on pros and cons on this I would also have a tendency to independent mode as suggested by @ryanio.
In a monorepo world it might make sense to include util so everything remains under the same umbrella and we receive the full benefits of monorepo-ing.
Would it be too much to nest MPT under util?
I'm really excited that this migration is happening π€©
- Do we keep the library names as is, or rename and use the scoped naming, e.g.
@ethereumjs/account
Renaming the packages will prevent users to get updates. If I'm using, for example, ethereumjs-tx
and bug fixes start to be published as @ethereumjs/tx
, I won't get them.
This would only happen once, so it may not be that important.
- Do we use different or the same version for all modules in the monorepo?
I think I'd use different versions with independent mode, as @ryanio suggested.
@holgerd77's argument about unnecessary major versions and bug fixes is similar to what I just wrote about renaming the packages, but it would be much more frequent so way more important. That alone is enough to use independent mode IMO.
I understand your motivation and share your concern, @s1na . Keeping separate versions will probably lead to situations like ethereumjs/ethereumjs-block#74 again.
I believe the root of the problem is that npm dependencies are handled incorrectly in most of the libraries. Things like -account
, -block
, -blockchain
, -tx
, bn.js
, and possibly others should be peer dependencies most of the time. I really like this article that explains when peer dependencies are needed. I think we all had a short conversation about this in Osaka.
Changing the libs to peer dependencies would be a big breaking change, so I'd keep it outside of the scope of the migration.
Great job @evertonfraga ! I think option 3 is the most interesting, but it may be hard to implement. @ryanio's PR using Github Actions on the -blockchain
repo shows some very promising performance improvements, so I wonder if using actions to run all jobs in the monorepo on every PR won't be performant enough.
We can also very much do this step-by-step and on a first round e.g. keep everything "as is" so the VM integrating all the modules it currently does and just setup the outer mono-repo structure and maybe do a first bug fix release for testing the new structure (just a first thought, not sure if exactly this way makes sense, but generally: we can take on this one step after the other).
Also, I think this is a very good way to approach this. There are lots of things to migrate and config, so doing it gradually makes sense to me.
Ok just a summary on how the consensus seem to form on how we proceed here (correct me or continue the discussion if you have strong opinions against any of the points):
- We use independent mode as Lerna setup and keep on with current version numbers
- We do the renaming to scoped packages like
@ethereumjs/vm
- We use GitHub actions as a unified CI tool (is this already consensus?)
- We do a gradual step-by-step integration, first PR on this would be as described above (only VM, outer structure)
- For testing in doubt we can start with 1. (build all jobs) to keep things simple and expand on this later on further package inclusion
- Rename all git tags to avoid naming conflicts
Anything I forgot (feel free to directly edit the post)? π
My assessment to Github Actions: π
Pros
- Performance really seems to be superior.
- Actions (or jobs) in separate files is beneficial to a monorepo structure.
- It has a marketplace with a growing community, so reusing bits is super easy.
- It goes beyond the traditional workflow of (git push, run job). It can interact with PRs, and do a lot of things.
- last, but not least, it is integrated to Github and the UI is more responsive than Travis and CircleCI.
Concerns and "non-concerns"
- The
ethereum
org is still on the legacy billing structure, so they currently face a challenge there to use Actions. As our repos mostly live underEthereumJS
, can we take advantage of the free builds for GH Actions? - I'm not too concerned with vendor lock-in, as there's a low cost to move over to any other CI tool.
I wonder if the cache limits of github actions won't be a problem. They cache up to 2gb per repo, and node_modules
can be huge. Mor info here
Thanks @alcuadrado, that is indeed an important aspect to keep in mind for the evaluation.
I found this thread: actions/cache#6
It has some interesting discussion, including one user seeing success with running a 700mb node_modules and the cache compressing the archive. Iβll have to give the thread another closer look in relation to our specific needs to extract more conclusions.
I can try to investigate these limits further and if we may find them constraining.
2GB per repo sounds extremely low for a monorepo use case like ours, especially with [vm, account, block, blockchain, tx] all running their own checks in matrix across 4 node versions on every PR push.
From the top of your head, which ci steps / tests / builds / outputs / logs do you find being on the larger side?
I also found Persisting workflow data using artifacts.
As a separate point to add to the list of considerations, how do we feel about semantic-release? I think it would be nice to have a fully automated release process.
@ryanio Can we have this as a separate topic, eventually just open a new issue on that? This can be discussed relatively separated from this issue (and also might need a bit longer and thoughtful discussion) and there is a bit the danger to overload the discussion here.
@holgerd77 Yes great point, have started it here: #625
As a hands-on way to understand the requirements to the migration, I started to make this shell script monorepify.sh.
This can be useful to rehearse the migration and get valuable information ahead of time (eg: would GitHub actions work with X jobs? What about their cache limit?). Also, this automation tool could be in fact used to perform the changes we need π.
The migration can be run for vm
repo only (just comment line 11, might have some quirks), with any combination of the other candidates to the monorepo (edit line 11), or all of them (just run).
Roadmap
Edit: see this project for related tasks: https://github.com/orgs/ethereumjs/projects/1
I'll use my own fork as a playground, and you can inspect it here: https://github.com/evertonfraga/ethereumjs-vm/
Good news: GitHub actions make it trivial to implement my proposed solution (3). Weβll have to declare some exclusion paths to each workflow file, so the irrelevant builds are not ran for each push
.
Nice script @evertonfraga! You may want to install lerna in one of its first steps, as it has this very handy command https://github.com/lerna/lerna/tree/master/commands/import
Should we create a new repo for the monorepo? This way we can leave all the different library-specific repos as they are now and archive them.
Thanks, @alcuadrado! I was uncertain at first to use an importer like lerna's, as it rewrites all commits as if the files were always under the new file structure. But maybe it's the best way forward.
Re: new repo for it β I lean towards to keep ethereumjs-vm
reputation (stars) and network (forks). It can be renamed to ethereumjs-monorepo
in the end.
On the caching question:
Just had a look at the VM node_modules
folder, this is roughly 1 GB uncompressed and 95 MB compressed (which counts). This should be one of the significantly larger ones having ethereumjs-testing
with its huge submodule folder as a dependency.
What is with open PRs on the other repos (but eventually also on the VM?) containing a significant amount of work? These should likely (optimally) be merged before a transition (hope this is realistic)?
PRs on the other repos
Sure, good point. This is something we should perform some manual tests and create a playbook of.
Monorepo Transition PR Analysis
Some analysis of the currently open PRs on the repos.
Feel free to edit the post. I would suggest not to include the monorepo working PRs (e.g. on the GitHub actions like ethereumjs/ethereumjs-account#59) to not clutter the list too much.
Account
No significant open PRs.
Block
- Muir Glacier PR, ,, @evertonfraga β Merge before or resubmit after transition
- v3.0.0 Release, ethereumjs/ethereumjs-block#74, @holgerd77 β Merge before or resubmit
Blockchain
- Test Refactor PR, ethereumjs/ethereumjs-blockchain#134, @ryanio β Merge before (better, nearly done) or resubmit
Transaction
- Simplification Experiment, ethereumjs/ethereumjs-tx#154, @alcuadrado β open state, leave and ask for resubmission if picked up again
- Test for verifySignature, ethereumjs/ethereumjs-tx#105, @fanatid β rather small PR, open state, leave and ask for resubmission if picked up again
Common
- Muir Glacier, ethereumjs/ethereumjs-common#75, @evertonfraga, β Merge before or resubmit
VM
- All recent open PRs (there are some which just need to be closed) are from @s1na or @alcuadrado β Can be asked for resubmission if picked up again
Summary: Ok, PR situation is actually easier to handle than I initially thought. Basically for the moment we are fine and there is some defined path to go for all open PRs. π
Our goal is to have a smooth migration, without surprises or rushed changes because something was not considered first place. My approach is mostly programmatic, as IMO doing every step manually can be error-prone.
I am using a project board to keep track of all requirements to accomplish that. Please take a look and see if I am forgetting anything.
https://github.com/orgs/ethereumjs/projects/1
Feel free to add new cards, and if any of those are worth discussing, feel free to convert to an issue (Click on card β¦ > Convert to Issue).
@evertonfraga That's really cool, thanks for setting this up! π
Hi all,
I need more context regarding the use of ts-node
in our projects in order to proceed with the concerted test efforts using Lerna. I see there are some different ways to spawn the tests across the projects.
1. account
, block
, common
and tx
They all pass the main test files to ts-node
, naturally don't requiring dist/
files to be present
ts-node node_modules/tape/bin/tape ./test/index.ts
2. blockchain
It first builds dist/
files, then runs ts-node requiring modules from dist/
.
npm run build && ts-node node_modules/tape/bin/tape ./test/index.ts
3. vm
I assume this is outdated code, since that currently the tests don't work without the --dist
flag.
npm run build:dist && node --stack-size=1500 ./tests/tester --blockchain --dist
Now I wanted some feedback: should we normalize the libs to run with ts-node
, without the need to build the files first?
I have to say that this would help quite a lot on the CI side.
I've always found the vm setup somewhat frustrating, as I tend to forget to run build:dist
, because I normally run all my tests with ts-node
.
Thanks for the feedback @alcuadrado.
The VM npm scripts used to let you test both the code and dist, with the βdist
flag, but it lost that capability with the TS migration.
Iβm taking the ts-node
route that will restore that capability, and will help a lot with CI, so I shouldnβt need to move dist files across jobs for integration tests.
Some update on Greenkeeper, this is activated for many repositories (see ethereumjs/ethereumjs-account#55 for an exemplary PR) to notify about dependency updates (just had a look at the setup, not in a consistent state though, some repos missing, some not activated):
First, monorepos seems to be supported.
To ease the migration I've now disabled Greenkeeper to stop opening new PRs, I will also close existing open PRs on the respective libraries. We can then re-activate once the transition is done (or eventually also considering switching to another tool, seems you also have some alternative solution here on the Grid side, don't see any pressure for a switch though in general and have no comparison overview. Very much a side topic I would say, but anyhow).
We started using RenovateBot for Grid repositories. It had lots of granularity for configurations, but in the end I had no special need with the array of settings for it.
I had planned to implement Greenkeeper in the monorepo already, since it's already used by the target repository (ethereumjs-vm) and it supports multiple package.json
. PRs can be grouped by every new dependency, which makes everything cleaner.
I have activated Greenkeeper today in our monorepo playground: https://github.com/evertonfraga/ethereumjs-vm/
Mind that Readme linked above is not yet complete, but it's already a good opportunity for feedback.
@evertonfraga whuah, have to admit that I didn't look at this monorepo playground yet you created. This already looks fantastic! π
"This branch is 1305 commits ahead, 31 commits behind ethereumjs:master."
Lol. π€ͺ