sourcecred/operations

Set up a new repo structure for SourceCred

Closed this issue · 4 comments

At present sourcecred/sourcecred contains all of the development activity for SourceCred. This was convenient for rapid development progress during the initial development of the app.

There are some disadvantages:

  • It makes project setup more complicated (need to install sqlite for the GitHub plugin, even if you only
  • It encourages lax API boundaries
    • (e.g. the attribution flow only being runnable with the cred explorer (#967))
    • splitting the plugins into their own repositories would likely correlate with improving their API design
  • It makes the issue tracker more noisy; for someone with a particular interest, it's harder to find where to contribute

I propose the following repositories:

  • sourcecred/core
    • for the core Graph + PageRank
  • sourcecred/explorer
    • for the frontend explorer
  • sourcecred/git-plugin
  • sourcecred/github-plugin
  • sourcecred/website
  • sourcecred/attribution
    • for specifying in detail how the attribution algorithm works, tracking issues and improvements to it. eventually this repo should produce a spec/whitepaper for SourceCred, keep that up to date, and include a process for changing it
  • sourcecred/mission
    • catch-all for all the non-technical / project management type work in SourceCred. tracking objectives, etc.

> I propose the following repositories:

This sounds great in principle! I think that you’re spot-on that it will
improve the quality of the APIs.

Do you have any suggestions about how to implement this?

I kind of like the idea of importing all the Git history into each of
the core, explorer, git-plugin, and github-plugin repositories
and then deleting the irrelevant parts and setting up externs
appropriately. That way, we preserve the whole history faithfully so
that blame/log/follow/cred/whatever work in each repository, existing
commit references are still valid, and from a graphical perspective the
history looks like it forked, which is accurate.

Please do not use the “Transfer this issue” feature among any
sourcecred-organization repositories until we figure out exactly what
this does with respect to our GitHub ingestion pipelines. :-)

> Do you have any suggestions about how to implement this?

Of the code repositories listed above, I think I'd like to start by pulling out the website. It is the most logically distinct from the rest of the SourceCred codebase, and because the interface is simpler, it should be the easiest major piece to pull out.

However, I've realized that there are some helper modules that we may want to pull out first: execDependencyGraph.js, util/null.js, util/map.js, util/compat.js. My inclination is to factor each of these out into their own small repository and module, which has the advantage that we can easily publish them on npm and make them easy to depend on in our other projects. (I've wanted null/map util for other js projects I've worked on.)

In that case, I imagine we'll actually use npm (well, yarn) to manage those intra-organization dependencies. Does that seem like a reasonable approach? Or is there a better path I'm not seeing?
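To make the "small module published on npm" idea concrete, here is a minimal sketch of what a standalone null-handling utility could look like. The thread does not show the contents of util/null.js, so the function names (`map`, `get`, `orElse`) and their behavior are assumptions for illustration, not the actual SourceCred API.

```javascript
// Hypothetical sketch of a standalone null-handling utility module.
// Names and semantics are illustrative assumptions, not the real util/null.js.

// Apply `f` to `x`, unless `x` is null or undefined (then return `x` as-is).
function map(x, f) {
  return x == null ? x : f(x);
}

// Return `x`, or throw an Error if `x` is null or undefined.
function get(x, errorMessage) {
  if (x == null) {
    throw new Error(errorMessage != null ? errorMessage : String(x));
  }
  return x;
}

// Return `x`, or `defaultValue` if `x` is null or undefined.
function orElse(x, defaultValue) {
  return x == null ? defaultValue : x;
}

// In a published package this object would be the module's export.
const NullUtil = {map, get, orElse};
```

A consuming project would then list the package in its `dependencies` and install it with yarn like any other third-party module, which is what makes the helper reusable outside SourceCred.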

> I kind of like the idea of importing all the Git history into each of
> the core, explorer, git-plugin, and github-plugin repositories
> and then deleting the irrelevant parts and setting up externs
> appropriately. That way, we preserve the whole history faithfully so
> that blame/log/follow/cred/whatever work in each repository, existing
> commit references are still valid, and from a graphical perspective the
> history looks like it forked, which is accurate.

Looking at the case of util/null, util/map, it seems a bit silly to fork ~1k commits when we're interested in one and four of them respectively. For the micro modules, do you object to using git filter-branch to remove all the irrelevant commits?

For the larger pieces, I can see the benefit to just forking. Assuming (reasonably) that we want to maintain the ability to build and test each repo-module at every point in its history (for the benefit of git bisect, etc.), that limits what we can prune out. For the cred explorer we basically can't prune anything. For the core module we could prune out the cred explorer but not the utilities. It seems simpler just to fork as you suggested. (It may have an interesting effect on the cred distribution, but that's our problem.)
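The pruning argument above amounts to a reachability question: a repo can only drop code that it does not depend on, directly or transitively. A rough sketch of that reasoning follows. The dependency edges are assumed for illustration (the thread only says the explorer needs everything and core needs the utilities), and `mustKeep` and `prunable` are hypothetical helper names.

```javascript
// Hypothetical dependency edges between the pieces discussed in this thread.
// These are assumptions for illustration, not a description of the real code.
const deps = new Map([
  ["explorer", ["core", "git-plugin", "github-plugin", "util"]],
  ["git-plugin", ["core", "util"]],
  ["github-plugin", ["core", "util"]],
  ["core", ["util"]],
  ["util", []],
]);

// Everything reachable from `root` (including `root`): the code a repo
// must retain if it is to keep building.
function mustKeep(graph, root) {
  const seen = new Set();
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) continue;
    seen.add(node);
    for (const dep of graph.get(node) || []) stack.push(dep);
  }
  return seen;
}

// Modules that could be removed from a repo rooted at `root`.
function prunable(graph, root) {
  const keep = mustKeep(graph, root);
  return [...graph.keys()].filter((m) => !keep.has(m)).sort();
}

console.log(prunable(deps, "explorer")); // []
console.log(prunable(deps, "core")); // ["explorer", "git-plugin", "github-plugin"]
```

This only models a single snapshot. Keeping every historical commit buildable would require the same check against the dependency graph at each point in history, which is part of why forking the full history is the simpler option.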

> Please do not use the “Transfer this issue” feature among any
> sourcecred-organization repositories until we figure out exactly what
> this does with respect to our GitHub ingestion pipelines. :-)

Ack.

I noticed with interest that Babel uses a monorepo; here's their design doc. The lerna docs reference several other well-known projects that have taken a monorepo approach.

Seeing this makes me feel more cautious about splitting up SourceCred. (In particular, the Babel doc complains about the difficulties of synchronizing changes across multiple repos.)

There are very clear benefits to pulling the utils out (so I can depend on them elsewhere) and to pulling the website out (it's clearly a separate kind of project). So I plan to take those steps regardless. Afterwards:

  • Pulling the plugins into separate repos accomplishes an important goal of proving (to ourselves and others) that reasonable APIs exist for developing a plugin independently of the core codebase
  • Once the plugins have been pulled out, splitting core and explorer is actually pretty straightforward

So I'm cautiously still inclined towards the multi-repo approach, but we should pull out the site and util modules first and see how that turns out.

cc/ @mikeal and @daviddias who may have some thoughts/experience on monorepo vs multirepo in JS-land

(Haven’t read your two recent comments yet; this is just a follow-up as
my thoughts have simmered.)

Let me actually temper my enthusiasm for this. I think I got caught up
in thinking about how the implementation would work and didn’t think
about the bigger picture as much as I might have liked to. :-)

I still think that you’re right that this will improve the quality of
the APIs. In particular, it will force us to have proper boundaries in
reasonable places, define and document contracts and constraints, etc.
This is a great steady state, but I worry about breaking up the monorepo
before we have at least somewhat reasonable APIs here. Once we split up
the monorepo, it becomes significantly more difficult to make precisely
the kinds of changes that we want to make to improve these APIs: for
instance, moving modules across package boundaries, or atomically
updating multiple packages due to an API change.

Thus, it might be more prudent to put some effort into solidifying (not
necessarily perfecting) our APIs while we still have the monorepo
convenience.