constabulary/gb

gb vendor fetch: do not check out same remote repository for different import paths

Opened this issue · 1 comment

seh commented

When one runs gb vendor fetch, gb calls main.fetch to acquire, copy a portion of, and then discard its copy of the remote repository. After that, so long as its -no-recurse flag is false, it proceeds to fetch the missing transitive dependencies of the source it's acquired thus far.

The problem arises when one requests fetching of one import path from a repository that yields files that in turn import alternate paths within that same repository. Consider a hypothetical repository:

  • example.com/org/repo/.git
  • example.com/org/repo/p1
    • file1.go

          package p1

          import "example.com/org/repo/p2"

          var P p2.Something

  • example.com/org/repo/p2
    • file2.go

          package p2

          type Something string

If one runs

gb vendor fetch example.com/org/repo/p1

then gb will fetch the repository example.com/org/repo, copy the p1 path within it, then proceed to fetch the same repository again, then copy the p2 path within it.

This doesn't matter much for small repositories, but for large ones it can take many hours, wasting bandwidth and churning the disk unnecessarily. Consider augmenting main.fetch to remember the set of repositories it has downloaded since its initial top-level invocation, and to destroy them all only when unwinding back up to the top level. Intermediate recursive invocations could share that repository cache to avoid downloading the same repository more than once.
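The proposal above could be sketched roughly as follows. This is not gb's actual code: repoCache, repoRoot, fetch, and fetchAll are hypothetical names, the download step is a stub that only counts calls, and treating the first three path elements as the repository root is a simplifying assumption standing in for real repository discovery.

```go
package main

import (
	"fmt"
	"strings"
)

// repoCache (hypothetical) remembers which repository roots have been
// downloaded during one top-level fetch so recursive calls can reuse them.
type repoCache struct {
	download func(root string) (string, error) // stub: "clones" root, returns a local dir
	cleanup  func(dir string)                  // stub: removes a local checkout
	dirs     map[string]string                 // repository root -> local checkout dir
}

// get returns the checkout for root, downloading it only on first use.
func (c *repoCache) get(root string) (string, error) {
	if dir, ok := c.dirs[root]; ok {
		return dir, nil
	}
	dir, err := c.download(root)
	if err != nil {
		return "", err
	}
	c.dirs[root] = dir
	return dir, nil
}

// close discards every cached checkout; call it once, at the top level.
func (c *repoCache) close() {
	for root, dir := range c.dirs {
		c.cleanup(dir)
		delete(c.dirs, root)
	}
}

// repoRoot is a toy assumption: host/org/repo is taken as the repository root.
func repoRoot(importPath string) string {
	parts := strings.Split(importPath, "/")
	if len(parts) > 3 {
		parts = parts[:3]
	}
	return strings.Join(parts, "/")
}

// fetch vendors importPath, then recurses into its imports, sharing the cache.
func fetch(c *repoCache, importPath string, imports map[string][]string, vendored map[string]bool) error {
	if vendored[importPath] {
		return nil
	}
	if _, err := c.get(repoRoot(importPath)); err != nil {
		return err
	}
	vendored[importPath] = true // stands in for copying the package out of the checkout
	for _, dep := range imports[importPath] {
		if err := fetch(c, dep, imports, vendored); err != nil {
			return err
		}
	}
	return nil
}

// fetchAll is the top-level invocation: it owns the cache and destroys it
// only after the whole recursion has unwound.
func fetchAll(importPath string, imports map[string][]string) (downloads int, vendored map[string]bool, err error) {
	c := &repoCache{
		download: func(root string) (string, error) {
			downloads++
			return "/tmp/checkout/" + root, nil
		},
		cleanup: func(string) {},
		dirs:    map[string]string{},
	}
	defer c.close()
	vendored = map[string]bool{}
	err = fetch(c, importPath, imports, vendored)
	return downloads, vendored, err
}

func main() {
	// The hypothetical repository from this issue: p1 imports p2.
	imports := map[string][]string{
		"example.com/org/repo/p1": {"example.com/org/repo/p2"},
	}
	n, vendored, err := fetchAll("example.com/org/repo/p1", imports)
	if err != nil {
		panic(err)
	}
	fmt.Println("downloads:", n, "packages vendored:", len(vendored))
}
```

With the example repository above, both p1 and p2 are vendored from a single download, because the recursive call for p2 finds example.com/org/repo already in the cache.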

Yes, this is something I need to fix. It's not just inefficient, it's actually wrong to cherry-pick parts of a repo.

On Fri, Sep 23, 2016 at 5:32 AM, Steven E. Harris notifications@github.com wrote:

When one runs gb vendor fetch, gb calls main.fetch (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L84) to acquire (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L103), copy a portion of (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L134), and then discard its copy of the remote repository (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L142). After that, so long as its -no-recurse flag (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L40) is false, it proceeds to fetch the missing transitive dependencies of the source it's acquired thus far (https://github.com/constabulary/gb/blob/master/cmd/gb-vendor/fetch.go#L195).


