yarnpkg/berry

[Feature] RFC: Telemetry

arcanis opened this issue · 12 comments

Describe the user story

As a maintainer, it's sometimes difficult to know what we should prioritize. Are large monorepos the most common situation our users encounter? Which packageExtensions are the most common? How many people have opted out and switched to the nm linker? Etc.

Because of the lack of analytics, some projects also have trouble taking us seriously. A thread about the Node Docker image recently suggested removing Yarn from the image, citing Yarn as a fringe tool. I don't have time to spend collecting the various polls from the surface of the earth.

Describe the solution you'd like

I propose we implement opt-out telemetry.

Homebrew is an OSX package manager with some level of analytics (they actually log more than what I have in mind for us: https://docs.brew.sh/Analytics).

  • Users would be anonymous. We wouldn't implement "client IDs".

  • Data would be stored on a third-party we don't own. In our case, something like Google Analytics would be perfect. On this point, I've investigated Google Analytics a bit and I'm not sure it's an option. The dashboards are very bare, and it doesn't seem to have good support for arrays, which would be necessary to support plugin and command names, unless we split the data across dozens of calls. Perhaps Datadog would be a better fit after all.

  • Events would be aggregated, and sent weekly. We wouldn't be able to track anything at a finer granularity. As a result, telemetry wouldn't have any effect on CI.

  • Information about telemetry would be displayed on first install, together with a link explaining it in more detail. Documentation would include a new page describing it.

  • A new yarn analytics off command would disable it for all projects on the machine (on would re-enable it). Running yarn analytics show would print the information that would be sent.

  • The payload would be sent only during installs (not during run or anything else), in parallel with the regular install workflow (so it shouldn't have any significant overhead). Connectivity failures would be ignored and not cause installs to fail.

  • The information I propose we would track:

    • The Yarn version
    • Which command name is used (but not its arguments)
    • The active plugin names (only for our own plugins)
    • The number of installs run during the week
    • The number of different projects having been installed
    • How many installs for the nm linker
    • The number of workspaces
    • The number of dependencies
    • The packageExtensions field (name of extended + name of the extra dependency)
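
To make the list above more concrete, here is a minimal sketch of how the weekly aggregate could be accumulated locally and then sent without ever failing an install. Everything in it is an assumption made for illustration: the `WeeklyAggregate` shape, the `recordInstall` and `sendIgnoringFailures` helpers, and the idea of recognizing our own plugins by the `@yarnpkg/plugin-` name prefix are hypothetical, not an existing Yarn API. It covers only part of the proposed fields.

```typescript
// Hypothetical shape of the weekly aggregate; all field and function names
// here are illustrative assumptions, not Yarn's actual telemetry format.
interface WeeklyAggregate {
  yarnVersion: string;
  installCount: number;
  nmLinkerInstallCount: number;
  commandCounts: Record<string, number>; // command name only, never arguments
  pluginNames: string[];                 // first-party plugins only
}

// Assumption: first-party plugins are recognizable by this name prefix.
const FIRST_PARTY_PREFIX = "@yarnpkg/plugin-";

function createAggregate(yarnVersion: string): WeeklyAggregate {
  return {
    yarnVersion,
    installCount: 0,
    nmLinkerInstallCount: 0,
    commandCounts: {},
    pluginNames: [],
  };
}

// Fold one install into the local aggregate; nothing is sent here.
function recordInstall(
  agg: WeeklyAggregate,
  commandName: string,
  usesNmLinker: boolean,
  activePlugins: string[]
): void {
  agg.installCount += 1;
  if (usesNmLinker) agg.nmLinkerInstallCount += 1;
  agg.commandCounts[commandName] = (agg.commandCounts[commandName] || 0) + 1;
  for (const name of activePlugins) {
    const isFirstParty = name.indexOf(FIRST_PARTY_PREFIX) === 0;
    if (isFirstParty && agg.pluginNames.indexOf(name) === -1) {
      agg.pluginNames.push(name);
    }
  }
}

// Run the injected transport and swallow any failure, so that a network
// error can never fail the install itself.
async function sendIgnoringFailures(send: () => Promise<void>): Promise<boolean> {
  try {
    await send();
    return true;
  } catch {
    return false;
  }
}
```

The transport is passed in as a function so that the sketch stays independent of whichever provider ends up being chosen.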

Describe the drawbacks of your solution

Telemetry is seen with an understandable amount of caution. Not helping, the project was once associated with Facebook, and it will be important to remind users that we don't have any particular link with it anymore. Using a third-party provider (such as Google Analytics) will also be a good way to guarantee that we don't collect unlisted data (such as IP, etc).

Describe alternatives you've considered

We could do without telemetry. Unfortunately, I think the lack of consideration we get from some entities is caused at least in part by the lack of metrics we can show them (helping us will have impact on X thousands of developers). Not having those tools requires us to put more work into convincing them, which is exhausting.

I'm hesitant about it, but as long as the project is very transparent about it (perhaps notifying users on initial install of yarn?), I think most will be okay with it (myself included).

The way I see it, the first install would print something like this:

❯ yarn/packages/plugin-typescript ❯ yarn install

➤ BR0000: Yarn will collect anonymous telemetry; consult this page for more information:
➤ BR0000: https://yarnpkg.com/telemetry (or run `yarn telemetry off` to disable)

➤ BR0000: ┌ Resolution step
➤ BR0002: │ babel-preset-jest@npm:24.1.0 doesn't provide @babel/core@^7.0.0-0 requested by @babel/plugin-syntax-object-rest-spread@npm:7.2.0
➤ BR0000: └ Completed in 0.24s

Since we would only send the data once every week, you'd then have seven days to disable it before the first (anonymous) payload is sent. And if you stop using Yarn before the seven days, it won't send any data at all.

Would the "once-a-week" thing be a general setup -- i.e. would EVERY install of yarn send data on say Saturday morning, or would it be individual from the first day of use? e.g. I use it on Monday the first, so by NEXT Monday the 8th, etc?

That would be the day of first install + 7. I find it fairer, and it's also better for us since we would get gradual information over the course of the week rather than all at once.
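
As a rough illustration of the "first install + 7" rule, the check could look like the sketch below. The `isPayloadDue` name and the idea of storing the start of the current window as a millisecond timestamp are assumptions made for this example, not a description of the actual implementation.

```typescript
// One week expressed in milliseconds.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: `windowStartMs` is the time of the first install (or
// of the previous upload), `nowMs` is the current time. A payload is due
// only once a full week has elapsed, so each machine follows its own
// schedule instead of every install reporting on the same weekday.
function isPayloadDue(windowStartMs: number, nowMs: number): boolean {
  return nowMs - windowStartMs >= WEEK_MS;
}
```

With a first install on Monday the 1st, the check stays false through Sunday the 7th and becomes true on Monday the 8th.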

Ok. I can appreciate that; the server hit that you mentioned was going to be my next remark.

Data would be stored on a third-party we don't own.

Why? Isn't it better for the data to be owned by the Yarn project rather than a third party?

My original line of thought was "if users know for sure that there's no chance we could ever get access to the raw data, they might trust us more". I'm not sure it would really change many things in practice though 🤔

I feel like people would trust it more if the server-side portion was open source, compared to using a third party closed-source system.

non25 commented

I've examined ~/.yarn/berry/telemetry.json, the content is mostly fine, but I feel you should omit saving directories.

      "enumerators": {
        "projectCount": [
          "/tmp/webpack-virtual-modules"
        ]
      },

I find that unacceptable and it makes me want to opt-out.

It also contradicts:

Users would be anonymous. We wouldn't implement "client IDs".

I've examined ~/.yarn/berry/telemetry.json, the content is mostly fine, but I feel you should omit saving directories.

It doesn't send the directory. As you mentioned, that would be really unacceptable. Instead, we simply keep the list of project paths locally for bookkeeping purposes (to avoid counting the same path over and over again), but before emission we turn them into a number (note the field name: projectCount).

So in your example, we would send "projectCount": 1, not "projectCount": ["/tmp/..."].
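
A minimal sketch of that conversion, assuming the local file maps each enumerator name to a list of strings as in the excerpt above (the `summarizeEnumerators` name is hypothetical and the real code may differ):

```typescript
// Local bookkeeping: enumerator name -> raw values seen (e.g. project paths).
type Enumerators = Record<string, string[]>;

// Before emission, replace each list by the number of distinct values it
// holds, so that only counts (never the paths themselves) leave the machine.
function summarizeEnumerators(enumerators: Enumerators): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of Object.keys(enumerators)) {
    counts[name] = new Set(enumerators[name]).size;
  }
  return counts;
}
```

Calling it on `{ projectCount: ["/tmp/webpack-virtual-modules"] }` yields `{ projectCount: 1 }`.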

I am not entirely sure, but the information proposed for collection might be enough to identify individual entities. This should be tested and/or checked with someone who has expertise in this area.

If this information were sufficient to identify individual entities, the GDPR would probably apply to users inside the EU. In that case, the telemetry would need to be opt-in, a privacy policy would have to be published and kept up to date, and a few other things would need to be managed. Whether that would be less work overall is an open question. 🤔

(Personally I think opt-in telemetry is also a more friendly approach. It is more like: "Hey if you want to help us, you just need to enable telemetry for us to understand how you work with yarn!" instead of "If you do not want to help us, you can stop it anytime". But I also see that the opt-out approach has its appeal in that there might be more data which can be analysed)

Telemetry is seen with an understandable amount of caution. Not helping, the project was once associated with Facebook, and it will be important to remind users that we don't have any particular link with it anymore. Using a third-party provider (such as Google Analytics) will also be a good way to guarantee that we don't collect unlisted data (such as IP, etc).

There are a few more options which might be more trustworthy than Google Analytics and worth considering:

We have a thing called documentation:

By default, we don't assign unique IDs in the telemetry we send, so we have no way to know which data originates from which project. This setting can be used to force a user ID to be sent to our telemetry server. Frankly, it's only useful in some very specific use cases. For example, we use it on the Yarn repository in order to exclude our own usage from the public dashboards (since we necessarily run Yarn more often here than anywhere else, the resulting data would be biased).