Add the ability to distinguish unique installations to anonymous usage report

Question

Opened this issue a month ago · 5 comments

Feature Description

Problem Definition

k6 currently collects anonymous usage information as part of its opt-out usage report (--no-usage-report). This report helps us understand k6 usage patterns to improve the tool and guide development decisions.

To better support the k6 development process, we would like to measure the number of active installations of k6 over time.

This requires the ability to track when a given installation of k6 (on a machine that has not opted out of the usage report) was last used. Since each usage report already includes a timestamp, the only additional functionality needed is a mechanism to distinguish one installation from another.

Considerations

The identifier introduced to enable this functionality would:
• Be anonymous.
• Be stored locally on the machine running k6.
• Be included in the usage report only if telemetry has not been opted out.

This identifier:
• Will not contain any personally identifiable information (PII) or system-specific data (e.g., username, hostname, IP address, etc.).
• Will comply with GDPR and other relevant privacy laws by being designed to avoid user identification and to remain strictly anonymous.

Risks

Storing a random identifier locally exposes it to tampering: we should not trust data that the user can modify. Having the identifier be reproducible or verifiable by k6 before submission would therefore be a nice-to-have.

Why This Matters

Having a reliable measure of active installations will:
• Allow us to make more informed decisions about features and improvements.
• Help us better understand k6’s reach and growth while respecting user privacy.

Inspiration & References

ID generation

machineid
In a previous role, I was exposed to a similar need, and we used a system fingerprinting mechanism that created a hash for the user system. We had the ability to verify this fingerprint, but the hash itself was cryptographic and thus non-reversible.

Proposed solution(s)

TODO

Already existing or connected issues / PRs (optional)

#4038

Answer 1 · 2024-11-27T14:48:05.000Z

For context, the Alloy project uses a UUID they call a "seed". This seed is saved on disk on the user's system as a "seed file".
See https://github.com/grafana/alloy/blob/cc383c1edf988fd4763582c86a2e4b85bcc0f055/internal/alloyseed/alloyseed.go.

cc @joanlopez

Answer 2 · 2024-11-27T16:34:39.000Z

The most challenging part I see here is what you, @oleiade, included in the risks section, especially considering that this is an open-source project, which makes it harder to keep any secrets hidden.

However, I'm not quite sure it is really worth it, because as of now we're not doing anything to prevent fake data at the report level, and I see this as just a subcase of that.

Do you have any particular idea on how to solve this?

Answer 3 · 2024-11-28T08:29:28.000Z

@joanlopez I have a couple of ideas, but I don't necessarily think any of them are worth the hassle:

• We could sign the UUID we generate with a public/private key pair. That would involve a bit of infrastructure work that I don't even know is possible, but it would work.
• We could bake hardware- and environment-related information into the identifier: id = sha256(os_version + architecture + UUID + salt) (or something along those lines). That way both k6 and the usage report receiver would be able to verify the hash in a somewhat reliable way.
• A bunch of other ways that I don't necessarily think are worth it.
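For reference, the sha256-based idea could be sketched as follows. The function names are hypothetical, and the sketch departs from the formula in one respect: it writes a zero byte after each input rather than concatenating them directly, so that different input splits cannot produce the same hash. It also assumes the receiver has access to the same inputs, including the salt.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// deriveID hashes the inputs from the formula above into a hex-encoded
// SHA-256 digest. Each part is followed by a zero byte so that
// ("ab", "c") and ("a", "bc") do not collide.
func deriveID(osVersion, arch, uuid, salt string) string {
	h := sha256.New()
	for _, part := range []string{osVersion, arch, uuid, salt} {
		h.Write([]byte(part))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// verifyID recomputes the digest from the same inputs and compares it
// with the submitted identifier in constant time.
func verifyID(id, osVersion, arch, uuid, salt string) bool {
	want := deriveID(osVersion, arch, uuid, salt)
	return subtle.ConstantTimeCompare([]byte(id), []byte(want)) == 1
}
```

Since k6 is open source, the salt would be readable by anyone, so a check like this only catches accidental or naive tampering: a user who alters the inputs can recompute a matching hash.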

In general I don't really think any of those are worth it? Would you agree?
The core risk can also most likely be statistically mitigated, by correlating with other information we collect in the usage report, too.

Answer 4 · 2024-11-28T10:30:05.000Z

In general I don't really think any of those are worth it? Would you agree? The core risk can also most likely be statistically mitigated, by correlating with other information we collect in the usage report, too.

Yeah, I agree, as I said before. Indeed, if we want to implement any sort of hashing checks, I'd suggest doing it for the whole payload, and not only for this specific field, because any part of the report could be altered.

The problem is that there's probably no really safe and cheap/easy way to do so. Just for the id, it's true that, for instance, your first suggestion would probably work and would be mostly safe, but I still have serious doubts about it really being worth it, for the aforementioned reasons.

Answer 5 · 2024-11-28T10:32:07.000Z

I agree. My preference would be to adopt the same UUID approach as Alloy, and address any issues incrementally as they occur 👍