phylum-dev/cli

Parse results not deduplicated

Closed · 1 comment

Overview

Repeated entries in dependency files appear in the parse results as many times as they are repeated.

How To Reproduce

Steps to reproduce this behavior:

❯ phylum --version
phylum v5.8.0

❯ cat requirements.txt
pyyaml==5.3.1
pyyaml==5.3.1

❯ phylum parse --lockfile-type pip requirements.txt
[
  {
    "name": "pyyaml",
    "version": "5.3.1",
    "type": "pypi",
    "lockfile": "requirements.txt"
  },
  {
    "name": "pyyaml",
    "version": "5.3.1",
    "type": "pypi",
    "lockfile": "requirements.txt"
  }
]

❯ phylum parse --lockfile-type pip requirements.txt | jq 'length'
2

Expected Behavior

Repeated entries are included only once in parse results.

Additional Context

Maybe this is less of a bug and more of a feature request. However, there isn't a use case that comes to mind where having repeat entries in a dependency file is intended. Reporting on the number of dependencies, based on the parsed output, also seems wrong if repeat entries are counted.

#1274 is a related issue where repeat entries are captured due to not normalizing package names.
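For context on that related issue: for pypi packages, two differently-spelled names refer to the same package when they compare equal after PEP 503 normalization. A minimal sketch of that normalization rule (not phylum's actual implementation):

```python
import re

def normalize(name: str) -> str:
    """PEP 503 package-name normalization: runs of '-', '_', and '.'
    collapse to a single '-', and the result is lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("PyYAML"))       # → pyyaml
print(normalize("ruamel.yaml"))  # → ruamel-yaml
```

Without this step, `PyYAML` and `pyyaml` parse as two distinct entries even though pip treats them as the same package.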

I don't see any reason to deduplicate these results. The API does deduplication automatically when submitting a job, so this does not break anything.

And the only file where this is remotely likely to happen is requirements.txt, because it is an abomination of a lockifest. Normal lockfiles and manifests don't have this problem because the dependency tooling deduplicates entries when creating the lockfile.

[T]here isn't a use case that comes to mind where having repeat entries in a dependency file is intended.

This further makes the point. If there is no use case for it, there is no reason to fine-tune our handling of it. The current behavior does not break anything.

Reporting on the number of dependencies, based on the parsed output, also seems wrong if repeat entries are counted.

Reporting on the number of dependencies is not a feature of phylum parse. Anyone using this output for that purpose can easily do the deduplication themselves. At that point, they can also handle identical dependencies that appear in different lockfiles, and any other edge cases that concern them.
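As a sketch of that client-side deduplication (assuming the JSON shape shown in the reproduction above; the helper name is hypothetical):

```python
import json

def dedupe_packages(parse_output: str) -> list[dict]:
    """Remove repeated entries from `phylum parse` JSON output,
    keeping the first occurrence of each package."""
    seen = set()
    unique = []
    for pkg in json.loads(parse_output):
        key = (pkg["name"], pkg["version"], pkg["type"], pkg["lockfile"])
        if key not in seen:
            seen.add(key)
            unique.append(pkg)
    return unique

# The duplicated pyyaml output from the reproduction above:
raw = """[
  {"name": "pyyaml", "version": "5.3.1", "type": "pypi", "lockfile": "requirements.txt"},
  {"name": "pyyaml", "version": "5.3.1", "type": "pypi", "lockfile": "requirements.txt"}
]"""
print(len(dedupe_packages(raw)))  # → 1
```

Since the transcript already pipes through jq, `phylum parse ... | jq 'unique | length'` would get the same count without any scripting.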