Our dependency parsing is sometimes too naive
Closed this issue · 2 comments
Currently
Cljdoc-analyzer naively uses jsoup (an HTML/XML parser) to parse the pom to learn about an artifact's dependencies.
Scenarios this can fail include:
- pom properties (ex. version is
<version>${clojure.version}</version>
- managed dependencies (ex. version is specified under
<dependencyManagement>
but not under<dependencies>
) (actually we do naively try to handle this). - parent poms are not considered
Example problem: cljdoc/cljdoc#742
Analysis
We already use clojure tools deps to resolve dependencies.
It could resolve all of the above.
Except for one case.
Clojure tools deps, by design, filters out everything but runtime dependencies.
Cljdoc currently includes provided
scope
deps as a way for a library to specify optional dependencies. These are, by design, excluded by clojure tools deps.
Specific usages
The cljdoc analyzer currently uses naively parsed <dependencies>
to:
- add pom
provided
deps to the classpath (it also, btw, includessystem
andtest
scope dependencies and deps marked asoptional
, I'll refer to all of these asprovided
) - remove problematic deps
- those with an artifact name starting with
boot-
(not sure the history of this one, but whatevs for now) org.clojure/tools.reader
to avoid conflicting with cljs tools.reader
- those with an artifact name starting with
- ensure key deps use recent versions, if not, automatically bump them
- ensure key deps are present, ex.
javax.servlet/javax.servlet-api
is always included because some enough libs assumed it was provided.
In all the above adjustments, we currently only look at deps specified in the analyzed lib pom.
We don't look at transitive deps.
Thoughts
We could simply take advantage of clojure tools deps resolving support.
Except that it does not include provided
scope dependencies.
Scope Provided deps
Option 1: Hack tools deps to return provided deps
There might be just a couple of spots to alter-var-root
.
Con: brittle, we'd be hacking tools deps internals and those are subject to change
Pro: might be easy in the near term
Option 2: Use pomegranate
Con: Deps resolution differs from tools deps, but we could use it solely to discover provided
deps.
Former con: lib was stale with CVEs, but I recently did a maintenance pass to resolve that.
Pro: We write less code
Neutral: Depends on maven resolver libs, but so does clojure tools deps
Option 3: Use maven resolver directly
Pro: no extra deps
Con: java interop
Con: might be just duplicating something quite small that works already (pomegranate)
Current choice
I think I'll explore option 2 first and see where that leads.
Transitive deps?
Should we be looking at these too? (see "specific usages" above).
For provided
deps we probably just want to check non-transitives.
For other deps fixups we could fixup transitive runtime deps, but we'll start with current behaviour of only looking at non-transitive deps.
Future-features
I'm not going to worry right now about our plan to someday build docs from git-sources and how decisions for this issue might impact that feature.
Upstream usage
Does the cljdoc web site want/need any of this information?
The work could be saved for cljdoc if it needs it.
We do save the raw pom
and I do see naively parsed dependencies loaded by cljdoc... but I don't think this data is currently used.
Notes from an exploration
I've learned that pomegranate does not return provided
deps by default.
We can twist its arm to do so, but I'm going to take another look at tools deps.
I am seeing some code that looks like a closer fit for cljdoc's current algorithm/strategy.
In clojure.tools.deps.extension.pom
(which is internal, it is marked with :skip-wiki
, which is the same as :no-doc
) we have:
read-model
which reads a pom (not naively)model-deps
returns non-transitive runtime deps for a pom
I could:
- take the risk of using the internal
read-model
as is. - and bring over
model-deps
code, but modify it to return all non-transitive deps so we can get atprovided
deps
This exploration is a variation of option 1.
It might be fine for a first go.
If we learn it is too brittle moving forward, we can adapt.
I could question cljdoc-analyzer's current strategy of reading the pom, but I don't think I'll do that at this time.
More Notes - provided exclusions
Things are starting to work locally.
One interesting thing is that our naive parsing of the pom did not pick up exclusions for provided deps.
Will dig deeper, but I think this is why http://repo.clojars.org/metosin/compojure-api/2.0.0-alpha27/compojure-api-2.0.0-alpha27.pom is failing for me locally. It excludes commons-codec/commons-codec
, but ring.util/codec
get loaded during analysis and fails because org.apache.commons.codec.binary.Base64
cannot be found.
I think I'll start with mimicking current naive parsing and not bring over provided exclusions.