cljdoc/cljdoc-analyzer

Our dependency parsing is sometimes too naive

Closed this issue · 2 comments

lread commented

Currently

Cljdoc-analyzer naively uses jsoup (an HTML/XML parser) to parse the pom to learn about an artifact's dependencies.

Scenarios this can fail include:

  1. pom properties (ex. version is <version>${clojure.version}</version>
  2. managed dependencies (ex. version is specified under <dependencyManagement> but not under <dependencies>) (actually we do naively try to handle this).
  3. parent poms are not considered

Example problem: cljdoc/cljdoc#742

Analysis

We already use clojure tools deps to resolve dependencies.
It could resolve all of the above.

Except for one case.
Clojure tools deps, by design, filters out everything but runtime dependencies.

Cljdoc currently includes provided scope deps as a way for a library to specify optional dependencies. These are, by design, excluded by clojure tools deps.

Specific usages

The cljdoc analyzer currently uses naively parsed <dependencies> to:

  1. add pom provided deps to the classpath (it also, btw, includes system and test scope dependencies and deps marked as optional, I'll refer to all of these as provided)
  2. remove problematic deps
    1. those with an artifact name starting with boot- (not sure the history of this one, but whatevs for now)
    2. org.clojure/tools.reader to avoid conflicting with cljs tools.reader
  3. ensure key deps use recent versions, if not, automatically bump them
  4. ensure key deps are present, ex. javax.servlet/javax.servlet-api is always included because some enough libs assumed it was provided.

In all the above adjustments, we currently only look at deps specified in the analyzed lib pom.
We don't look at transitive deps.

Thoughts

We could simply take advantage of clojure tools deps resolving support.
Except that it does not include provided scope dependencies.

Scope Provided deps

Option 1: Hack tools deps to return provided deps

There might be just a couple of spots to alter-var-root.

Con: brittle, we'd be hacking tools deps internals and those are subject to change
Pro: might be easy in the near term

Option 2: Use pomegranate

Con: Deps resolution differs from tools deps, but we could use it solely to discover provided deps.
Former con: lib was stale with CVEs, but I recently did a maintenance pass to resolve that.
Pro: We write less code
Neutral: Depends on maven resolver libs, but so does clojure tools deps

Option 3: Use maven resolver directly

Pro: no extra deps
Con: java interop
Con: might be just duplicating something quite small that works already (pomegranate)

Current choice

I think I'll explore option 2 first and see where that leads.

Transitive deps?

Should we be looking at these too? (see "specific usages" above).

For provided deps we probably just want to check non-transitives.
For other deps fixups we could fixup transitive runtime deps, but we'll start with current behaviour of only looking at non-transitive deps.

Future-features

I'm not going to worry right now about our plan to someday build docs from git-sources and how decisions for this issue might impact that feature.

Upstream usage

Does the cljdoc web site want/need any of this information?
The work could be saved for cljdoc if it needs it.
We do save the raw pom and I do see naively parsed dependencies loaded by cljdoc... but I don't think this data is currently used.

lread commented

Notes from an exploration

I've learned that pomegranate does not return provided deps by default.
We can twist its arm to do so, but I'm going to take another look at tools deps.

I am seeing some code that looks like a closer fit for cljdoc's current algorithm/strategy.
In clojure.tools.deps.extension.pom (which is internal, it is marked with :skip-wiki, which is the same as :no-doc) we have:

  • read-model which reads a pom (not naively)
  • model-deps returns non-transitive runtime deps for a pom

I could:

  • take the risk of using the internal read-model as is.
  • and bring over model-deps code, but modify it to return all non-transitive deps so we can get at provided deps

This exploration is a variation of option 1.

It might be fine for a first go.
If we learn it is too brittle moving forward, we can adapt.

I could question cljdoc-analyzer's current strategy of reading the pom, but I don't think I'll do that at this time.

lread commented

More Notes - provided exclusions

Things are starting to work locally.

One interesting thing is that our naive parsing of the pom did not pick up exclusions for provided deps.

Will dig deeper, but I think this is why http://repo.clojars.org/metosin/compojure-api/2.0.0-alpha27/compojure-api-2.0.0-alpha27.pom is failing for me locally. It excludes commons-codec/commons-codec, but ring.util/codec get loaded during analysis and fails because org.apache.commons.codec.binary.Base64 cannot be found.

I think I'll start with mimicking current naive parsing and not bring over provided exclusions.