gbif/dwca-io

heavy transitive dependencies

Closed this issue · 5 comments

hey y'all -
Thanks for making dwca-io . . . it has worked pretty well for me so far!
I did notice that the Elton standalone jar grew significantly after adding dwca-io. After some digging, I found some heavy dependencies like org.apache.poi etc. and found myself excluding dependencies to reduce the jar by tens of MB. I am sure that there's a good reason for including the dependencies, so I won't be offended if they stay where they are. Just wanted to share my findings.

Here are the exclusions I am using now:

<dependency>
    <groupId>org.gbif</groupId>
    <artifactId>dwca-io</artifactId>
    <version>2.2</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.odftoolkit</groupId>
            <artifactId>simple-odf</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.freemarker</groupId>
            <artifactId>freemarker</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Hope this helps and curious to hear thoughts.

It was also discussed in Issue #39

Thanks @jhpoelen - firstly, at a glance those exclusions look sensible (assuming you want read-only access, and no open data format / Excel support).

dwca-io could definitely benefit from a refactor to ensure it has a tiny footprint, a rigorous selection of dependencies, vendoring of all dependencies to avoid user pain, and a separate module for format transformations (e.g. Excel, open data formats etc.).

With that said though, it's used everywhere and, as you have discovered, developers have generally found reasonable workarounds without too much effort. I suspect this is why it has not been acted upon quickly.

If you do find a blocking issue that you can't easily get around, please do let us know. If you have the time and motivation to work on this, we also welcome collaboration (for this I suggest that a proposal for change be documented and discussed first, as it'll affect several products).

Thanks for your replies @timrobertson100 and @cgendreau .

Here's my proposal:

  1. keep the dwca-io module, to be kind to existing users that either need or are not too concerned about adding several tens of MB of dependencies.
  2. make the dwca-io module depend on new modules like dwca-io-core, dwca-io-simple, dwca-excel, etc.
  3. dwca-io-core would include only the stuff needed to read from vanilla dwca archives.
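
To make the proposal above a bit more concrete, here is a rough sketch of what the dependency section of the aggregate dwca-io pom might look like. None of these modules exist today: dwca-io-core, dwca-io-simple and dwca-excel are just the hypothetical names from the list, and ${project.version} is a placeholder that assumes all modules are released together.

```xml
<!-- Hypothetical sketch: dwca-io stays as an "everything included" artifact
     that simply pulls in the proposed (not yet existing) modules. -->
<dependencies>
    <!-- proposed: only what is needed to read vanilla dwca archives -->
    <dependency>
        <groupId>org.gbif</groupId>
        <artifactId>dwca-io-core</artifactId>
        <version>${project.version}</version>
    </dependency>
    <!-- proposed module; its exact scope is still to be discussed -->
    <dependency>
        <groupId>org.gbif</groupId>
        <artifactId>dwca-io-simple</artifactId>
        <version>${project.version}</version>
    </dependency>
    <!-- proposed: spreadsheet support, which is where the heavy POI deps would live -->
    <dependency>
        <groupId>org.gbif</groupId>
        <artifactId>dwca-excel</artifactId>
        <version>${project.version}</version>
    </dependency>
</dependencies>
```

With a layout like this, existing users would keep depending on dwca-io and see no change, while a project like Elton could depend on dwca-io-core alone and drop the exclusions.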

As far as doing the actual work, I'd be open to hearing your ideas.

I've moved the spreadsheet handling stuff out of gbif-common and into gbif-common-spreadsheet (released). I've removed BeanHtmlSanitizer and its three library dependencies, since I can't see anywhere it's used (committed, not released).

We have both Guava and Apache Commons utils, but I think we use non-basic bits of both (e.g. the HTML entity decoder). Freemarker is used to write meta.xml files. We could rewrite things, but there are no longer obvious gains from splitting.

@MattBlissett thanks for making this happen!

I've upgraded to v2.3 and noticed the spreadsheet deps have disappeared. I am planning to remove the exclusions of com.google.code.findbugs:jsr305, commons-beanutils:commons-beanutils and com.googlecode.owasp-java-html-sanitizer:owasp-java-html-sanitizer in a future version, given your commits gbif/gbif-common@ea2cf9f and b703caa after the v2.3 / v0.42 releases.
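
For reference, a sketch of the three exclusions mentioned above as they would appear against v2.3. This lists only those three; it assumes 2.3 is the dwca-io version string, and it does not show whether any other exclusions (e.g. freemarker) are still in place.

```xml
<dependency>
    <groupId>org.gbif</groupId>
    <artifactId>dwca-io</artifactId>
    <!-- assumed version string for the v2.3 release -->
    <version>2.3</version>
    <exclusions>
        <!-- candidates for removal once a release includes the gbif-common cleanup commits -->
        <exclusion>
            <groupId>com.google.code.findbugs</groupId>
            <artifactId>jsr305</artifactId>
        </exclusion>
        <exclusion>
            <groupId>commons-beanutils</groupId>
            <artifactId>commons-beanutils</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.googlecode.owasp-java-html-sanitizer</groupId>
            <artifactId>owasp-java-html-sanitizer</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```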