sparklemotion/nokogiri

liblzma / xz dependency necessary?

gastonmorixe opened this issue · 5 comments

I see liblzma / xz is referenced in a few places and that the latest versions including "pre-build" dependencies (though I haven't seen them including liblzma/xz if I checked correctly).

Out of extra precaution, is it really necessary? asking not only for xz but - if we learn something from what happened (CVE-2024-3094) - for other dependencies as well to minimize future supply chain risks.

Thank you

https://nokogiri.org/tutorials/installing_nokogiri.html?h=liblzma

tar and xz files[](https://nokogiri.org/index.html#tar-and-xz-files)

Starting in v1.13.2, the source archive used for libxml2 and libxslt 
is compressed with xz (previous versions were compressed with gzip. 
As a result, when compiling from source, your system will need 
to have xz installed in order to extract the source code for these libraries.

#1694 (comment)

FranklinYu commented on Dec 1, 2017

I'm actually familiar with neither. AFAIK, liblzma (part of [XZ utilities](https://tukaani.org/xz/) 
is intended as drop-in replace of libz (and they claim better performance), 
so the compiler script tries liblzma before searching libz. According to my quick search.

url: "#{gnome_source}/sources/libxml2/#{minor_version}/#{recipe.name}-#{recipe.version}.tar.xz",

url: "#{gnome_source}/sources/libxml2/#{minor_version}/#{recipe.name}-#{recipe.version}.tar.xz",

ldd = %x(#{["env", "LANG=C", cross_ruby.tool("otool"), "-L", dll].shelljoin})
if (liblzma_refs = ldd.scan(/^\t([^ ]+) /).map(&:first).uniq.grep(/liblzma/))
liblzma_refs.each do |ref|
new_ref = File.join("/usr/lib", File.basename(ref))
sh(["env", "LANG=C", cross_ruby.tool("install_name_tool"), "-change", ref, new_ref, dll].shelljoin)
end

Thanks for asking this question. I'll take some time to reply thoughtfully.

Thank you

@gastonmorixe Thanks for your patience.

libxml2's tarballs have been .tar.xz since 2.9.13 (and libxslt's since 1.1.35) and this appears to be a GNOME project requirement (see https://gitlab.gnome.org/Infrastructure/Infrastructure/-/issues/768 for context). There doesn't seem to be anything this project or the libxml2 project maintainers can easily do to change this, but please feel free to raise an issue with the GNOME infrastructure project. In any case, this file's checksum is verified before it is decompressed and so we consider it "trusted" which should mitigate the risks associated with requiring xz-utils.

With respect to removing compression code from the library, https://gitlab.gnome.org/GNOME/libxml2/-/issues/709 indicates where upstream is headed. While many distros may continue to ship with compression enabled (and this project can't control that) we can control what options we compile the vendored libxml2/libxslt with, and I'll explore whether we can do that without breaking Nokogiri functionality.

Action items for me

  • explore removing compression from Nokogiri's vendored source and precompiled native builds

OK, I've done some exploration of removing the compression libraries from the libxml2 build and adding this test:

commit 22dd128b
Author: Mike Dalessio <mike.dalessio@gmail.com>
Date:   2024-04-07 13:30:12 -0400

    add test for how we handle compressed files
    
    this is insufficient coverage, there are other methods that likely
    support decompression. this test is only intended to demonstrate the
    problem with removing zlib/lzma from the libxml2 build.

diff --git a/test/files/staff.xml.gz b/test/files/staff.xml.gz
new file mode 100644
index 00000000..41404cce
Binary files /dev/null and b/test/files/staff.xml.gz differ
diff --git a/test/xml/sax/test_parser.rb b/test/xml/sax/test_parser.rb
index 8551a346..5d9aa343 100644
--- a/test/xml/sax/test_parser.rb
+++ b/test/xml/sax/test_parser.rb
@@ -274,6 +274,14 @@ def call_parse_io_with_encoding(encoding)
           end
         end
 
+        it "parses a compressed file" do
+          filename = XML_FILE + ".gz"
+          parser.parse_file(filename)
+
+          refute_nil(parser.document.start_elements)
+          assert_operator(parser.document.start_elements.count, :>, 30)
+        end
+
         it :test_render_parse_nil_param do
           assert_raises(TypeError) { parser.parse_memory(nil) }
         end

This test passes on CRuby main, but fails on JRuby main and on CRuby if we remove zlib and lzma from the libxml2 build.

This test demonstrates that there are a handful of "file open" methods that will decompress on the fly if the file is compressed and libxml2 supports it.

This is undocumented behavior, and we have no test coverage for it. But I'm hesitant to rip it out without some careful thought given how widely used Nokogiri is.

Going to think about this for a bit, but I think our options are:

  1. do nothing, keep zlib/liblzma as a dependency.
  2. rip it out, potentially breaking some users who will have to work around it.
  3. remove the libraries from the libxml2 build, but fully support decompression by adding this feature to the Nokogiri gem.

In an ideal world, I'd like to do 3, but that seems like a lot of work and there is more urgent work I'd prefer to do in Nokogiri.

Like I said, I'm going to think about this.