yob/pdf-reader

Considerations for using ascii85_native gem

AnomalousBit opened this issue · 4 comments

Hello!

I've written a new Ascii85 encoder/decoder gem that uses native C extensions and is much faster than the pure-Ruby ascii85 gem that pdf-reader uses today. In a flamegraph of pdf-reader parsing several hundred small PDFs, around 60% of the execution time was spent in the ascii85 gem. With the new gem, it's down to about 1%.

I've been running ascii85_native on top of pdf-reader in production for the past few weeks, parsing hundreds of short PDF files without any issues, but it's obviously nowhere near as battle-tested as the ascii85 gem. Still, I'm sure I'm not the only one who might choose performance gains over a potential decrease in stability, and I wanted to give back.

Right now my fork of pdf-reader replaces Ascii85 completely. I know this is probably not what you would want to merge, which is why I didn't open a PR. If there is any interest, I'm curious if you might have any thoughts on how we might best provide a choice to use ascii85_native?

Some thoughts passed along to me include:

  • Add a new Ascii85NativeFilter and pass it on init: PDF::Reader.new(io, Ascii85NativeFilter)
  • Change the ascii85_native gem (or a fork) to use the same namespace as the Ascii85 gem (this seems dirty).

Hope this helps and someone finds it useful!

ascii85_native gem: https://rubygems.org/gems/ascii85_native
ascii85_native repo: https://github.com/AnomalousBit/ascii85_native
pdf-reader fork using ascii85_native: https://github.com/AnomalousBit/pdf-reader

yob commented

Interesting!

I assumed Ascii85 filters were relatively uncommon in the wild. Maybe my assumption is wrong, or maybe the files you're processing come from a PDF library with a preference for Ascii85?

One of the design goals of pdf-reader was to keep it native ruby so it can run anywhere. The performance gains from your gem are significant though.

I'm not that keen on the idea of making ascii85_native a dependency in the gemspec. However, I'm open to having pdf-reader dynamically try to load ascii85_native and use it if it's available. We could also add a note to the README (maybe in an advanced or optimisations section a bit further down) that suggests adding it to a project Gemfile like this:

gem 'pdf-reader'
gem 'ascii85_native'

If we go that path, it's probably easiest to keep the module name in ascii85_native distinct from the ascii85 gem. I think you could then do something like:

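# Prefer the native decoder when the optional ascii85_native gem has been loaded
if defined?(::Ascii85Native)
  ::Ascii85Native::decode(data)
else
  # Otherwise fall back to the pure-Ruby ascii85 gem
  ::Ascii85::decode(data)
end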
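
For the "dynamically try to load" part, a minimal sketch might look like the following. The helper name first_available is hypothetical and not part of pdf-reader, and the require names "ascii85_native" and "ascii85" are assumed to match the gem names. It only shows the general pattern of rescuing LoadError so that a missing optional gem is not fatal.

```ruby
# Hypothetical helper (not part of pdf-reader): try each candidate library in
# order and return the name of the first one that loads, or nil if none do.
def first_available(candidates)
  candidates.find do |lib|
    begin
      require lib
      true
    rescue LoadError
      # The library is not installed; move on to the next candidate.
      false
    end
  end
end

# Assumed require names: prefer the native gem, then the pure-Ruby gem.
backend = first_available(%w[ascii85_native ascii85])
puts "Ascii85 backend: #{backend.inspect}"
```

Once a load attempt like this has run, a defined?(::Ascii85Native) check can choose the decoder at call time, so projects that never install ascii85_native keep the pure-Ruby behaviour.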

AnomalousBit commented

Thank you for your insights, @yob!

Your comment sparked my curiosity. A quick, unscientific glance at my smallish collection of 4000 PDFs from various sources shows an Ascii85 to Flate (zlib) usage ratio of about 1:19, so Ascii85 accounts for roughly 5% of the two. This is probably not the big win I was hoping for. I had no idea Flate was so much more popular than Ascii85.
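
For anyone who wants to run a similar rough check, here is a sketch of one way to do it. It is not the script used for the numbers above. It assumes that counting the standard filter names /ASCII85Decode and /FlateDecode in the raw file bytes is a good enough proxy, which misses any filters declared inside compressed object streams. The method names are hypothetical.

```ruby
# Rough survey helper: count filter names in the raw bytes of a PDF.
# Filters declared inside compressed object streams are not seen.
def filter_counts(bytes)
  raw = bytes.b # binary copy, so invalid UTF-8 in the file does not matter
  {
    ascii85: raw.scan("/ASCII85Decode").size,
    flate: raw.scan("/FlateDecode").size
  }
end

# Share of Ascii85 among the two filters, as a fraction between 0 and 1.
def ascii85_share(ascii85, flate)
  total = ascii85 + flate
  return 0.0 if total.zero?
  ascii85.to_f / total
end

# A 1:19 ratio is 1 in 20.
puts ascii85_share(1, 19)
```

Feeding each file through filter_counts(File.binread(path)) and summing the results gives the totals to pass to ascii85_share.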

Regardless, a client I'm working with provides me with exclusively Ascii85 encoded PDFs. Maybe this will help someone else who lands in a similar situation.

I really like your suggestion of checking whether the Ascii85Native module is defined. I'm happy to open a PR following your recommendation if you'll consider merging it.

Thanks for sharing and maintaining your awesome gem!

yob commented

Sounds good! I'll keep an eye out for the PR.

yob commented

Support for the ascii85_native gem shipped in 2.6.0