danfickle/openhtmltopdf

Cache + URL resolver doesn't work as expected

delafer opened this issue · 19 comments

Cache + URL resolver doesn't work as expected

            builder.useUriResolver(new NaiveUserAgent.DefaultUriResolver() {
                @Override
                public String resolveURI(String baseUri, String uri) {

                    String inClassPath = FsPdfGenerator.localResource(uri);
                    if (null != inClassPath) {
                        logger.info("Loading from classpath: "+uri);
                        java.net.URL url = FsPdfGenerator.class.getResource('/' + inClassPath);
                        if (url != null)
                            return url.toExternalForm();
                    }

                    return super.resolveURI(baseUri, uri);
                }

            });
  • a very simple test cache implementation
@Singleton
public class SimpleCache implements FSCache {

    private static final Logger logger = LoggerFactory.getLogger(SimpleCache.class);

    public static final int MAX_SIZE = 512;

    private final Map<FSCacheKey, Object> cache = lruCache(MAX_SIZE);

    private static <K,V> Map<K,V> lruCache(final int maxSize) {
        return new LinkedHashMap<K,V>(maxSize*4/3, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
                return size() > maxSize;
            }
        };
    }

    private ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

    public Object get(FSCacheKey cacheKey) {

        try {
            rwl.readLock().lock();
            Object obj = cache.get(cacheKey);
            if (logger.isDebugEnabled()) {
                logger.debug("Requesting: {} of type: {}, got it: {}", cacheKey.getUri(), cacheKey.getClazz().getName(), (obj != null));
            }
            return obj;
        } finally {
            rwl.readLock().unlock();
        }
    }

    public void put(FSCacheKey cacheKey, Object obj)  {
        try {
            rwl.writeLock().lock();
            if (logger.isDebugEnabled()) {
                logger.debug("Putting: {} of type: {}", cacheKey.getUri(), cacheKey.getClazz().getName());
            }
            cache.put(cacheKey, obj);
        } finally {
            rwl.writeLock().unlock();
        }

    }
}

In log you can see next messages:
com.openhtmltopdf.css-parse WARNING:: Couldn't parse stylesheet at URI file:/G:/Programme/tomcat/tomcat8.lst/webapps/ROOT/WEB-INF/classes/styles/resources/standard.css:

Stream closed
java.io.IOException: Stream closed
	at sun.nio.cs.StreamDecoder.ensureOpen(StreamDecoder.java:46)
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:148)
	at java.io.InputStreamReader.read(InputStreamReader.java:184)
	at com.openhtmltopdf.css.parser.Lexer.zzRefill(Lexer.java:1634)


java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
	at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:82)
	at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:130)
	at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1203)
	at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:388)

It's embarrassingly simple really, the cache appears to be completely broken! The first exception is because the CSS resource reader is closed after being read the first time. The second exception is because the bytes of an ImageResource are cleared after being converted to a PDF image, making it invalid to use a second time.

We can make this the cache system replacement thread. As outlined in #170 I think we need four caches. The first should be an internal per-page cache, with resources tied to a particular page. The second should be an internal per-document cache. We already have caching at these levels but it is ad-hoc throughout the code. The next cache should be a per-thread cache which will enable us to cache thread-unsafe objects. It will be external of the project but cache opaque objects. Finally we need an external thread-safe cache of last-resort for strings and byte arrays.

I think we start with the last one. Something like:

public class DefaultMultiThreadCache {
   protected final Map<String, String> _textCache = new ConcurrentHashMap<String, String>();
   protected final Map<String, byte[]> _byteCache = new ConcurrentHashMap<String, byte[]>();

   // Question: How best to synchronize this method?
   // Does it even need to be synchronized?
   public String getText(String uri) {
      String text = _textCache.get(uri);
      if (text != null) {
         return text;
      } else {
         byte[] bytes = getBytes(uri);
         if (bytes != null) {
            return new String(bytes, "UTF-8");
         } else {
            return null;
         }
      }
   }

   public byte[] getBytes(String uri) {
      return _byteCache.get(uri);
   }

   public void put(String uri, String text) {
      _textCache.put(uri, text);
   }

   public void put(String uri, byte[] bytes) {
      _byteCache.put(uri, text);
   }

}

Have you considered using the guava caches to implement the caches? It would of course mean to have guava as dependency on this project, but they have done caches right, and you can configure the cache however you want (i.e. with soft/weak refs, timeouts, max item count, ...). See the details at https://github.com/google/guava/wiki/CachesExplained

Using guava caches would mean that the cache just works, as they did extensive tuning and fixing.

// Question: How best to synchronize this method?
// Does it even need to be synchronized?

no, in this case it's not necessary at all, but we can't use ConcurrentHashMap "as-is", we should delete old or rarely used cached entries or limit maximum amount of entries, and if we will implement a more complex logic suitable for cache functionality , I mean some real cache strategy in this case most likely synchronization will be necessary. I should look the new code and I can help to make it thread-safe.

Yes, we can use SoftReference's to cache elements. As a default implementation a simple build in cache provider could be used, without third party deps, but if it is not enough, you can provide the opportunity to use more powerful cache solutions. So for example it's done in a hibernate with a connection pool, by default a simple built-in implementation is used, but there is a possibility to use (configure) the third party Implementations for production environments.

Using as default implementation commons third party libraries has also lead to problems e.G. if project already uses Guava cache for own needs but a different version with incompatible API (deprecated) methods and we will have dependencies resolution conflicts e.G. in maven with different versions , in most cases it's not critical, but it's annoying. It doesn't mean , that its a bad idea, but..
everything has its pros and cons. On the one hand, a well-tested stable implementation like guava cache or eh cache or something else, on the other hand, the problems described above. That's why a best practice to use build in simple implementation and API to give an ability to use third party cache implementations without necessity to write a lot lines of code and big headache.

I think @delafer is right here: We should only provide a very simple basic cache implementation, maybe even not multithreaded and not shared between runs and allow to provide a custom cache implementation with the Builder. If we have a nice interface it should be trivial to e.g. use the Guava cache to implement a real multithread safe cache, and also the user can configure the cache however he wants. Some applications may have loads of memory and may only need soft refs (if at all), other applications might be tight on memory and then use a size based evict strategy (e.g. in Guava using maximumWeight() and weighter()). Or implement a disk based cache etc.

Guava is very useful, but it's true that we could cause a bit of a dependency hell for the users of OpenHTMLToPdf. We could of course create a new openhtmltopdf-cache artefact, which would only contain some Guava based default implementations. But I am not sure if this is worth the effort.

We could of course create a new openhtmltopdf-cache artefact, which would only contain some Guava based default implementations. But I am not sure if this is worth the effort.

I think @rototor has right. Cache could speed-up generation of pdf documents, but openhtmltopdf is not a database or JPA layer or big data implementation or SOLR/Lucene search&index frameworks, I mean it's not a library which intensively works with a large amounts of input/output data, with a huge data flow, in other words: in most cases cache does not play a major role here and for most cases a simple solution is enough. it is not worth it

OK, I've started the new implementation in commit 1fa6701

Tell me what you think.

@danfickle Your cache interface features a get() and a put() operation. Thats fine per se, but does not allow to benefit from a unique property of the Guava caches: They guarantee that a value for a given key is not calculated twice by two threads at the same time. When fetching network resources this means that the cache can guarantee that you don't fetch the same resource twice and waste bandwidth. If two threads want to get the same key at the same time, one thread will wait for the result of the other thread.

To allow this, the FSMultiThreadCache interface must look different:

interface FSMultiThreadCache<TValue> {
  TValue get(String uri, Callable<TValue> lazyValueFetcher); 
}

I.e. the cache decides when to fetch the value. To avoid creating additional objects per cache call the lazyValueFetcher should be an interface like this:

interface FSMultiThreadCacheValueFetcher<TValue> {
   TValue fetchForUri(String uri);
}

This can be implemented once per UserAgent. It would then just delegate this to openReader()/openStream() and read the value (as it is now already).

A trivial default cache implementation would be:

public class SimpleNonCache<TValue> implements FSMultiThreadCache<TValue> {
  public TValue get(String uri, FSMultiThreadCacheValueFetcher<TValue> lazyValueFetcher) {
     return lazyValueFetcher.fetchForUri(uri);
  }
}

P.S.: Prefixing a template/generic parameter with T is just an old C++ habit ... :)

little joke: " American astronaut Scott Kelly condemned US President Donald Trump for threatening to start World War III. And what could he do, being held by the KGB agents, standing in front of the airlock? "

@danfickle : about last post of @rototor . He is damn right. Your simple map-like API doesn't help avoid loading some resources twice or even more times.
A very simple example.
e.G. to load some image or svg from remote URL you needs at least 250 ms.
You are starting 4 threads parallel in order to generate PDF documents with 100 ms interval.

Timeline:
0 ms
First thread was started , trying to fetch remotely "sample.jpeg", but it's doesn't exists in a cache, so we this thread tries to load at and put in a cache,

100 ms
Second thread was started 100 ms later trying to fetch "sample.jpeg", but it doesn't exists in the cache, so trying to load it and put in a cache

200 ms
the third thread started 200 ms later trying to fetch "sample.jpeg", but it still doesn't exists in the cache, so trying to load it and put in the caсhe

250 ms
first thread has finally loaded "sample.jpeg" resource and executes something like "cache.put(key, image_data)" , so put's raw image data in the cache.

300 ms
the fourth thead also started trying to fetch "sample.jpeg" and at this stage the first thread has already finished loading "sample.jpeg" (and stored it in the cache). So the forth tread fetching it finally from the cache instead of remote URL.

summing up: "sample.jpeg" will be loaded 3 times remotely and 1 time fetched from the cache. This implementation is still thread-safe, but it's not effective enough for caching purposes.

Thats why a special API like in guava should be used avoid redundant (remote) requests to save time and to save bandwidth usage / traffic.

This API forces other threads to wait until the necessary resource will be loaded. So, second and third, etc. threads should wait for the first thread until first thread finishes loading "sample.jpg"

In the next post I will describe how it works without third party libriries like guava. I will show a simple thread safe example using standard java thread locks and notifiers, instead of modern concurrency clases / patterns, which are more difficult to understand.

Prefixing a template/generic parameter with T is just an old C++ habit ... :)

I know it looks just like old good c++ or object pascal / delphi code. so I have immediately thought that you are an old-style programmer like me with an experience with c/c++ or with object pascal & delphi RAD or with all at once.

So, I have written now some simple cache implementation. It should be thread-safe and efficient enought,
It's "fresh" code, not tested, but I think I should work fine. I think It's a good example how to write such caching classes.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SimpleCacheImpl<K,V> {

    private Map<K,CacheEntry<V>> cacheMRU;
    private Map<K,CacheEntry<V>> cacheLRU;
    private ReentrantReadWriteLock lock;

    public SimpleCacheImpl(final int capacity)
    {
        init(capacity);
    }

    private void init(int capacity) {
        lock = new ReentrantReadWriteLock();
        cacheLRU = new WeakHashMap<>();

        cacheMRU = new LinkedHashMap<K,CacheEntry<V>>(capacity*4/3, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<K,CacheEntry<V>> entry)
            {
                if (this.size() > capacity) {
                    cacheLRU.put(entry.getKey(), entry.getValue());
                    return true;
                }
                return false;
            };
        };
    }

    public V get(K key)
    {
        try {
            lock.readLock().lock();
            CacheEntry<V> value = getCacheEntry(key);
            return value != null ? value.get() : null;

        } finally {
            lock.readLock().unlock();
        }
    }

    public CacheEntry<V> set(K key)
    {
        try {
            lock.writeLock().lock();
            CacheEntry<V> entry = getCacheEntry(key);
            if (entry == null) {
                entry = new CacheEntry<>();
                cacheMRU.put(key, entry);
            }
           return entry;
        } finally {
            lock.writeLock().unlock();
        }
    }


    private CacheEntry<V> getCacheEntry(K key) {
        CacheEntry<V> value = cacheMRU.get(key);
        if (value == null) {
            value = cacheLRU.get(key);
            if (value!=null) {
                moveToMRU(key, value);
            }
        }
        return value;
    }

    private void moveToMRU(K key, CacheEntry<V> value) {
        cacheLRU.remove(key);
        cacheMRU.put(key, value);
    }

    /**
     * The Class CacheEntry.
     *
     * @param <V> the value type
     */
    static class CacheEntry<V> {

        /** The obj. */
        private V obj;
        private boolean set;

        private final static long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(5);

        /** The lock obj. */
        private final Object lockObj = new Object();

        /**
         * Instantiates a new cache entry.
         */
        public CacheEntry() {

        }

        /**
         * Returns value containing object.
         *
         * @return the cached value
         */
        public V get() {

            try {
                synchronized (lockObj) {
                    long init = System.currentTimeMillis();
                    long current = init;
                    while (!set && ((current-init)<TIMEOUT_MS)) {
                        lockObj.wait(TIMEOUT_MS);
                        current = System.currentTimeMillis();
                    }
                }
            } catch (InterruptedException ignore) {}
            return obj;
        }

        /**
         * free locked entry
         */
        public void release() {
            synchronized (lockObj) {
                this.obj = null;
                this.set = true;
                lockObj.notifyAll();
            }
        }

        public void set(V obj) {

            synchronized (lockObj) {
                this.obj = obj;
                this.set = true;
                lockObj.notifyAll();
            }
        }

        @Override
        public String toString() {
            return obj != null ? obj.toString() : "empty";
        }
    }

}

Of course, such imlementation is only an example it could be easily optimized and tuned for special purposes. Guava uses much more advanced API for such cases, but as a illustration of some principles of work it could be used.
usage

In regards to the byte[] and String caches, I was thinking of avoiding the UTF-8 conversion. But since CSS is mostly ascii characters, this is likely to be a lite-weight conversion anyway.

Actually, I'm rethinking whether it is worth having a cache at this level at all. Users are likely to get their files from four main sources:

  • Files - for files the OS already provides a cache that's more performant than anything we use in Java. It is also better placed to know when to evict items.
  • Network - I don't think many users will try to use the network for their resources, but even if they do, it is recommended to plug in a network protocol stream factory, rather than use the ancient URLConnection that we use by default. They could easily plug-in a cache at the stream factory.
  • JARs - This is the only one I think needs a cache, as otherwise you get the hit of decompressing a zip entry every time.
  • Custom - Again, have to use a custom stream factory, so why not plug in a cache there?

I think it might be a better use of time to concentrate on a per-thread cache which could cache "compiled" objects such as the in-memory structure of a parsed CSS sheet, or a decompressed PNG. This might give worthwhile performance benefits when processing many documents.

What do you think?

I think it might be a better use of time to concentrate on a per-thread cache which could cache "compiled" objects such as the in-memory structure of a parsed CSS sheet, or a decompressed PNG. This might give worthwhile performance benefits when processing many documents.

For sure, it's a correct approach (like it's done for e.G. in Chrome, etc), but it is more difficult to implement.

cachelikechrome

Custom - Again, have to use a custom stream factory, so why not plug in a cache there?

Yes, but if we plug cache here, a non working cache / buggy cache implementation in openhtmltopdf library should be cleaned / removed from code (as a deprecated non-working impl.) or fixed. Why do we need a cache impl. in openhtmltopdf code, which doesn't work rightly ? You are project owner / maintainer, so you should decide.

In regards to the byte[] and String caches, I was thinking of avoiding the UTF-8 conversion. But since CSS is mostly ascii characters, this is likely to be a lite-weight conversion anyway.

IMHO: it's more convenient way doing such things:

    public Map<String, CacheEntry> mixedCachedEntries = new ConcurrentHashMap<>();

    interface CacheEntry {

    }

    interface ByteArrayCacheEntry extends CacheEntry {
        byte[] getData();
    }

    interface TextCacheEntry extends CacheEntry {
        String getData();
    }

@danfickle Have you worked before with jmc.exe (Java Mission Control :: FlightRecorder) profiler tool ? should be used together with oracle JVM and VM options:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"

It's better as jvisualvm and rather convenient, to find memory leaks and places in the code to optimize (where for example you need a cache or other optimizations steps)

@danfickle
BTW: If you interested to save memory & heap / memory allocation. You could read next article:

overview-of-memory-saving-techniques-java

if briefly: working only with byte[] arrays in a right way could help us to save memory. You could use also off-heap memory areas. Strings in java/jvm (jdk / jvm < 8.xx) always stored internally in unicode format. Latest versions >= 8 using some techniques for strings - to save / spare memory allocation.

All caches except the font metrics cache are now removed. Future caches will only be introduced if a compelling performance case can be made. Thanks everyone.