readium/architecture

Clarification on lvl. 1 pub servers' "in-memory" publications

chocolatkey opened this issue · 5 comments

At the moment, I have started expanding what I would call my golang publication server/streamer to be closer to the reference implementations so that it is more closely compatible with the official spec. Something that I've noticed the reference JS streamer does is load up all publications in-memory as described in the Level 1 spec (https://github.com/readium/architecture/tree/master/server#level-1): "They must have an in-memory representation of the publications that they serve".
Am I correct in assuming that "in-memory" means the server's RAM (if not, ignore the next paragraph or so)? When starting the JS streamer on a collection of thousands of EPUBs, it takes a long time to start because it is loading up every single publication and creating a feed. I would like to conform to the spec, however caching metadata on thousands of publications in-memory results in slow startup times and large memory usage. In my implementation, I have been waiting until the publications are requested for the first time before parsing them and then caching them for a while (I also haven't had to deal with archives, as I've only been loading exploded publications). However, I understand the need to have them loaded so that a feed can be generated for them. This seems to call for a proper database of some sort, or for directory sorting (like I do).
Would it be possible to have a lvl. 1+ compliant publication server that does not "have an in-memory representation of the publications that they serve"?

Edit: the r2 golang streamer does the same thing, loading all publications in-memory (although in a non-blocking manner).
Edit 2: Another option I've been considering is moving the generation of the OPDS feed away from being the publication server's direct responsibility.

Hello, thank you for your input!

When starting the JS streamer on a collection of thousands of EPUBs, it takes a long time to start because it is loading up every single publication and creating a feed.

Just to clarify: by "feed" I assume you do not mean OPDS feed. The creation of an OPDS feed is indeed a time-consuming operation in the current prototype/experimental implementation (i.e. micro-service in r2-streamer-js, for test purposes only), but this is an opt-in feature that must be explicitly requested (i.e. not automatically launched on startup).

So, by "feed" I think you refer to the fact that the current r2-streamer-js CLI utility starts up an instance of the server by scanning a designated folder on the filesystem, in order to find publication file paths to serve. Yes, I can imagine that this may be a time-consuming operation when many EPUBs are present.

In my implementation, I have been waiting until the publications are requested for the first time before parsing them and then caching them for a while

This is exactly what r2-streamer-js does. Although the designated filesystem folder is scanned on startup, the actual publications are lazy-loaded (i.e. the actual file loading and EPUB parsing only occurs when a request is received). Once a publication is loaded/parsed, the WebPubManifest is ready and it is stored in an in-memory runtime cache, in order to avoid costly loading/parsing in subsequent requests. See server.ts loadOrGetCachedPublication().
https://github.com/readium/r2-streamer-js/blob/6afd2362c4e0eaa47bac2abac89340fdee52c2e1/src/http/server.ts#L324
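
For readers working in Go, the strategy described above (register paths at startup, parse on first request, reuse the result afterwards) could be sketched roughly as follows. This is not the r2-streamer-js code, which is TypeScript. The names `Server`, `Manifest`, `ParseFunc`, `Register` and `LoadOrGetCached` are hypothetical and exist only for this illustration, and the expensive parsing step is passed in as a function so that the example stays self-contained.

```go
package main

import (
	"errors"
	"sync"
)

// Manifest stands in for a parsed Readium Web Publication Manifest.
// Hypothetical type: a real one would carry metadata, reading order, etc.
type Manifest struct {
	Path string
}

// ParseFunc is the expensive step (open the publication, parse it, build the manifest).
type ParseFunc func(path string) (*Manifest, error)

// ErrNotRegistered is returned for paths the server was never told about.
var ErrNotRegistered = errors.New("publication not registered")

// Server keeps a cheap registry of known paths and a cache of parsed manifests.
type Server struct {
	mu         sync.Mutex
	registered map[string]struct{}
	cache      map[string]*Manifest
	parse      ParseFunc
}

func NewServer(parse ParseFunc) *Server {
	return &Server{
		registered: make(map[string]struct{}),
		cache:      make(map[string]*Manifest),
		parse:      parse,
	}
}

// Register records paths without opening any file.
// This is the only work done at startup.
func (s *Server) Register(paths ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, p := range paths {
		s.registered[p] = struct{}{}
	}
}

// LoadOrGetCached parses a publication on its first request and returns the
// cached manifest on later requests. For simplicity the lock is held while
// parsing, so concurrent first requests are serialized.
func (s *Server) LoadOrGetCached(path string) (*Manifest, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.registered[path]; !ok {
		return nil, ErrNotRegistered
	}
	if m, ok := s.cache[path]; ok {
		return m, nil
	}
	m, err := s.parse(path)
	if err != nil {
		return nil, err
	}
	s.cache[path] = m
	return m, nil
}
```

With this shape, scanning a folder of thousands of EPUBs only fills the `registered` map, and memory use grows with the number of publications actually requested rather than the number registered.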

@danielweck Actually I did mean OPDS feed, but what you explained was useful anyway. That's why I'm thinking of moving the OPDS feed off the publication server, because it would be nice for it to remain database-less. I'm planning on making my publication server compatible with multi-tenant setups anyway, so it would make sense for each tenant to have their own feed. I'm open to better ideas.

Actually, I came up with a better way to do things: make the features that require a db and search indexing optional and expandable by using the publication server as a package in a codebase that includes that functionality.

When starting the JS streamer on a collection of thousands of EPUBs, it takes a long time to start because it is loading up every single publication and creating a feed. I would like to conform to the spec, however caching metadata on thousands of publications in-memory results in slow startup times and large memory usage.

There's no requirement to do that.

A smarter approach to this problem would be to define an LRU cache when initializing the server and to dynamically fetch/parse packaged publications as you need them. Thanks to the LRU cache, frequently accessed publications would remain in memory, while less frequently accessed publications would eventually get swapped out of it.
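
As a rough illustration of that idea, here is a minimal LRU cache in Go built on the standard `container/list` package. It is a sketch under stated assumptions, not code from any Readium project: the names `LRU`, `NewLRU`, `Get`, `Put` and `Len` are hypothetical, keys are assumed to be publication paths or identifiers, and the cache is not safe for concurrent use on its own.

```go
package main

import "container/list"

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value any
}

// LRU is a fixed-capacity cache that evicts the least recently used entry.
// It is not safe for concurrent use; guard it with a mutex in an HTTP server.
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

// NewLRU creates a cache holding at most capacity entries (minimum 1).
func NewLRU(capacity int) *LRU {
	if capacity < 1 {
		capacity = 1
	}
	return &LRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks it as most recently used.
func (c *LRU) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a value, evicting the least recently used
// entry when the capacity is exceeded.
func (c *LRU) Put(key string, value any) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports how many entries are currently cached.
func (c *LRU) Len() int { return c.order.Len() }
```

A server would call `Get` on each request and fall back to parsing the publication plus `Put` on a miss. The capacity then bounds memory use regardless of how many publications exist on disk.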

When starting the JS streamer on a collection of thousands of EPUBs, it takes a long time to start because it is loading up every single publication and creating a feed.

@HadrienGardeur, this quote is an incorrect / misleading statement, so I would like to clarify once again for readers who will miss / skip the previous messages:

In the r2-streamer-js implementation, there is an optional OPDS "micro service" which constructs a feed that corresponds to all the publications currently registered within the server instance. This OPDS feed is created by a non-blocking process, the first time a well-known HTTP route is explicitly requested (i.e. not at server startup), or whenever the URL is fetched after the OPDS feed is invalidated (i.e. when a publication is added to / removed from the streamer's internal state). This experimental / prototype OPDS feature is not part of the Readium2 architecture for the "streamer" component, it is provided specifically in the r2-streamer-js implementation to demonstrate + test the OPDS2 format which is based on the ReadiumWebPubManifest model. Note that there is also a JSON-Schema validation pass (when pretty-printing the feed) which can itself be quite time-consuming, but once again this does not affect server startup.
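
The "build on first request, rebuild only after invalidation" behaviour can be sketched in Go as shown here. This is an assumption-laden illustration rather than a port of r2-streamer-js: `FeedCache`, `NewFeedCache`, `Get` and `Invalidate` are hypothetical names, the feed is treated as opaque bytes, and unlike the non-blocking process described above, this simplified version makes callers wait while the feed is being built.

```go
package main

import "sync"

// FeedCache builds a feed on first request and rebuilds it only after
// it has been invalidated.
type FeedCache struct {
	mu    sync.Mutex
	feed  []byte
	valid bool
	build func() ([]byte, error) // expensive: walks every registered publication
}

func NewFeedCache(build func() ([]byte, error)) *FeedCache {
	return &FeedCache{build: build}
}

// Get returns the cached feed, building it first if no valid copy exists.
// Nothing is built until the first call, so server startup stays cheap.
func (f *FeedCache) Get() ([]byte, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.valid {
		return f.feed, nil
	}
	feed, err := f.build()
	if err != nil {
		return nil, err
	}
	f.feed, f.valid = feed, true
	return f.feed, nil
}

// Invalidate marks the feed as stale. Call it whenever a publication is
// added to or removed from the server's registry.
func (f *FeedCache) Invalidate() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.valid = false
	f.feed = nil
}
```

An HTTP handler for the feed route would simply call `Get`, while the add/remove code paths would call `Invalidate`.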

Now, regarding the "streamer"'s internal state: there is an in-memory registry that records paths/URLs where publications can be fetched, and there is a lazy-loading strategy to avoid unnecessarily stressing the server instance at startup. The processing costs related to loading + parsing publications (i.e. computing the actual ReadiumWebPubManifest models) are incurred only on-demand, during incoming HTTP requests. The RWPM definitions are stored in an in-memory cache to optimize subsequent requests. The decision to destroy cache entries and to remove publications from the internal registry is an integration concern.
For example, the Readium "desktop" app programmatically invokes the server API exposed by r2-streamer-js in order to manage the lifecycle of the "streamer" instance and its registered / loaded publications (typically, the server is not needed at all when users browse publications in the app's bookshelf, but the server needs to be started when users want to actually read books).
The r2-streamer-js deployment at Heroku and Now.sh is for demonstration / test purposes, and its CLI bootstrapper currently populates the registry of publications at startup (but does not load them), based on a filesystem folder which is scanned to discover available EPUBs. Publications can subsequently be added as demonstrated by the HTTP route which loads an external URL (this feature is used mostly for testing the "streamer" component as a proxy for remotely-hosted publications). A real-world integration would need to implement all necessary services and to define associated HTTP routes, in order to manage publications as required (i.e. registry add/remove, cache load/unload).

PS: the r2-streamer-js cache of loaded ReadiumWebPubManifest models currently grows indefinitely. There has been a "TODO" comment from day one in the TypeScript source code to implement an LRU (Least Recently Used) caching strategy. As the sole developer contributing (to date) to r2-streamer-js, I have not implemented LRU because it is unnecessary in the context of the Readium "desktop" application (which is the primary official/known integration of r2-streamer-js, as far as I know). I have now filed an issue to track this for developers who need to integrate r2-streamer-js inside long-lived / rarely-restarted server containers: readium/r2-streamer-js#47