emgarten/NuGet.CatalogReader

Smarter caching on catalog pages and items?


Today, the caching mechanism of CatalogReader is based on time. Catalog is an append-only structure so caching can be done in a smarter way.

I can think of these options for improving this:

1. Cache items (leaves) forever, and cache pages with a non-MAX commitTimeStamp forever

For the NuGet.org catalog implementation, this should be sufficient since catalog items never change and only the last page of the catalog changes. Since there is no way to compare catalog pages other than commitTimeStamp, we have to treat any page with the MAX commitTimeStamp value as the "last" page. In reality, there is only ever one page with the MAX commitTimeStamp since a bit of time always passes between two commits.
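As a minimal sketch of this rule (in Python, since the library itself is C#): given the catalog index's page entries, every page except the one carrying the MAX commitTimeStamp can be cached indefinitely. The URLs below are illustrative, not real catalog endpoints, and the string comparison assumes all timestamps share the same fixed-width RFC 3339 format, as they do on NuGet.org today.

```python
def indefinitely_cacheable_pages(index_items):
    """Return the @id of every catalog page that can be cached forever.

    index_items: the "items" array of a catalog index, e.g.
    [{"@id": ".../page0.json", "commitTimeStamp": "2015-02-01T06:22:45.8488496Z"}, ...]

    Assumption: timestamps use one uniform format, so lexicographic
    comparison matches chronological order. The single page holding the
    MAX commitTimeStamp is the mutable "last" page and is excluded.
    """
    max_ts = max(item["commitTimeStamp"] for item in index_items)
    return [item["@id"] for item in index_items
            if item["commitTimeStamp"] != max_ts]
```

Leaves (catalog items) would skip this check entirely and always be cached forever.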

However, both CatalogReader and NuGet.org's CommitCollector handle the case where a page or catalog item gets a new commitTimeStamp, even if the catalog item already exists or the page isn't the last one. We would be losing this flexibility. This may be acceptable, but since the catalog is not officially spec'd and there may be other implementations out there, it's hard to say whether this is a good idea.

2. Use the commitId as part of the cache key and cache pages and items forever

This retains the flexibility lost in option 1 but bloats the HTTP cache. There will be N copies of each page in the cache, where N is the number of different commits observed by the reader on that page.

This is probably the simplest solution.
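A hypothetical sketch of such a cache key: hash the document URL together with its commitId, so a re-fetched page with a new commit lands in a new cache slot. The `.json` suffix and the `|` separator are arbitrary choices here, not anything the library prescribes.

```python
import hashlib

def cache_key(url, commit_id):
    """Build an HTTP-cache file name from a catalog URL plus its commitId.

    Same url + same commitId -> same key (cache hit forever); a new commit
    on the same url produces a fresh key, which is exactly the N-copies-
    per-page bloat described above.
    """
    raw = f"{url}|{commit_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest() + ".json"
```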

3. Store the commitId for all pages and items in an external store (JSON file?)

This avoids the bloat of option 2 but has additional complexity since now we have to invent a new data store thingy.
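That "data store thingy" could be as small as a JSON file mapping each URL to its last observed commitId. A hypothetical sketch, assuming the reader checks the index's advertised commitId before deciding whether its cached copy is still valid:

```python
import json
import os

class CommitIdStore:
    """Tiny JSON-backed map of url -> last observed commitId (illustrative).

    If the stored commitId matches the one advertised by the catalog index,
    the cached document is still valid; otherwise re-download and record
    the new commitId. Only one copy per document is kept, avoiding the
    bloat of option 2.
    """

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def is_fresh(self, url, commit_id):
        return self.data.get(url) == commit_id

    def record(self, url, commit_id):
        self.data[url] = commit_id
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

The extra complexity is real, though: this file must stay consistent with the HTTP cache, and corrupting or deleting one without the other leaves stale entries.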

Conclusion

What are your thoughts?

Also, am I missing something here?

I like option 1 the best. When I get to documenting the V3 protocol, I hope to mandate that the only mutable catalog page is the last and that catalog items are immutable.

/cc @emgarten

I like the first option best also; I really wouldn't expect old pages to add new entries or change. Based on everything I know about the catalog, the design is that only the newest page changes.

Would the result of this be an API that mirrors the catalog itself locally and runs against that?

After thinking about this more, a catalog is append-only but there are times when it is rebuilt, or a new catalog is created. The reader/caching should be able to handle this.

For nuget.org the catalog would likely get put under a new base url, which has happened maybe once or twice since the v3 feed was created. If the reader didn't use the base url as part of the cache key, it would end up appending the new catalog, which would contain everything again, to the catalog it had already read. This might end up working, but it wouldn't be correct.

For Sleet the catalog will be rebuilt under the same base url, and that is expected when the feed is upgraded from one version to the next. Sleet should probably handle this better and append a unique id, but a reader could also use the catalog's start time as part of the cache key.
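Both failure modes could be covered by folding the identifying fields into a cache namespace. A hypothetical sketch, where `catalog_start_time` stands in for whatever rebuild marker the catalog exposes (this field name is an assumption, not part of the current catalog format):

```python
def catalog_cache_namespace(base_url, catalog_start_time):
    """Derive a cache namespace that survives catalog rebuilds.

    base_url isolates the cache when the catalog moves to a new url
    (the nuget.org case); catalog_start_time (a hypothetical rebuild
    marker) isolates rebuilds under the same url (the Sleet case).
    """
    return f"{base_url.rstrip('/')}#{catalog_start_time}"
```

Any entry keyed under an old namespace simply stops being hit once the catalog moves or is rebuilt, so stale data never mixes with the new catalog.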