bskinn/sphobjinv

idea: sphinx objects inventory shed

Hi there!

First, thanks a lot for sphobjinv! It has proven most useful in countless situations for me.

Since you have been playing with inventories a lot, I would like your feedback on an idea: a Sphinx objects inventory shed.

I am looking for ways to help colleagues write their technical documentation, especially in complex multi-platform environments (think cloud microservices). I find Sphinx is the best compromise for independent projects in multiple domains linking to each other, building hypermedia documentation. I also think that MkDocs is gaining traction because of its simplicity and because people love Markdown. The major downside there, in my opinion, is the lack of an inter-project reference engine (like intersphinx), especially since Sphinx users now have MyST for Markdown support.

  • Large initiatives within a corporation, or within a network of private or open-source organizations, need to be able to choose the platform (programming language, etc.) for each project.
  • Documentation is hard to maintain but critical, especially for distributed teams.
  • The features offered by Sphinx and intersphinx could do the trick, but they lack proper support and community engagement for non-Python projects (and even for some extremely nice Python projects, like starlette and pydantic, to name a few; see also pydantic/pydantic#1339).
  • intersphinx could in time be used for more than API domains; terms (within glossaries) are one example.

To help solve this, and inspired by sphobjinv and typeshed, I am thinking of a Sphinx objects inventory shed (or soished for short).

This project could provide a few things:

  • Documentation with examples of how to use Sphinx in multi-project, multi-language use cases
  • A Python API, leveraging sphobjinv, to help generate objects.inv files (see the sketch after this list)
  • Hosted within that documentation, some objects.inv files for major projects currently lacking them (like JavaScript primitives from Mozilla MDN)
  • Maybe a Sphinx extension to ease the configuration of intersphinx in non-trivial scenarios.
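
To make the Python API point concrete, here is a minimal sketch of what generating an objects.inv could look like with sphobjinv's existing API (the project name, version, and MDN entry are purely illustrative placeholders):

```python
import sphobjinv as soi

# Build an inventory from scratch (project/version are illustrative).
inv = soi.Inventory()
inv.project = "mdn-js"
inv.version = "1.0"

# Each documented object becomes a DataObjStr; the uri is relative to
# the base URL that consumers configure in intersphinx_mapping.
inv.objects.append(
    soi.DataObjStr(
        name="Array.prototype.map",
        domain="js",
        role="method",
        priority="1",
        uri="Reference/Global_Objects/Array/map",
        dispname="-",
    )
)

# Compress to the zlib-encoded format Sphinx expects and write it out.
text = inv.data_file(contract=True)
soi.writebytes("objects.inv", soi.compress(text))
```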

My commitment to this would be quite modest, but I feel that within a few weeks I could put something together to get started.

Do you ( @bskinn ), or anyone reading this, have any opinion about that idea?

Awesome! It's always great to hear that a project has been useful to someone other than just myself. :-)


High-level, I think the objective behind this (making it much easier for Sphinx documentarians to cross-reference into third-party, non-Sphinx document(ation) sets) is a really great idea -- Sphinx's robust cross-referencing functionality is really nice, and the recent emergence of MyST is a tremendous addition to the Sphinx ecosystem. A system that provides robust, searchable, discoverable objects.inv for documentation/documents not written in Sphinx seems like it might be hugely valuable.

I have a number of thoughts percolating on the idea... I'll post further here once I've pulled them together. I think one of the key questions is whether it actually makes sense to implement it as a centralized repository, as opposed to a set of advanced tooling for auto-creation of objects.inv files from webscraped data (part of this had already occurred to me at a high level; see #19), which each Sphinx documentarian would configure and use themselves. Freshness of the data in the objects.inv files seems likely to be a significant issue, as does the maintenance workload of the webscraper spider configurations themselves.
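
(To illustrate the "advanced tooling" alternative: a hypothetical soiscraper spider might, at its core, just harvest anchor ids from published HTML pages and turn them into inventory entries. A rough sketch, assuming requests and BeautifulSoup, with the per-site rules for which anchors matter and how to name them left out:)

```python
import requests
import sphobjinv as soi
from bs4 import BeautifulSoup

def scrape_page(base_url, page, domain="std", role="label"):
    """Harvest anchor ids from one HTML page into DataObjStr entries.

    Hypothetical helper; real target sites would need per-site rules
    for which anchors are meaningful and how to name them.
    """
    resp = requests.get(f"{base_url}/{page}", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup.select("[id]"):
        yield soi.DataObjStr(
            name=tag["id"],
            domain=domain,
            role=role,
            priority="1",
            uri=f"{page}#{tag['id']}",
            dispname="-",
        )
```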

Thank you for this feedback. I am still wrapping my mind around this myself. I feel that some hindrances to these objectives could be lifted with a few changes in sphinx.ext.intersphinx itself (sphinx-doc/sphinx#5562, for example).

Sorry for the slow reply on this!

I've been mulling this idea over, and aside from implementing a robust and simple way for people to set up suitable web scraping (I still like the name soiscraper for such a project...), I think the biggest challenge is finding a good way to manage freshness/staleness of the objects.inv files that are made available.

For Sphinx docsets, e.g. on ReadTheDocs, there's a guarantee of freshness in the objects.inv that lives with the docset. Every time the docs are built, a fresh objects.inv is produced, and so a user can be sure that the objects.inv they find with the docset is "fresh" ... it definitely contains an accurate representation of the documented artifacts and where they live in the HTML directory tree.

For an objects.inv that's created by the still-hypothetical soiscraper and hosted in a central shed, though, there are two flavors of "staleness" that can develop for it:

  1. Changes could have been made to the documentation set associated with the .inv, and so the soished's .inv is stale with respect to the actual documentation up on the web.
  2. If a documentarian downloaded that objects.inv some time ago (two days, two weeks, two months, ...), then their local copy may be stale with respect to the objects.inv that's currently hosted at soished.

If clean and sufficiently inexpensive ways can be figured out to manage these two kinds of staleness, then I think the idea is probably solid.

In terms of a soished implementation, Item 1 is the bigger deal, because if someone just sets their conf.py up to point at the soished objects.inv, then they'll get a fresh inventory whenever they build after make clean, or using make -e, and that ~takes care of Item 2.
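
(For concreteness, the consumer side would just be a normal intersphinx_mapping entry pointing at the shed; the URLs below are purely hypothetical, since the shed doesn't exist yet:)

```python
# conf.py -- hypothetical URLs; the shed does not exist yet
extensions = ["sphinx.ext.intersphinx"]

intersphinx_mapping = {
    # (base URL for link targets, location of the soished-hosted objects.inv)
    "mdn-js": (
        "https://developer.mozilla.org/en-US/docs/Web/JavaScript/",
        "https://soished.example.org/mdn-js/objects.inv",
    ),
}
```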

There are also cloud services cost aspects for both of these items:

  • Item 1 has the potential to require a lot of CPU and inbound & outbound bandwidth if the soished objects.inv files are rebuilt on a frequent basis;
  • Item 2 will involve a lot of outbound bandwidth if the soished gains traction, because per Sphinx default behavior, every time someone builds their docset after make clean or with the -e flag, the remote objects.inv files get downloaded again.

CDN caching would probably make sense for Item 2.

Some sort of intelligent microscraping of the website hierarchy under a documentation set, built into soiscraper or soished or both, that provides a guess as to whether it has been updated since the last scrape, might help reduce cloud billing beyond simply setting a conservative re-scrape interval (whether fixed or customizable per target docset). I don't have much experience with cloud, though, so it could be that the web scraping is low-traffic enough that it makes the most sense to just fully rescrape and regenerate on, say, a six-hour interval.
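
(One cheap way to make that guess, sketched under the assumption that the target site returns an ETag header: compare it to the value recorded at the last scrape and skip the full rescrape when it hasn't changed. Purely illustrative:)

```python
import requests

def looks_changed(url, last_etag=None):
    """Cheap freshness probe: compare the current ETag (if any) to the
    one recorded at the last scrape. Returns True when we can't tell."""
    resp = requests.head(url, timeout=30, allow_redirects=True)
    resp.raise_for_status()
    etag = resp.headers.get("ETag")
    if etag is None or last_etag is None:
        return True  # nothing to compare against; assume changed
    return etag != last_etag
```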