rust-lang/rust

Tracking issue for RFC 2963: rustdoc JSON backend

Manishearth opened this issue · 21 comments

RFC PR: rust-lang/rfcs#2963
RFC: https://rust-lang.github.io/rfcs/2963-rustdoc-json.html
Documentation: https://doc.rust-lang.org/nightly/nightly-rustc/rustdoc_json_types/
Issues: A-rustdoc-json Area: Rustdoc JSON backend

Todo:

  • Implement a JSON backend according to the rough spec in the RFC
  • Experiment with the JSON backend, improving the design as necessary
  • Potentially open a new RFC with the final design, and stabilize it

cc @P1n3appl3 if you want to help fill in substeps for the tracking issue and/or help implement them

Implemented in #79539

@rustbot modify labels: +A-rustdoc-json

Encountered an issue while trying to jsonify bevy:
#98465

Edit: Solved

Some blockers from the Zulip discussion:

  • Settle on a meta-format for format_version
    • Decide how to deal with nightly-only lang features
  • jsondoclint passes on core + friends (#106435)
  • Ensure rustdoc-json-types is fully documented
  • decide what to do with https://crates.io/crates/rustdoc-types
  • robust cross-crate item lookup
  • go over names again
  • very carefully consider the impact of new language features, to ensure we can add them without breaking anyone. syn-2's model and release notes are helpful for this.
Xuanwo commented

A quick note for users who want to try this feature:

cargo +nightly rustdoc --lib -p <your-package> -- -Z unstable-options --output-format json

I notice that the output from this doesn't include items in stdlib (just references to them), just like the old -Zsave-analysis feature. With -Zsave-analysis, you could get the stdlib analysis from the rust-analysis rustup component. Is there a way to generate it using rustdoc, and/or are there any plans to ship a similar rustup component to fill that gap?

There's a nightly-only component called rust-docs-json that includes rustdoc for built-in modules like core, std, alloc etc.

rustup component add rust-docs-json --toolchain nightly

Depending on the details of your use case, it's also possible that some of the machinery that powers cargo-semver-checks may be of use. For example, here's a playground query for "items with allowed lints defined in core." I maintain cargo-semver-checks and its underlying infrastructure, and would be happy to chat if this looks interesting. I'm also in the rust-lang Zulip if that's easier. This is subject matter I find professionally interesting, so I'd love to hear about what you're looking to build!

My use case is rsbrowse, a TUI interactive code browser. I implemented it using -Zsave-analysis (so currently it doesn't work, unless you can point it at a really old compiler), and the rustdoc JSON output seems like a perfect replacement for that.

I had initially tried replacing it with rust-analyzer's LSIF output, but that's primarily intended for going from source code text -> type info, whereas I really want the reverse, and it's not good for that.

Very cool! I hadn't heard of rsbrowse before. I think rsbrowse and cargo-semver-checks run into the exact same combination of issues: resolving and querying items, which are possibly across crate boundaries, and then pointing to specific bits of code as needed (whether to explore or to flag a semver issue). I'm hopeful that they might be able to share a solution as well, if you're interested!

I recently gave a talk on the architecture behind cargo-semver-checks which might be of interest: https://www.youtube.com/watch?v=Fqo8r4bInsk

In any case, I have quite a bit of experience with rustdoc JSON, including how to minimize the amount of work needed to stay up to date on rustdoc format versions, which change rather frequently: on average, about once per Rust release. There's also a limitation in rustdoc JSON at the moment that makes it somewhat challenging to resolve items across crate boundaries, but I think it's surmountable. Give me a ping if I can help, or if you're interested in building atop some of the same infrastructure that powers cargo-semver-checks and our playground!

I took a couple days and reimplemented rsbrowse's backend to use rustdoc json, and overall it is a perfect fit! I found it much easier using this than the RLS save-analysis data (for my use-case at least). Even implementing cross-crate lookup was easy: my program simply looks in the local crate's .index first, then if it's not found, in .paths, then uses the first component to figure out which other json file to look in, and then matches based on the full path. It works great, as long as you remember to always pass around a crate name with an item's ID :)
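
The lookup order described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not rsbrowse's actual code: `CrateDoc`, `resolve`, `to_path`, and `example_docs` are hypothetical names, the maps are simplified stand-ins for the `index` and `paths` sections of a rustdoc JSON file (not the real `rustdoc_json_types` structures), and all IDs are made up.

```rust
use std::collections::HashMap;

/// Simplified stand-in for one rustdoc JSON file (hypothetical model,
/// not the real `rustdoc_json_types::Crate`).
pub struct CrateDoc {
    /// Items defined in this crate: item ID -> item name.
    pub index: HashMap<String, String>,
    /// Path summaries for local and external items: item ID -> full path.
    pub paths: HashMap<String, Vec<String>>,
}

/// Resolve `id`, as seen from `crate_name`, to the (crate name, ID) pair of
/// the crate that defines the item. The crate name always travels with the ID.
pub fn resolve(
    docs: &HashMap<String, CrateDoc>,
    crate_name: &str,
    id: &str,
) -> Option<(String, String)> {
    let local = docs.get(crate_name)?;
    // Step 1: the item is defined in the local crate.
    if local.index.contains_key(id) {
        return Some((crate_name.to_string(), id.to_string()));
    }
    // Step 2: fall back to the local `paths` map to learn the full path.
    let full_path = local.paths.get(id)?;
    // Step 3: the first path component names the crate (JSON file) to open.
    let target_name = full_path.first()?;
    let target = docs.get(target_name)?;
    // Step 4: match on the full path, because IDs differ between files.
    target
        .paths
        .iter()
        .find(|(other_id, path)| *path == full_path && target.index.contains_key(*other_id))
        .map(|(other_id, _)| (target_name.clone(), other_id.clone()))
}

/// Helper: build an owned path from string literals.
pub fn to_path(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Two tiny made-up documents: `app` refers to an item defined in `core`.
pub fn example_docs() -> HashMap<String, CrateDoc> {
    let mut docs = HashMap::new();
    docs.insert(
        "app".to_string(),
        CrateDoc {
            index: HashMap::from([("0:3:3".to_string(), "main".to_string())]),
            paths: HashMap::from([("2:10:5".to_string(), to_path(&["core", "option", "Option"]))]),
        },
    );
    docs.insert(
        "core".to_string(),
        CrateDoc {
            index: HashMap::from([("0:1:1".to_string(), "Option".to_string())]),
            paths: HashMap::from([("0:1:1".to_string(), to_path(&["core", "option", "Option"]))]),
        },
    );
    docs
}
```

With these made-up documents, `resolve(&example_docs(), "app", "2:10:5")` finds the entry in `core`, while an ID that appears in neither `index` nor `paths` yields `None`.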

One problem I've found so far is that the stdlib files seem to be missing things from their ".paths" map which are referenced from elsewhere in the json, which makes reliably figuring out full paths difficult, but this can be worked around.

EDIT: I just realized, it's probably because the stdlib files were generated without --document-private-items and --document-hidden-items being specified. When rsbrowse invokes rustdoc for the workspace crates, it passes these flags, so I never saw missing paths until I started loading the stdlib files. Maybe it would make sense to generate those files with all info included instead?

There also doesn't seem to be a way to get it to generate json for build-dependencies like syn. I'm calling cargo doc --workspace and passing a RUSTDOCFLAGS to enable JSON output, and it gets most things, but not build- or dev-dependencies. Maybe I'll have to resort to parsing cargo metadata and calling rustdoc manually?

I'm also not sure how it handles multiple versions of a crate in a dependency tree.

But overall, for me, this feature is like 95% perfect, and I can pretty easily work around the remaining rough edges. I'm very happy with the data format!

example errors showing missing paths in `core`
missing path for ItemId(CrateId { name: "core" }, Id("0:3244:262")) (Try)
missing path for ItemId(CrateId { name: "core" }, Id("0:7480:163")) (IntoIterator)
missing path for ItemId(CrateId { name: "core" }, Id("0:7480:163")) (IntoIterator)
missing path for ItemId(CrateId { name: "core" }, Id("0:7480:163")) (IntoIterator)
missing path for ItemId(CrateId { name: "core" }, Id("0:3249:143")) (FromResidual)
missing path for ItemId(CrateId { name: "core" }, Id("0:7443:19788")) (Product)
missing path for ItemId(CrateId { name: "core" }, Id("0:3249:143")) (FromResidual)
missing path for ItemId(CrateId { name: "core" }, Id("0:7439:19789")) (Sum)
missing path for ItemId(CrateId { name: "core" }, Id("0:7476:142")) (FromIterator)
missing path for ItemId(CrateId { name: "core" }, Id("0:3255:14999")) (Residual)

Agree with @wfraser. I've been doing a big rewrite of my code gen crate, and took the time to properly crawl the output for the things I'm interested in. My biggest issue so far is cross-crate lookups: my use case requires analyzing this JSON from multiple crates and then matching items up, specifically traits and their impls across crates.

Everything works fine inside the one local crate, but IDs are not stable across multiple crates (sometimes they seem to line up, but not always). This means that if I have an Item for which I only have the .paths reference, then to look it up across crates I need a tree of all of the paths I crawled in all other crates, and I look up this path in that tree. If there is a match, I can use it.

This lets me generate public import paths in code gen. It would be nice to have some sort of stable ID which works across crates, maybe a combination of crate name and the ID.
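
A rough sketch of that workaround: merge the `paths` entries of every crawled crate into one map keyed by full path, and use a (crate name, ID) pair as the cross-crate key. `GlobalId`, `PathIndex`, `add_crate`, and `lookup` are hypothetical names, the IDs and paths are illustrative, and the sketch assumes at most one version of each crate in the dependency tree.

```rust
use std::collections::HashMap;

/// An item ID qualified by the crate whose JSON file it came from.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct GlobalId {
    pub crate_name: String,
    pub id: String,
}

/// Full item path -> defining item, merged over every crawled crate.
#[derive(Default)]
pub struct PathIndex {
    by_path: HashMap<Vec<String>, GlobalId>,
}

impl PathIndex {
    /// Record the `paths` entries of one crate. Only paths whose first
    /// component is the crate's own name are kept, so that entries for
    /// external items (whose IDs are local to that file) do not shadow
    /// the entry from the defining crate.
    pub fn add_crate(&mut self, crate_name: &str, paths: &[(&str, Vec<&str>)]) {
        for (id, path) in paths {
            if path.first() == Some(&crate_name) {
                let key = path.iter().map(|s| s.to_string()).collect();
                let value = GlobalId {
                    crate_name: crate_name.to_string(),
                    id: id.to_string(),
                };
                self.by_path.insert(key, value);
            }
        }
    }

    /// Look up the defining item for a full path, if any crate provided it.
    pub fn lookup(&self, path: &[&str]) -> Option<&GlobalId> {
        let key: Vec<String> = path.iter().map(|s| s.to_string()).collect();
        self.by_path.get(&key)
    }
}
```

An item known only by its path in one crate can then be matched against the crate that defines it, and the resulting `GlobalId` is usable as a map key in later passes.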

I'm also not sure how it handles multiple versions of a crate in a dependency tree.

AFAIK there's currently no guaranteed way to resolve cross-crate imports across multiple versions of the same crate. This is why cargo-semver-checks doesn't support cross-crate analysis at the moment.

I believe @LukeMathWalker was looking into adding a way to reliably look up items across crate boundaries. At the moment I can't seem to find the link to the thread where that was discussed, so I'm not sure what the status on that is.

My strategy to overcome this limitation relies on rust-lang/compiler-team#635 (or some variations of it) landing.
We haven't yet found consensus on how to approach it, and I haven't recently had the time to do further research and build momentum for it.

See the first section of #106697 for what I eventually want to do on this. (And more broadly, don't bet on anything in that issue getting finished this year; it's been a lot.)

orium commented

It seems that inner items, such as struct fields or enum variants, are not included in crate.paths. This makes it particularly annoying to map an item id to an item path. For context, I'm the author of cargo-rdme, and this tool needs to go from intralink to item path (via item id).

For instance, if we have this code:

struct MyStruct {
    my_field: u64,
}

we will get this json (some fields omitted for brevity):

{
  "index": {
    "0:5:1777": {
      "id": "0:5:1777",
      "name": "my_field",
      "inner": { "struct_field": { "primitive": "u64" } }
    },
    "0:4:1776": {
      "id": "0:4:1776",
      "name": "MyStruct",
      "inner": { "struct": { "kind": { "plain": { "fields": [ "0:5:1777" ] } } } }
    }
  },
  "paths": {
    "0:4:1776": {
      "path": [ "foo", "MyStruct"],
      "kind": "struct"
    }
  },
  "format_version": 28
}

Note that there is no path for foo::MyStruct::my_field: it only shows up in crate.index. To figure out the path of my_field (id 0:5:1777) I have to traverse the index, and find out which struct has inner.struct.kind.plain.fields with id 0:5:1777. Then I have to get the path of the struct I found, in this case crate.paths["0:4:1776"], and append the field name to get foo::MyStruct::my_field.
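
The traversal described above might look roughly like this. It is a minimal sketch under assumptions, not cargo-rdme's code: `Item`, `Inner`, and `field_path` are hypothetical names, and the types model only the parts of the JSON shown in the example.

```rust
use std::collections::HashMap;

/// Only the item kinds needed for this sketch (hypothetical model).
pub enum Inner {
    /// A plain struct with the IDs of its fields.
    Struct { fields: Vec<String> },
    StructField,
}

pub struct Item {
    pub name: String,
    pub inner: Inner,
}

/// Build the full path of a struct field by scanning `index` for the struct
/// that lists `field_id`, then appending the field name to that struct's
/// entry in `paths`.
pub fn field_path(
    index: &HashMap<String, Item>,
    paths: &HashMap<String, Vec<String>>,
    field_id: &str,
) -> Option<Vec<String>> {
    let field = index.get(field_id)?;
    // Linear scan: the field item carries no reference back to its parent.
    let (parent_id, _) = index.iter().find(|(_, item)| match &item.inner {
        Inner::Struct { fields } => fields.iter().any(|f| f.as_str() == field_id),
        Inner::StructField => false,
    })?;
    let mut path = paths.get(parent_id)?.clone();
    path.push(field.name.clone());
    Some(path)
}
```

For the example above, calling `field_path` with id `0:5:1777` yields `["foo", "MyStruct", "my_field"]`. The scan is linear in the size of the index, which is part of why this is inconvenient.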

Ideally we would have the field in crate.paths, just like the struct. That would make the format very easy to use. Another way (which is slightly less convenient for my use case) is to have a parent field in the item of my_field in the index, pointing at the parent item (MyStruct):

{
  "index": {
    "0:5:1777": {
      "id": "0:5:1777",
      "name": "my_field",
      "parent": "0:4:1776"
    },
    "0:4:1776": {
      "id": "0:4:1776",
      "name": "MyStruct"
    }
  }
}

That would also make it easy to go from field to the corresponding struct and then get the path of that struct.

Of course, in an ideal world, we would have both: inner item paths in crate.paths as well as a parent field in the index.

👋 I'm the maintainer of cargo-semver-checks and I've had to solve some related problems. I'm not a maintainer of rustdoc or any "official" Rust components, so this is just my personal 2 cents.

Unfortunately, I think that crate.paths is much more likely to be removed than improved. It has a number of issues, for example:

  • The set of paths for a given item can sometimes be infinite -- I've seen it in multiple real-world crates!
  • It's hard to pick a "single canonical path" to show in crate.paths, since that path might be private / from a foreign crate / a type alias of another item / another edge case.
  • Some crates are just humongous and already generate 300MB+ of rustdoc JSON as is. Adding fields to crate.paths would bloat those files even more, and not by a small amount.

This is inherent complexity in the domain, and I don't think rustdoc JSON will be able to handle it for us.
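
As an illustration of the first point, one way an item can end up with an unbounded set of paths is a module that re-exports itself. This is an assumed example, not one taken from the real-world crates mentioned above; `a`, `b`, `S`, and `same_type` are hypothetical names.

```rust
pub mod a {
    // Re-export this module into itself under the name `b`, so `a::b` is `a`.
    pub use super::a as b;

    pub struct S;
}

/// `a::S`, `a::b::S`, `a::b::b::S`, ... all name the same type, so this
/// signature type-checks even though the two paths are spelled differently.
pub fn same_type(value: a::b::b::S) -> a::S {
    value
}
```

Every additional `::b` segment is another valid import path for `S`, so no finite list can enumerate them all.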

The way cargo-semver-checks addresses this is by using a query engine that internally handles the necessary name and import path resolution, and lets us use a declarative query language where we get to just take those things for granted. Here's a playground link where you can see how we can look up structs with their import paths and fields. Obviously, you can look up items and their parents by ID as well.

I've already implemented and thoroughly tested all this logic, and I'd be happy to help you get started with it if you're open to trying it out!

P.S.: cargo-rdme is cool, TIL about it 🤩 Thanks for building it!

orium commented

Thanks @obi1kenobi for your answer. I wasn't aware the number of paths could explode and even be infinite. I also wasn't aware of the Trustfall query language: it seems that it might be useful for what I'm doing (thanks for creating that!).

Given this information, I want to change my suggestion to something a bit more selfish (because it more directly solves my problem). Since an Item has links, it could provide more useful information. Currently it simply gives the item id of each link. But since mapping an item id to a canonical path seems to be hard to solve, I would argue it is reasonable for the item path, as used by rustdoc to create links, to be included in Item::links. Basically, instead of having

{
  "index": {
    "0:0:1779": {
      "links": {
        "MyStruct::my_field": "0:5:1777"
      }
    }
  }
}

we could have

{
  "index": {
    "0:0:1779": {
      "links": {
        "MyStruct::my_field": { "id": "0:5:1777", "path": ["foo", "MyStruct", "my_field"] }
      }
    }
  }
}

(An even more selfish suggestion would be for the html link to be part of Item::links, e.g. foo/struct.MyStruct.html#structfield.my_field, but that's probably too specific to be part of the rustdoc json output.)

(Again speaking in a purely personal capacity.)

Both suggestions have the effect of increasing the file size in order to denormalize the format. It's a denormalization because this info was already available elsewhere, and it's being duplicated for convenience.

It is my perception that both size increases and denormalization come with strong negative externalities, and are unlikely to happen. For example:

  • There are ongoing conversations about docs.rs hosting rustdoc JSON files in addition to the rustdoc HTML. This would obviously be useful for both our tools, right? But the bigger the JSON file size, the harder that's going to be to pull off and the more it's going to cost.
  • Past "duplicated for convenience" rustdoc JSON features (e.g. item inlining) were the source of many frustrating bugs. Getting rid of those bugs (often by eliminating the duplication) was a substantial focus of rustdoc JSON work over the last year and a half or so. Putting denormalizations back in is likely to cause many new such bugs, which will be frustrating for both us as users and for the rustdoc team.

I think it's in our best interest as rustdoc users to make sure rustdoc maintainers can focus on high-impact changes that unlock new capabilities for us. We can add ease-of-use denormalizations one abstraction layer above rustdoc JSON itself, for example via a layer like Trustfall and its rustdoc adapter.

If the links connection to other items would be useful to have, it would take less than 10min to expose it via Trustfall and make it available for querying like so:

query {
  Crate {
    item {
      ... on Struct {
        struct_id: id @output
        struct_name: name @output

        link {
          linked_item_type: __typename @output
          linked_item_id: id @output
          linked_item_name: name @output
        }
      }
    }
  }
}

I'd be happy to add it if it would be useful to you.

PR adding the above to the Trustfall schema for rustdoc: obi1kenobi/trustfall-rustdoc-adapter#308

Lmk what you think!

Referring to the Zulip discussion "Where to find keyword entries in JSON rustdoc": it currently does not seem possible to include keywords in the rustdoc JSON output. I am not sure how hard it would be to implement this feature, but hopefully it will be considered before the format is stabilized.