withastro/roadmap

Referencing data from content collections

bholmesdev opened this issue · 29 comments

Details

Summary

Introduce a standard to store data separately from your content (ex. JSON files), with a way to "reference" this data from existing content collections.

Background & Motivation

Content collections are restricted to supporting .md, .mdx, and .mdoc files. This is limiting for other forms of data you may need to store, namely raw data formats like JSON.

Taking a blog post as the example, there will likely be author information that's reused across multiple blog posts. To standardize updates when, say, updating an author's profile picture, it's best to store authors in a separate data entry, with an API to reference this data from any blog post by ID.

The content collections API was built generically to support this future, choosing format-agnostic naming like data instead of frontmatter and body instead of rawContent. Because of this, expanding support to new data formats without API changes is a natural progression.

Use cases

We have a few use cases in mind considering data collections and data references. We expect this list to grow through the RFC discussion and learning from our community!

  • Blog post meta info. Common cases include author bios, project contributors, and tags
  • i18n translations. Many content sites and translation libraries work from key / value pairs stored as JSON. For example, an i18n/ collection containing en.json, fr.json, etc.
  • Image asset metadata. You may want to reference reusable alt text or image widths and heights for standard assets. For example, an images/banner.json file containing the src as a string, alt text, and a preferred width

Goals

  • Introduce JSON collection support, configurable and queryable with similar APIs to content collections.
  • Determine where data collections are stored. We may introduce a new src/data/ directory distinct from src/content/, or simply allow data collections within src/content/.
  • Introduce an API to reference this data from existing content collections by ID. This is based on the strongest user need for data collections: referencing metadata (ex. pull in post authors from a blog post).
  • Consider both one-to-one and one-to-many relationships between content and data (ex. allow passing a list of author IDs in your frontmatter).

Non-goals

  • User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.

We've discussed some initial examples of how data collections could work, including querying and referencing. This was informed by the work @tony-sull and I did on the astro.build site.

Ex: Creating a collection of JSON

Say you have a collection of blog post authors you would like to store as JSON. You can create a new collection under src/data/ like so:

src/data/
  authors/
    ben.json
    fred.json
    matthew.json

This collection can be configured with a schema like any other content collection. To flag the collection as data-specific, we may expose a new defineDataCollection() helper:

// src/content/config.ts
import { defineDataCollection, z } from 'astro:content';

const authors = defineDataCollection({
	schema: z.object({
		name: z.string(),
		twitter: z.string().url(),
	})
});

export const collections = { authors };

It can also be queried like any other collection. This example uses getDataCollection('authors'):

---
import { getDataCollection } from 'astro:content';
const authors = await getDataCollection('authors');
---
<ul>
{authors.map(author => (
	<li>
		<a href={author.data.twitter}>{author.data.name}</a>
	</li>
))}
</ul>

Return type

Data collections will return a subset of fields exposed by content collections:

type DataCollectionEntry<C> = {
  id: string;
  data: object;
  collection: C;
}

This omits a few key fields:

  • render(): this function is used by content collections to parse the post body into a usable Content component. Data collections have no HTML to render, so the function is removed.
  • slug: This is provided by content collections as a URL-friendly version of the file id, like a permalink. Since data collections are not meant to be used as pages, this is omitted.
  • body: Unlike content collections, which feature a post body separate from frontmatter, data collections are just... data. This field could still be returned as the raw JSON body, though this would give body a double meaning depending on the context: non-data information for content collections, and the "raw" data itself for data collections. We can avoid returning the body for an initial release to avoid this confusion.
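To make the difference concrete, here is a minimal TypeScript sketch of the two entry shapes side by side. Only the field names come from the discussion above; `ContentEntry`, `isContentEntry`, and the shape of the `render()` result are illustrative assumptions, not part of the proposal.

```typescript
// Hypothetical shapes for illustration; only the field names come from the RFC text.
type DataEntry<C extends string> = {
  id: string;
  data: Record<string, unknown>;
  collection: C;
};

// A content entry adds the three fields that data entries omit.
type ContentEntry<C extends string> = DataEntry<C> & {
  slug: string;
  body: string;
  render: () => Promise<{ Content: unknown }>;
};

// Tell the two apart by checking for the content-only render() function.
function isContentEntry<C extends string>(
  entry: DataEntry<C> | ContentEntry<C>,
): entry is ContentEntry<C> {
  return typeof (entry as ContentEntry<C>).render === "function";
}

// Example data entry; the field values are placeholders.
const ben: DataEntry<"authors"> = {
  id: "ben",
  data: { name: "Ben", twitter: "https://example.com/ben" },
  collection: "authors",
};
```

Modelling the content shape as "data shape plus extras" mirrors the "subset of fields" wording above, so code that only needs `id`, `data`, and `collection` can accept either kind of entry.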

Ex: Referencing data collections

Data collections could be referenced from existing content collection schemas. One example may be a reference() function (see @tony-sull's early experiment) to reference data collection entries by slug from your frontmatter.

This example allows you to list all blog post authors from each blog post:

// src/content/config.ts
import { defineCollection, defineDataCollection, image, reference, z } from 'astro:content'

const blog = defineCollection({
  schema: z.object({
    title: z.string(),
    authors: z.array(reference("authors")),
  })
});

const authors = defineDataCollection({
  schema: z.object({
    name: z.string(),
    avatar: image(),
  })
})

export const collections = { blog, authors };

Then, authors can be referenced by slug from each blog entry's frontmatter. This should validate each slug and raise a readable error for invalid authors:

---
title: Astro 2.0 launched
authors:
- fred-schott
- ben-holmes
---
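As a rough sketch of what "validate each slug and raise a readable error" could look like, the helper below checks a frontmatter list against the set of known entry IDs. `validateReferences` and the error wording are hypothetical, and the ID set is hard-coded here where a real implementation would derive it from the collection's files.

```typescript
// Hypothetical helper: check every referenced ID against the IDs known for a collection.
function validateReferences(
  collection: string,
  knownIds: ReadonlySet<string>,
  referenced: readonly string[],
): string[] {
  const missing = referenced.filter((id) => !knownIds.has(id));
  if (missing.length > 0) {
    // List the valid options so the author of the post can spot a typo quickly.
    const available = Array.from(knownIds).sort().join(", ");
    throw new Error(
      `Invalid reference(s) to "${collection}": ${missing.join(", ")}. ` +
        `Available entries: ${available}.`,
    );
  }
  return Array.from(referenced);
}

// Assumed set of author IDs, using names that appear in this thread.
const authorIds = new Set(["ben-holmes", "fred-schott", "tony-sull"]);
const validAuthors = validateReferences("authors", authorIds, ["fred-schott", "ben-holmes"]);
```

Reporting every missing ID at once, along with the available ones, is one way to keep the error readable when a post lists several authors.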

How do you differentiate between 1:1 and 1:N? Is relation always ["tony-sull"] and never "tony-sull"? It could be enough to implement 1:1 by doing relation("author").length(5) but then you're always stuck with an array on both the frontmatter and the query side of things.

We should try to support 1:1 relations if we can, or if we can't make sure that's mentioned in the RFC.

This may even make sense to add as an explicit goal of the stage 2 proposal, since I'd argue the query DX hit of always needing to unwrap an array for 1:1 relation isn't acceptable.

Other initial thoughts:

  • src/content/config.ts seems off, assuming a typo. Doesn't match format of https://docs.astro.build/en/guides/content-collections/#defining-a-collection-schema (maybe just missing defineCollection)
  • nit: I'm a fan of rel or ref instead of relation. I'm usually not a fan of abbreviations but in this case I think it's consistent with how minimal z is. Also curious if "reference" makes more sense, especially seeing that example where the default value is ["tony-sull"] (which is literally a reference to the tony-sull data object).
  • relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?
  • User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.

Another reason to avoid this for now: if the whole reference system works by referencing an id/slug, then having a single large CSV doesn't guarantee a column as id/slug. We'd need some additional config to define which column is the primary key.

@FredKSchott Thanks for the suggestions! Think I agree with all of these. Thoughts on the API design:

  1. That content config is valid, just writing the collections in-line instead of creating variables to export later. Copied from Tony's stage 1 proposal. Refactoring to the docs recommendations for readability.
  2. Agreed that ref or reference are better names. I lean towards reference to avoid colliding with state management concepts from Vue et al.
  3. I agree 1-1 vs 1-many should be a standalone goal! I'll admit I was wondering this too but left it out in the example. Playing with a few ideas, I'm liking this early design:
...
// Reference a single author by id (default)
author: reference('authors'),
// Reference multiple authors in a list of IDs
authors: reference('authors', { relation: 'many' }),
  4. The more I scope this RFC, the more I really like a separate src/data/ directory. This lets us play with ideas like removing the render() and slug conventions without much effort.
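For the one vs. many design floated in point 3, here is a minimal sketch of how a `relation` option could change both the accepted input and the inferred type. This `reference()` is a stand-in written for illustration, not the real astro:content API; it only checks input shapes and does not resolve entries.

```typescript
// Hypothetical sketch of the `relation` option; not the real astro:content API.
type Relation = "one" | "many";
type ReferenceParser<T> = (input: unknown) => T;

// Overloads let the return type follow the option: a single ID by default, a list for "many".
function reference(collection: string): ReferenceParser<string>;
function reference(collection: string, options: { relation: "one" }): ReferenceParser<string>;
function reference(collection: string, options: { relation: "many" }): ReferenceParser<string[]>;
function reference(
  collection: string,
  options: { relation: Relation } = { relation: "one" },
): ReferenceParser<string | string[]> {
  const expectId = (value: unknown): string => {
    if (typeof value !== "string") {
      throw new Error(`Expected an ID for "${collection}", received ${typeof value}`);
    }
    return value;
  };
  if (options.relation === "many") {
    return (input) => {
      if (!Array.isArray(input)) {
        throw new Error(`Expected a list of IDs for "${collection}"`);
      }
      return input.map((item) => expectId(item));
    };
  }
  return expectId;
}

const author = reference("authors")("ben-holmes");
const authors = reference("authors", { relation: "many" })(["fred-schott", "ben-holmes"]);
```

With this shape, a 1:1 reference never needs unwrapping from an array on the query side, which was the DX concern raised earlier in the thread.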

relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?

Looking into this, I don't think we can support default() and optional() chaining in this way. Under the hood, reference() would transform the input ID to the actual data that ID points to:

export function reference(id) {
  const dataModule = import(resolveIdToDataImport(id));
  return z.object(dataModule);
}

This means helpers like .transform() and .refine() can work as expected, in case you want to massage data further (see the image().refine(...) example for our experimental assets). But since data is already resolved, default in particular wouldn't work (optional() might). I think each of these should be parameters instead, if supported:

...
authors: z.reference('authors', { default: 'ben-holmes' });

I'd definitely prefer the z.array(reference()) syntax if there aren't too many tradeoffs

For me I can't think of many uses for .default() or .transform() inside of z.array(), if that's the main tradeoff. I might want to default the array itself, but I'm not sure if .default() would ever run on an individual array item. For .transform(), would there be any performance improvement trying to transform each array item individually vs. transforming the full array once it resolves?

Thanks @tony-sull, I agree that's intuitive! I'm wrestling with whether reference() should A) transform IDs to your data directly with a Zod transform, or B) avoid the transform and return some flag to tell Astro "hey! Post-process this Zod object key please!". These are the tradeoffs in what Zod extension functions we could support:

// Solution a)
author: reference('authors').transform(data => ...), // ✅ works
author: reference('authors').refine(data => ...), // ✅ works
author: reference('authors').default('ben-holmes') // ❌ Doesn't work. Data already resolved!
authors: z.array(reference('authors')) // ❌ Doesn't work. Data already resolved!

// Solution b)
author: reference('authors').transform(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').refine(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').default('ben-holmes') // ✅ works
authors: z.array(reference('authors')) // ✅ works

If we want to have Zod both ways, we need to add configuration options for the functions we don't support.

Ex. if we support .transform() and .refine(), we'd need the following for array and default:

authors: reference('authors', { array: true, default: 'tony-sull' }),

From what I've seen, I expect users to lean on array and default more than transform and refine. So I'm starting to agree that is the better way to go 👍

@tony-sull Just marked my comment above as outdated because... I'm wrong! Zod transforms run separately, so you can totally do z.array(...) around a transformer and have it still work. Now I'm 100% on-board with your suggestion 👍
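A small sketch of why wrapping a transforming schema in an array still works: the array schema simply runs the element schema, transform included, once per item. The `Schema` type and the `reference()` and `array()` functions below are a toy stand-in written for this example, not Zod or Astro APIs, and the author data is made up.

```typescript
// A tiny stand-in for a schema library, used only to illustrate parse order.
type Schema<T> = { parse: (input: unknown) => T };

type Author = { id: string; name: string };

// Assumed in-memory "collection"; a real implementation would load entries from files.
const authorEntries = new Map<string, Author>([
  ["ben-holmes", { id: "ben-holmes", name: "Ben" }],
  ["fred-schott", { id: "fred-schott", name: "Fred" }],
]);

// reference() validates one ID and transforms it into the entry it points to.
function reference(): Schema<Author> {
  return {
    parse(input) {
      const entry = typeof input === "string" ? authorEntries.get(input) : undefined;
      if (!entry) throw new Error(`Unknown author: ${String(input)}`);
      return entry;
    },
  };
}

// array() knows nothing about the transform; it just parses each item with the element schema.
function array<T>(element: Schema<T>): Schema<T[]> {
  return {
    parse(input) {
      if (!Array.isArray(input)) throw new Error("Expected an array");
      return input.map((item) => element.parse(item));
    },
  };
}

const resolved = array(reference()).parse(["fred-schott", "ben-holmes"]);
```

Because each item is resolved independently, the outer schema sees raw IDs going in and resolved entries coming out, which matches the behavior described in the comment above.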

@bholmesdev Excellent! I wasn't actually sure if that setup would work, glad that does the trick! 🚀

Discussion on single-file data collections vs. multi-file

There have been a few mentions of supporting a single file to store all data collection entries as an array, instead of splitting up entries per-file as we do content collections today. This would mean, say, a single src/content/authors.json file instead of a few src/content/authors/[name].json files.

Investigating this, I think it's best to stick with multiple files instead. Reasons:

  • Big arrays don't scale for complex data entries, like i18n translation files. Users will likely want to split up by file here, especially where the status quo is en.json | fr.json | jp.json ... for this use case.
  • We'd need to parse the whole array of entries just to figure out ids. This could be a performance bottleneck vs. pulling IDs from file names.
  • It would be different from content collections, which means a learning curve.
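To illustrate the second point, deriving IDs from file names needs no parsing at all. The helper below is a hypothetical sketch: `idFromPath` is an invented name, and the `src/content/i18n` location is an assumption for the example.

```typescript
// Hypothetical helper: derive an entry ID from its path relative to the collection directory.
function idFromPath(collectionDir: string, filePath: string): string {
  const prefix = collectionDir.endsWith("/") ? collectionDir : `${collectionDir}/`;
  if (!filePath.startsWith(prefix)) {
    throw new Error(`${filePath} is not inside ${collectionDir}`);
  }
  const relative = filePath.slice(prefix.length);
  // Drop only the final extension, so a nested path keeps its directory as part of the ID.
  return relative.replace(/\.[^./]+$/, "");
}

const files = [
  "src/content/i18n/en.json",
  "src/content/i18n/fr.json",
  "src/content/i18n/jp.json",
];
const ids = files.map((file) => idFromPath("src/content/i18n", file));
```

A directory listing is enough to build the full ID set here, whereas a single [collection].json file would have to be read and parsed before any ID is known.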

The reasons in favor of [collection].json:

  • Self-identifying data collections in the src/content/ folder. You can see which collections have data vs. which have content at a glance without opening the config file, which is convenient. The alternative for file-based identification when using file entries would be a src/data/ folder.
  • CSV support, where each row in a table would generate an entry. Though thinking it over, this feels like a special case that shouldn't dictate how JSON and YAML are stored.

I'd definitely lean towards multiple files for data collections, at least for the use cases I can think of. CSVs are an interesting one, but I could even see wanting multiple CSV files for something like a "database" of transactions grouped by month

src/content vs src/data is one I could definitely go either way on! Both have pros and cons, a couple ideas I thought about while listening to the community call today:

  • the main difference is a content collection entry gets a .render() function to generate HTML. Two separate directories may help make that distinction more clear

  • On the other hand, using one src/content folder would make an upgrade path easier if you need to go from src/data/authors/ben.json to src/content/authors/ben.md to add something like content for an About page

I am working on a template for a talk I'll be giving at a soon-to-be-announced conference, where I want to show people how they can set up an engineering blog for themselves or a multi-author blog for their company.

A feature like relational data from JSON would be really neat to have! In the meantime I manually hooked it all up based on a string ID.

A few notes on the earlier conversation here:

  • reference() makes sense to me
  • I like the idea of having multiple files
  • Separation between src/content and src/data makes sense to me

Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection? I haven't looked at the internals of getCollection much but from my perspective as mostly an Astro user it makes sense, and it would open up possibilities for other type options without adding even more getThingCollection() functions and having to import those.

Ah, that's great to hear @EddyVinck! I'm working to have a preview release by end-of-day tomorrow. I'll share that branch here once it's up.

Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection?

This is an interesting idea! Though I will admit, I'm not sure if type should be tied to file extensions. The goal is to separate based on shape of the response (note content and data have different return shapes), and each file extension of a given type adheres to that shape. In other words, file extension shouldn't matter when you're querying; just the shape of the response. So far I've considered 3 possible types:

  • content - Markdown, MDX, and Markdoc as we have today. These feature both data and a render-able body.
  • data - JSON, YAML, CSVs, and other data types. These feature data alone, without a render-able body.
  • (future idea?) page - Content intended to be used as pages. These feature data and a render-able body, along with properties for mapping URLs, like a permalink.

I also worry that the type API reads like a type cast, implying you could import a content collection as a data collection. Though I also see a parallel to import assertions which could be nice. Either way, since I don't see us having more than the 3 shapes outlined above, I think getDataCollection() is a compromise to avoid breaking changes. We'll think over it though!

Well I'm a man of my word! Here's a preview branch + video overview of the new data collection APIs. Still waiting on the preview release CI action (something's holding it up...) but you should be able to clone the repo and try the examples/with-data starter 🚀

withastro/astro#6850

Looks like the link got cut off, here's the actual PR: withastro/astro#6850

Discussion on src/data/ vs. src/content/

We've considered two options for storing data collections: using the same src/content/ directory we have today, or introducing a new, separate src/data/ directory.

Why src/data/

  • Follows patterns for storing arbitrary JSON or YAML in 11ty (see the _data/ convention).
  • Clearly defines how data and content are distinct concepts that return different information (ex. content includes a body and a render() utility, while data does not). This makes data-specific APIs like getDataEntryById() easier to conceptualize.
  • Avoids confusion on whether mixing content and data in the same collection is supported; collections should be distinctly content or data. We can add appropriate error states for src/content/, but using the directory to define the collection type creates a pit of success.

Why src/content/

  • Follows Nuxt content's pattern for storing data in a content/ directory
  • Avoids a new reserved directory. This could mean a simpler learning curve, i.e. I already know collections live in src/content/, so I'll add my new "data" collection in this directory too. From user testing with the core team, this expectation arose a few times.
  • Allows switching from JSON to MD and back again more easily, without moving the directory. Example: you find a need for landing pages and bios for your authors data collection, so you move json -> md while retaining your schema.
  • Avoids the need for moving the collection config to your project root. With src/data/, requiring the config to live in a src/content/config.ts is confusing. This is amplified when you do not have any content collections.

Conclusion

There are compelling pros on both sides. Today, we have a deploy preview using src/data/ to get user feedback before making a final decision. Though based on the API bash with our core team and feedback below, using src/content/ for everything could be the more intuitive API.

This one's really minor, but I also like that reusing src/content/ means Astro isn't claiming another special directory in src

That means one less major breaking change and less chance of getting in the user's way if someone wants their own src/data directory

Using src/content also allows for colocating your data with your content, where you could even put frontmatter you might end up putting in a markdown file in a separate json data file

@jasikpark So as part of this, I don't expect data and content to be able to live in the same collection. We'd require users to specify the type of a collection in their config file (i.e. defineCollection({ type: 'data' ... })). This is arguably a point against supporting everything in src/content/ since it may encourage users to try this pattern when it is not supported.
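A rough sketch of the kind of check this implies: once a collection declares its type, any file of the other kind can be reported up front. `assertSingleType` and the extension lists are assumptions drawn from the formats named in this thread, not an actual Astro implementation.

```typescript
type CollectionType = "content" | "data";

// Extension lists are assumptions based on the formats mentioned in this thread.
const CONTENT_EXTENSIONS = [".md", ".mdx", ".mdoc"];
const DATA_EXTENSIONS = [".json"];

// Hypothetical check: every file in a collection must match the collection's declared type.
function assertSingleType(collection: string, type: CollectionType, files: string[]): void {
  const allowed = type === "content" ? CONTENT_EXTENSIONS : DATA_EXTENSIONS;
  const offenders = files.filter((file) => !allowed.some((ext) => file.endsWith(ext)));
  if (offenders.length > 0) {
    throw new Error(
      `Collection "${collection}" is type "${type}" but contains: ${offenders.join(", ")}`,
    );
  }
}

// Passes: a data collection containing only JSON entries.
assertSingleType("authors", "data", ["ben.json", "fred.json", "matthew.json"]);
```

An explicit error like this is one way to provide the "appropriate error states" mentioned earlier if everything lives under src/content/.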

ohhhhhh i forgot that

my-post/
	content.mdoc
	post-image.jpeg
	post-image2.webp
	data.json

wouldn't be supported...

hmm i dunno how i feel about either directory then 🙈

@jasikpark Well that would be supported still, as long as you add an underscore _ to the file name to mark as ignored in our type checker. We won't run mixed content and data through the same Zod schema. So colocation is fine, but mixed validation is not
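As a sketch of the underscore convention described here, the filter below skips any file whose name starts with `_`, so colocated assets never reach schema validation. `isIgnored` is a hypothetical name, and checking only the file name (not parent directories) is an assumption for this example.

```typescript
// Hypothetical check: a file is skipped when its own name starts with an underscore.
function isIgnored(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  return fileName.startsWith("_");
}

// The colocated files from the example above, renamed with a leading underscore.
const colocated = [
  "my-post/content.mdoc",
  "my-post/_post-image.jpeg",
  "my-post/_post-image2.webp",
  "my-post/_data.json",
];
const entryFiles = colocated.filter((file) => !isIgnored(file));
```

Only `my-post/content.mdoc` is left to validate, so the colocated JSON never runs through the content collection's schema.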

Ok - I guess I've been thinking of a collection entry as a folder rather than a markdoc file 😓 good to understand that better

@jasikpark That's a model keystatic has adopted actually! Since nested directories are used for slug / id generation, we haven't used this same model. It almost reminds me of the NextJS app/ directory vs. our current routing story.

cool, i'll play around w/ that then - thx for all the responses 💜

how does it make you think of the app folder for nextjs?

@jasikpark Well, it's the difference between file vs. directory-based routing. I think content collections and Astro's pages/ router have a lot of parallels, including the underscore _ for colocation. The other solution is to use wrapper directories for everything, where key files have a special name (like page.tsx or content.mdoc) with colocation allowed for any other files. I've heard thoughts on supporting both, which is interesting!

ooohhhh thx for clarifying, that's interesting yeah

I'm more in favour of src/content, mostly because reserving another directory feels too much. This doesn't mean that we can't have an src/content/data folder, although it might NOT make sense because they are two different concepts.

Thanks for the input y'all! Implemented src/content/ on the latest PR. Stage 3 RFC to come

Closing as this is completed and in stable.