Referencing data from content collections
bholmesdev opened this issue · 29 comments
Details
- Accepted Date: 24/03/23
- Reference Issues/Discussions: #477, #525
- Author: @bholmesdev
- Core Champions: @bholmesdev, @tony-sull
- Implementation PR:
Summary
Introduce a standard to store data separately from your content (ex. JSON files), with a way to "reference" this data from existing content collections.
Background & Motivation
Content collections are restricted to supporting .md, .mdx, and .mdoc files. This is limiting for other forms of data you may need to store, namely raw data formats like JSON.
Taking a blog post as the example, there will likely be author information that's reused across multiple blog posts. To standardize updates when, say, updating an author's profile picture, it's best to store authors in a separate data entry, with an API to reference this data from any blog post by ID.
The content collections API was built generically to support this future, choosing format-agnostic naming like data instead of frontmatter and body instead of rawContent. Because of this, expanding support to new data formats without API changes is a natural progression.
Use cases
We have a few use cases in mind considering data collections and data references. We expect this list to grow through the RFC discussion and learning from our community!
- Blog post meta info. Common cases include author bios, project contributors, and tags
- i18n translations. Many content sites and translation libraries work from key / value pairs stored as JSON. For example, an i18n/ collection containing en.json, fr.json, etc.
- Image asset metadata. You may want to reference reusable alt text or image widths and heights for standard assets. For example, an images/banner.json file containing the src as a string, alt text, and a preferred width
Goals
- Introduce JSON collection support, configurable and queryable with similar APIs to content collections.
- Determine where data collections are stored. We may introduce a new src/data/ directory distinct from src/content/, or simply allow data collections within src/content/.
- Introduce an API to reference this data from existing content collections by ID. This is based on the strongest user need for data collections: referencing metadata (ex. pull in post authors from a blog post).
- Consider both one-to-one and one-to-many relationships between content and data (ex. allow passing a list of author IDs in your frontmatter).
Non-goals
- User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.
We've discussed some initial examples of how data collections could work, including querying and referencing. This was informed by @tony-sull's and my work on the astro.build site.
Ex: Creating a collection of JSON
Say you have a collection of blog post authors you would like to store as JSON. You can create a new collection under src/data/ like so:
src/data/
  authors/
    ben.json
    fred.json
    matthew.json
This collection can be configured with a schema like any other content collection. To flag the collection as data-specific, we may expose a new defineDataCollection() helper:
// src/content/config.ts
import { defineDataCollection, z } from 'astro:content';
const authors = defineDataCollection({
  schema: z.object({
    name: z.string(),
    twitter: z.string().url(),
  })
});
export const collections = { authors };
It can also be queried like any other collection, this example using getDataCollection('authors'):
---
import { getDataCollection } from 'astro:content';
const authors = await getDataCollection('authors');
---
<ul>
  {authors.map(author => (
    <li>
      <a href={author.data.twitter}>{author.data.name}</a>
    </li>
  ))}
</ul>
Return type
Data collections will return a subset of fields exposed by content collections:
type DataCollectionEntry<C> = {
  id: string;
  data: object;
  collection: C;
}
This omits a few key fields:
- render(): This function is used by content collections to parse the post body into a usable Content component. Data collections have no HTML to render, so the function is removed.
- slug: This is provided by content collections as a URL-friendly version of the file id, like a permalink. Since data collections are not meant to be used as pages, this is omitted.
- body: Unlike content collections, which feature a post body separate from frontmatter, data collections are just... data. This field could still be returned as the raw JSON body, though this would give body a double meaning depending on the context: non-data information for content collections, and the "raw" data itself for data collections. We can avoid returning the body for an initial release to avoid this confusion.
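To make the difference in return shape concrete, here is a minimal TypeScript sketch. ContentCollectionEntry and toDataEntry are hypothetical names used only for illustration and are not part of the proposed API; only the id, data, and collection fields come from the description above.

```typescript
// Hypothetical shapes for illustration only; not Astro's actual types.
type DataCollectionEntry<C extends string> = {
  id: string;
  data: Record<string, unknown>;
  collection: C;
};

// A content entry carries everything a data entry does, plus the
// fields this proposal omits for data: slug, body, and render().
type ContentCollectionEntry<C extends string> = DataCollectionEntry<C> & {
  slug: string;
  body: string;
  render: () => Promise<{ Content: unknown }>;
};

// Drop the content-only fields to get the data-entry shape.
function toDataEntry<C extends string>(
  entry: ContentCollectionEntry<C>,
): DataCollectionEntry<C> {
  const { id, data, collection } = entry;
  return { id, data, collection };
}

const post: ContentCollectionEntry<'blog'> = {
  id: 'launch.md',
  slug: 'launch',
  body: '# Hello',
  data: { title: 'Astro 2.0 launched' },
  collection: 'blog',
  render: async () => ({ Content: null }),
};

const asData = toDataEntry(post);
```

The point of the sketch is that a data entry is a strict subset of a content entry, so code written against the smaller shape would keep working for both.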
Ex: Referencing data collections
Data collections could be referenced from existing content collection schemas. One example may be a reference() function (see @tony-sull's early experiment) to reference data collection entries by slug from your frontmatter.
This example allows you to list all blog post authors from each blog post:
// src/content/config.ts
// Assumption: image() is the experimental assets helper, imported here from 'astro:content'
import { defineCollection, defineDataCollection, image, reference, z } from 'astro:content';
const blog = defineCollection({
  schema: z.object({
    title: z.string(),
    authors: z.array(reference("authors")),
  })
});
const authors = defineDataCollection({
  schema: z.object({
    name: z.string(),
    avatar: image(),
  })
});
export const collections = { blog, authors };
Then, authors can be referenced by slug from each blog entry's frontmatter. This should validate each slug and raise a readable error for invalid authors:
---
title: Astro 2.0 launched
authors:
- fred-schott
- ben-holmes
---
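As a rough illustration of the "validate each slug and raise a readable error" behavior, here is a minimal sketch in plain TypeScript. validateReferences is a hypothetical helper, not the proposed reference() implementation, and the error wording is an assumption.

```typescript
// Hypothetical validator; the real reference() design was still under discussion.
function validateReferences(
  collection: string,
  knownIds: ReadonlySet<string>,
  referenced: readonly string[],
): string[] {
  const invalid = referenced.filter((id) => !knownIds.has(id));
  if (invalid.length > 0) {
    // List what is available so a typo is easy to spot.
    const available = Array.from(knownIds).sort().join(', ');
    throw new Error(
      `Invalid reference(s) to "${collection}": ${invalid.join(', ')}. ` +
        `Available entries: ${available}.`,
    );
  }
  return referenced.slice();
}

const authorIds = new Set(['ben-holmes', 'fred-schott', 'tony-sull']);

// Valid frontmatter: every ID exists in the authors collection.
const validAuthors = validateReferences('authors', authorIds, ['fred-schott', 'ben-holmes']);

// Invalid frontmatter: a misspelled ID produces a readable message.
let errorMessage = '';
try {
  validateReferences('authors', authorIds, ['tony-sul']);
} catch (err) {
  errorMessage = err instanceof Error ? err.message : String(err);
}
```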
How do you differentiate between 1:1 and 1:N? Is relation always ["tony-sull"] and never "tony-sull"? It could be enough to implement 1:1 by doing relation("author").length(5) but then you're always stuck with an array on both the frontmatter and the query side of things.
We should try to support 1:1 relations if we can, or if we can't, make sure that's mentioned in the RFC.
This may even make sense to add as an explicit goal of the stage 2 proposal, since I'd argue the query DX hit of always needing to unwrap an array for a 1:1 relation isn't acceptable.
Other initial thoughts:
- src/content/config.ts seems off, assuming a typo. Doesn't match format of https://docs.astro.build/en/guides/content-collections/#defining-a-collection-schema (maybe just missing defineCollection)
- nit: I'm a fan of rel or ref instead of relation. I'm usually not a fan of abbreviations but in this case I think it's consistent with how minimal z is. Also curious if "reference" makes more sense, especially seeing that example where the default value is ["tony-sull"] (which is literally a reference to the tony-sull data object).
- relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?
- User-facing APIs to introduce new data collection formats like YAML or TOML. We recognize the value of community plugins to introduce new formats, and we will experiment with a pluggable API internally. Still, a finalized user-facing API will be considered out-of-scope.
Another reason to avoid this for now: if the whole reference system works by referencing an id/slug, then having a single large CSV doesn't guarantee a column as id/slug. We'd need some additional config to define which column is the primary key.
@FredKSchott Thanks for the suggestions! Think I agree with all of these. Thoughts on the API design:
- That content config is valid, just writing the collections in-line instead of creating variables to export later. Copied from Tony's stage 1 proposal. Refactoring to the docs recommendations for readability.
- Agreed that ref or reference are better names. I lean towards reference to avoid colliding with state management concepts from Vue et al.
- I agree 1-1 vs 1-many should be a standalone goal! I'll admit I was wondering this too but left it out in the example. Playing with a few ideas, I'm liking this early design:
...
// Reference a single author by id (default)
author: reference('authors'),
// Reference multiple authors in a list of IDs
authors: reference('authors', { relation: 'many' }),
- The more I scope this RFC, the more I really like a separate src/data/ directory. This lets us play with ideas like removing the render() and slug conventions without much effort.
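The single vs. many design above could resolve IDs roughly as follows. This is a sketch under stated assumptions: resolveReference and RefOptions are hypothetical names, and the in-memory record stands in for however Astro would actually load the referenced entries.

```typescript
// Hypothetical resolver for the `relation` option sketched above.
type RefOptions = { relation?: 'one' | 'many' };
type Author = { name: string };

function resolveReference<T>(
  entries: Record<string, T>,
  input: unknown,
  options: RefOptions = {},
): T | T[] {
  const lookup = (id: unknown): T => {
    if (typeof id !== 'string' || !Object.prototype.hasOwnProperty.call(entries, id)) {
      throw new Error(`Unknown entry id: ${String(id)}`);
    }
    return entries[id];
  };

  if (options.relation === 'many') {
    if (!Array.isArray(input)) throw new Error('Expected a list of IDs');
    return input.map((id) => lookup(id));
  }
  // Default: a single ID resolves to a single entry, so there is no array to unwrap.
  return lookup(input);
}

const authorEntries: Record<string, Author> = {
  'ben-holmes': { name: 'Ben' },
  'fred-schott': { name: 'Fred' },
};

const single = resolveReference(authorEntries, 'ben-holmes') as Author;
const many = resolveReference(authorEntries, ['fred-schott', 'ben-holmes'], {
  relation: 'many',
}) as Author[];
```

Defaulting to a single entry addresses the concern raised earlier in the thread about always unwrapping an array for 1:1 relations.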
relation("authors").default(["tony-sull"]) That default took a second for me to parse, but I think that makes sense (vs. not supporting it at all). How much of the Zod API does relation() implement? Hopefully all of it?
Looking into this, I don't think we can support default() and optional() chaining in this way. Under the hood, reference() would transform the input ID to the actual data that ID points to:
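// Pseudocode: illustrates the idea only and is not runnable as written
export function reference(id) {
  // Resolve the ID to the data module it points to
  const dataModule = import(resolveIdToDataImport(id));
  return z.object(dataModule);
}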
This means helpers like .transform() and .refine() can work as expected, in case you want to massage data further (see the image().refine(...) example for our experimental assets). But since data is already resolved, default in particular wouldn't work (optional() might). I think each of these should be parameters instead, if supported:
...
authors: z.reference('authors', { default: 'ben-holmes' });
I'd definitely prefer the z.array(reference()) syntax if there aren't too many tradeoffs.
For me I can't think of many uses for .default() or .transform() inside of z.array(), if that's the main tradeoff. I might want to default the array itself, but I'm not sure if .default() would ever run on an individual array item. For .transform(), would there be any performance improvement trying to transform each array item individually vs. transforming the full array once it resolves?
Thanks @tony-sull, I agree that's intuitive! I'm wrestling with whether reference() should A) transform IDs to your data directly with a Zod transform, or B) avoid the transform and return some flag to tell Astro "hey! Post-process this Zod object key please!". These are the tradeoffs in what Zod extension functions we could support:
// Solution a)
author: reference('authors').transform(data => ...), // ✅ works
author: reference('authors').refine(data => ...), // ✅ works
author: reference('authors').default('ben-holmes') // ❌ Doesn't work. Data already resolved!
authors: z.array(reference('authors')) // ❌ Doesn't work. Data already resolved!
// Solution b)
author: reference('authors').transform(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').refine(data => ...), // ❌ Doesn't work. Data not resolved yet!
author: reference('authors').default('ben-holmes') // ✅ works
authors: z.array(reference('authors')) // ✅ works
If we want to have Zod both ways, we need to add configuration options for the functions we don't support.
Ex. if we support .transform() and .refine(), we'd need the following for array and default:
authors: reference('authors', { array: true, default: 'tony-sull' }),
From what I've seen, I expect users to lean on array and default more than transform and refine. So I'm starting to agree that is the better way to go.
@tony-sull Just marked my comment above as outdated because... I'm wrong! Zod transforms run separately, so you can totally do z.array(...) around a transformer and have it still work. Now I'm 100% on-board with your suggestion.
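The point that an array wrapper can sit around a per-item transformer can be shown with a tiny stand-in for a schema library. This sketch does not use Zod and does not reproduce its API; Schema, reference, and array here are hypothetical helpers that only model the composition.

```typescript
// A tiny stand-in for a schema library; this is NOT Zod's API.
type Schema<Out> = { parse: (input: unknown) => Out };

// Per-item transformer: turns an ID string into the entry it points to.
function reference<T>(entries: Record<string, T>): Schema<T> {
  return {
    parse(input) {
      if (typeof input !== 'string' || !Object.prototype.hasOwnProperty.call(entries, input)) {
        throw new Error(`Unknown reference: ${String(input)}`);
      }
      return entries[input];
    },
  };
}

// Array wrapper: runs the item schema (including its transform) on each element.
function array<Out>(item: Schema<Out>): Schema<Out[]> {
  return {
    parse(input) {
      if (!Array.isArray(input)) throw new Error('Expected an array');
      return input.map((value) => item.parse(value));
    },
  };
}

const authorData = {
  'ben-holmes': { name: 'Ben' },
  'tony-sull': { name: 'Tony' },
};

const authorsSchema = array(reference(authorData));
const resolvedNames = authorsSchema
  .parse(['ben-holmes', 'tony-sull'])
  .map((author) => author.name);
```

Because each item is transformed independently before the array result is assembled, wrapping a transforming schema in an array does not conflict with the data already being resolved.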
@bholmesdev Excellent! I wasn't actually sure if that setup would work, glad that does the trick!
Discussion on single-file data collections vs. multi-file
There have been a few mentions of supporting a single file to store all data collection entries as an array, instead of splitting up entries per-file as we do content collections today. This would mean, say, a single src/content/authors.json file instead of a few src/content/authors/[name].json files.
Investigating this, I think it's best to stick with multiple files instead. Reasons:
- Big arrays don't scale for complex data entries, like i18n translation files. Users will likely want to split up by file here, especially where the status quo is en.json | fr.json | jp.json ... for this use case.
- We'd need to parse the whole array of entries just to figure out ids. This could be a performance bottleneck vs. pulling IDs from file names.
- It would be different from content collections, which means a learning curve.
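The performance argument above rests on IDs being derivable from file names alone. A minimal sketch of that idea follows; idFromFilePath is a hypothetical helper, and stripping the extension to form the ID is an assumption rather than confirmed behavior.

```typescript
// Hypothetical helper: derive an entry ID from a path relative to the
// collection directory, without opening or parsing the file.
function idFromFilePath(relativePath: string): string {
  // Normalize Windows separators so IDs are stable across platforms.
  const normalized = relativePath.replace(/\\/g, '/');
  // Drop the final extension (assumed convention for this sketch).
  return normalized.replace(/\.[^/.]+$/, '');
}

const translationFiles = ['en.json', 'fr.json', 'jp.json'];
const translationIds = translationFiles.map(idFromFilePath);

// Nested directories stay part of the ID.
const nestedId = idFromFilePath('docs\\intro.json');
```

With a single authors.json array, by contrast, the whole file would have to be read and parsed before any ID is known.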
The reasons in favor of [collection].json:
- Self-identifying data collections in the src/content/ folder. You can see which collections have data vs. which have content at a glance without opening the config file, which is convenient. The alternative for file-based identification when using file entries would be a src/data/ folder.
- CSV support, where each row in a table would generate an entry. Though thinking it over, this feels like a special case that shouldn't dictate how JSON and YAML are stored.
I'd definitely lean towards multiple files for data collections, at least for the use cases I can think of. CSVs are an interesting one, but I could even see wanting multiple CSV files for something like a "database" of transactions grouped by month.
src/content vs src/data is one I could definitely go either way on! Both have pros and cons, a couple ideas I thought about while listening to the community call today:
- The main difference is a content collection entry gets a .render() function to generate HTML. Two separate directories may help make that distinction more clear.
- On the other hand, using one src/content folder would make an upgrade path easier if you need to go from src/data/authors/ben.json to src/content/authors/ben.md to add something like content for an About page.
I am working on a template for a talk I'll be giving at a soon-to-be-announced conference, where I want to show people how they can set up an engineering blog for themselves or a multi-author blog for their company.
A feature like relational data from JSON would be really neat to have! In the meantime I manually hooked it all up based on a string ID.
A few notes on the earlier conversation here:
- reference() makes sense to me
- I like the idea of having multiple files
- Separation between src/content and src/data makes sense to me
Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection? I haven't looked at the internals of getCollection much but from my perspective as mostly an Astro user it makes sense, and it would open up possibilities for other type options without adding even more getThingCollection() functions and having to import those.
Ah, that's great to hear @EddyVinck! I'm working to have a preview release by end-of-day tomorrow. I'll share that branch here once it's up.
Just an idea I'd like to throw in here: what if you could do getCollection('blog') for the current behavior, and getCollection({ name: 'author', type: 'json' }) for the data collection?
This is an interesting idea! Though I will admit, I'm not sure if type should be tied to file extensions. The goal is to separate based on shape of the response (note content and data have different return shapes), and each file extension of a given type adheres to that shape. In other words, file extension shouldn't matter when you're querying; just the shape of the response. So far I've considered 3 possible types:
- content - Markdown, MDX, and Markdoc as we have today. These feature both data and a render-able body.
- data - JSON, YAML, CSVs, and other data types. These feature data alone, without a render-able body.
- (future idea?) page - Content intended to be used as pages. These feature data and a render-able body, along with properties for mapping URLs, like a permalink.
I also worry that the type API reads like a type cast, implying you could import a content collection as a data collection. Though I also see a parallel to import assertions which could be nice. Either way, since I don't see us having more than the 3 shapes outlined above, I think getDataCollection() is a compromise to avoid breaking changes. We'll think over it though!
Well I'm a man of my word! Here's a preview branch + video overview of the new data collection APIs. Still waiting on the preview release CI action (something's holding it up...) but you should be able to clone the repo and try the examples/with-data starter.
Looks like the link got cut off, here's the actual PR: withastro/astro#6850
Discussion on src/data/ vs. src/content/
We've considered two options for storing data collections: using the same src/content/ directory we have today, or introducing a new, separate src/data/ directory.
Why src/data/
- Follows patterns for storing arbitrary JSON or YAML in 11ty (see the _data/ convention).
- Clearly defines how data and content are distinct concepts that return different information (ex. content includes a body and a render() utility, while data does not). This makes data-specific APIs like getDataEntryById() easier to conceptualize.
- Avoids confusion on whether mixing content and data in the same collection is supported; collections should be distinctly content or data. We can add appropriate error states for src/content/, but using the directory to define the collection type creates a pit of success.
Why src/content/
- Follows Nuxt content's pattern for storing data in a content/ directory.
- Avoids a new reserved directory. This could mean a simpler learning curve, i.e. I already know collections live in src/content/, so I'll add my new "data" collection in this directory too. From user testing with the core team, this expectation arose a few times.
- Allows switching from JSON to MD and back again more easily, without moving the directory. Example: you find a need for landing pages and bios for your authors data collection, so you move json -> md while retaining your schema.
- Avoids the need for moving the collection config to your project root. With src/data/, requiring the config to live in a src/content/config.ts is confusing. This is amplified when you do not have any content collections.
Conclusion
There are compelling pros on both sides. Today, we have a deploy preview using src/data/ to get user feedback before making a final decision. Though based on the API bash with our core team and feedback below, using src/content/ for everything could be the more intuitive API.
This one's really minor, but I also like that reusing src/content/ means Astro isn't claiming another special directory in src. That means one less major breaking change and less chance of getting in the user's way if someone wants their own src/data directory.
Using src/content also allows for colocating your data with your content, where you could even put frontmatter you might end up putting in a markdown file in a separate json data file.
@jasikpark So as part of this, I don't expect data and content to be able to live in the same collection. We'd require users to specify the type of a collection in their config file (i.e. defineCollection({ type: 'data' ... })). This is arguably a point against supporting everything in src/content/ since it may encourage users to try this pattern when it is not supported.
ohhhhhh i forgot that
my-post/
  content.mdoc
  post-image.jpeg
  post-image2.webp
  data.json
wouldn't be supported...
hmm i dunno how i feel about either directory then
@jasikpark Well that would be supported still, as long as you add an underscore _ to the file name to mark as ignored in our type checker. We won't run mixed content and data through the same Zod schema. So colocation is fine, but mixed validation is not.
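The underscore convention described above could be sketched as a simple filter over a collection's files. isIgnoredFile is a hypothetical helper, and checking only the file name (not parent directories) is an assumption based on the wording of the comment.

```typescript
// Hypothetical helper: a file is skipped by validation when its name
// starts with an underscore (assumption: only the file name is checked).
function isIgnoredFile(relativePath: string): boolean {
  const fileName = relativePath.split('/').pop() || '';
  return fileName.charAt(0) === '_';
}

const entryFiles = [
  'my-post/content.mdoc',
  'my-post/_post-image.jpeg',
  'my-post/_post-image2.webp',
  'my-post/_data.json',
];

// Only non-ignored files would be run through the collection's schema.
const validatedFiles = entryFiles.filter((file) => !isIgnoredFile(file));
```

In this layout the colocated images and JSON sit next to the entry but are never checked against its schema.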
Ok - I guess I've been thinking of a collection entry as a folder rather than a markdoc file. Good to understand that better.
@jasikpark That's a model keystatic has adopted actually! Since nested directories are used for slug / id generation, we haven't used this same model. It almost reminds me of the NextJS app/ directory vs. our current routing story.
cool, i'll play around w/ that then - thx for all the responses
how does it make you think of the app folder for nextjs?
@jasikpark Well, it's the difference between file vs. directory-based routing. I think content collections and Astro's pages/ router have a lot of parallels, including the underscore _ for colocation. The other solution is to use wrapper directories for everything, where key files have a special name (like page.tsx or content.mdoc) with colocation allowed for any other files. I've heard thoughts on supporting both, which is interesting!
ooohhhh thx for clarifying, that's interesting yeah
I'm more in favour of src/content, mostly because reserving another directory feels like too much. This doesn't mean that we can't have a src/content/data folder, although it might NOT make sense because they are two different concepts.
Thanks for the input y'all! Implemented src/content/ on the latest PR. Stage 3 RFC to come.
Closing as this is completed and in stable.