flogy/gatsby-mdx-tts

Frequent cache clearing costs money

andrinmeier opened this issue · 8 comments

I've had this happen a lot. After changing a page, Gatsby throws an error because something somewhere is not cached correctly or is no longer cached. This forces a rebuild of the entire site, which means all speech output MP3 files are generated again. That costs a lot of money if your site is big. Perhaps there's a way of solving this? Add a way of storing the output externally? Or maybe commit the MP3 files to git and label them with content hashes?
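
To illustrate the content-hash idea, here is a minimal sketch that derives an MP3 file name from the text to be spoken. The function name `speechFileName` and the `voiceId` parameter are assumptions made for illustration, not part of the plugin's API. Unchanged content keeps the same name and can be reused; edited content gets a new name.

```typescript
import { createHash } from "crypto";

// Hypothetical helper (not part of gatsby-mdx-tts): derive a stable file name
// from the text to be spoken and the voice, so unchanged content always maps
// to the same MP3 file and does not need to be synthesized again.
function speechFileName(text: string, voiceId: string): string {
  const hash = createHash("sha256")
    .update(voiceId)
    .update("\0") // separator so ("ab", "c") and ("a", "bc") hash differently
    .update(text)
    .digest("hex");
  return `${hash}.mp3`;
}
```

With names like these, checking whether a file already exists is enough to decide whether the text has to be sent to the speech service again.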

flogy commented

Would really be a good idea to have a solution for this. Could also be helpful for CI/CD.

Do you know if there is a best practice from Gatsby that other plugins use? I'm not aware of one so far. I think it would be better to do some quick research on that before implementing a custom solution.

I don't know of any best practices either. The official Gatsby documentation doesn't mention anything along these lines. What I could gather from reading various blog posts and tutorials is that most people simply store data in an external service. In our case it would probably make sense to store the speech marks and MP3 files in an S3 bucket, what do you think? That only makes sense if S3 costs less than Polly, but that seems to be the case. I used the AWS price calculator with conservative estimates for the Frankfurt (EU) datacenter and it comes out to around 0.50 USD / month.

On rgz-blind.ch we're still on the AWS free tier and have already reached around 3 million characters because of frequent cache invalidations and changes to the website. That would cost around 12 USD outside the free tier.
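
The estimate can be sanity-checked with a few lines. This sketch assumes a rate of 4 USD per million characters, which is simply what the figures above imply. The function and constant names are made up for illustration, and actual Polly pricing depends on the voice type and may change.

```typescript
// Assumption: 4 USD per one million characters, the rate implied by the
// "3 million characters, about 12 USD" estimate. Check current AWS pricing.
const ASSUMED_USD_PER_MILLION_CHARS = 4;

// Hypothetical helper: estimate what synthesizing `characters` would cost.
function estimateSynthesisCostUsd(
  characters: number,
  usdPerMillionChars: number = ASSUMED_USD_PER_MILLION_CHARS
): number {
  return (characters / 1_000_000) * usdPerMillionChars;
}

console.log(estimateSynthesisCostUsd(3_000_000)); // 12
```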

Another downside, besides the potential S3 costs, is the latency introduced by frequent network access during builds, which makes them slower. This could be a problem if you use Netlify because they charge you based on build minutes. Just something to keep in mind.

If you want I can try and add support for S3 buckets?

flogy commented

S3 sounds good! How would you trigger a cache invalidation though?

As this could be useful for other projects as well, what do you think about building this in a separate repository? It could then be used by this library in a similar manner as the internal Gatsby cache, ideally with a similar API 🙂 Also, I think it would make sense to build it separately to not bloat this repository much more - I think it is already a bit too big 😕

flogy commented

Another thought on this: if we want it to be used by other plugins as well it might make sense to not store the cache on S3 but just in a separate directory in the project that is not controlled by Gatsby. This would not force users to set up an AWS account. Would not be a problem for using it in this project, as AWS is already a prerequisite though.

> S3 sounds good! How would you trigger a cache invalidation though?
>
> As this could be useful for other projects as well, what do you think about building this in a separate repository? It could then be used by this library in a similar manner as the internal Gatsby cache, ideally with a similar API 🙂 Also, I think it would make sense to build it separately to not bloat this repository much more - I think it is already a bit too big 😕

So you're thinking of a generic persistent cache based on S3? There are a lot of S3 plugins for Gatsby already from what I could gather. Perhaps we can take some inspiration from the existing ones.

Perhaps we could first check the local cache. If that doesn't have the data we need, we look it up in the S3 bucket. If it doesn't exist there either, we generate it, add it to the local cache and upload it to S3.
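
A minimal sketch of that lookup order, under stated assumptions: `CacheStore`, `MemoryStore` and `getOrGenerate` are hypothetical names, and both tiers are in-memory stand-ins so the example runs without an AWS account. A real remote tier would wrap S3 calls behind the same interface. Copying a remote hit back into the local cache is an extra step assumed here, so that later lookups stay local.

```typescript
// Hypothetical minimal store interface; a real remote store would wrap S3 calls.
interface CacheStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// In-memory stand-in, used for both tiers so the sketch runs without AWS.
class MemoryStore implements CacheStore {
  private readonly entries = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.entries.get(key);
  }

  async set(key: string, value: Uint8Array): Promise<void> {
    this.entries.set(key, value);
  }
}

// Check the local cache first, then the remote one. Only when both miss is
// `generate` (the expensive, billed step) called, and its result is written
// to both tiers.
async function getOrGenerate(
  key: string,
  local: CacheStore,
  remote: CacheStore,
  generate: () => Promise<Uint8Array>
): Promise<Uint8Array> {
  const fromLocal = await local.get(key);
  if (fromLocal !== undefined) return fromLocal;

  const fromRemote = await remote.get(key);
  if (fromRemote !== undefined) {
    // Assumed extra step: refill the local cache after it was cleared.
    await local.set(key, fromRemote);
    return fromRemote;
  }

  const generated = await generate();
  await Promise.all([local.set(key, generated), remote.set(key, generated)]);
  return generated;
}
```

If the key is derived from the content, a cleared local cache then only costs a download from the remote store instead of a new synthesis request.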

To keep the S3 bucket from growing too big, we could regularly go through it and remove data that is no longer referenced, with a plugin option for enabling this behaviour.
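
The cleanup could look roughly like the sketch below. Everything named here (`PrunableStore`, `PruneOptions`, the `pruneUnreferenced` option and function) is hypothetical. With S3, `listKeys` and `remove` would wrap the bucket's list and delete operations. The option defaults to off so nothing is deleted unless the user asks for it.

```typescript
// Hypothetical interface; with S3 these would wrap list and delete calls.
interface PrunableStore {
  listKeys(): Promise<string[]>;
  remove(key: string): Promise<void>;
}

// Hypothetical plugin option, off by default.
interface PruneOptions {
  pruneUnreferenced?: boolean;
}

// Remove every stored key that the current build no longer references.
// Returns the keys that were removed (empty when the option is disabled).
async function pruneUnreferenced(
  store: PrunableStore,
  referencedKeys: ReadonlySet<string>,
  options: PruneOptions = {}
): Promise<string[]> {
  if (!options.pruneUnreferenced) return [];

  const allKeys = await store.listKeys();
  const stale = allKeys.filter((key) => !referencedKeys.has(key));
  await Promise.all(stale.map((key) => store.remove(key)));
  return stale;
}
```

The set of referenced keys would have to be collected during the build, for example by recording every key that was looked up.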

> Another thought on this: if we want it to be used by other plugins as well it might make sense to not store the cache on S3 but just in a separate directory in the project that is not controlled by Gatsby. This would not force users to set up an AWS account. Would not be a problem for using it in this project, as AWS is already a prerequisite though.

Yes, but that would mean we'd have to commit it to git, right? That might make the git repository too big. Although on my site the entire tts folder is only about 30 MB, which is still acceptable, IMHO.

flogy commented

Regarding the S3 cache: that really sounds awesome! A CLI command to manually clear the cache may also come in handy 🙂 Found a similar plugin that uses SFTP to store the cache on an external server: https://github.com/axe312ger/gatsby-plugin-sftp-cache

Regarding the local cache: I think of it more as an intermediate solution between the really transient Gatsby cache and the S3 cache, which requires users to have an AWS account. It could be committed to git, or just kept local and added to .gitignore. If it is just local, it would at least prevent some of the additional costs when doing local builds. But for CI/CD it might require the cache to be in version control (unless your CI/CD service lets you define the folder as a cache).
Here is a quite similar plugin: https://github.com/axe312ger/gatsby-plugin-netlify-cache

So my conclusion would be that there already are plugins that could be used to fix the issue of expensive TTS rebuilds. We could use those or build our own plugins in a separate repository to address the problem. Do you agree with this solution?

Yes, that's fine.