LouisvilleMetro/WazeCCPProcessor

Work to reduce database costs

schnuerle opened this issue · 8 comments

Once the RDS work is complete in Phase 1, we'd like to find ways to reduce the overall cloud costs. If you have ideas, you can open a pull request against this repo adding support and examples so everyone can benefit.

One idea is to pull data into a real-time database that runs all the time (it gets updated every 2 minutes but holds maybe only 1 day of data), paired with a second, serverless database that is historic in nature (it has all the data and gets a nightly update from the real-time table) and is only active when queried, thus saving costs.
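To make that concrete, here is a minimal sketch of the nightly copy step, assuming both databases speak PostgreSQL and are reachable via connection strings in environment variables. The table and column names (`jams`, `jams_historic`, `id`, `pub_utc_date`, `raw_json`) are hypothetical, not this repo's actual schema:

```python
# Minimal sketch of the nightly real-time -> historic copy.
# All table/column names here are placeholders, not the repo's schema.
import os
import psycopg2

realtime = psycopg2.connect(os.environ["REALTIME_DB_URL"])
historic = psycopg2.connect(os.environ["HISTORIC_DB_URL"])

with realtime.cursor() as src, historic.cursor() as dst:
    # Pull the last day's rows from the always-on real-time table...
    src.execute(
        "SELECT id, pub_utc_date, raw_json FROM jams "
        "WHERE pub_utc_date >= now() - interval '1 day'"
    )
    for row in src:
        # ...and append them to the historic table, skipping anything
        # already loaded (assumes a unique constraint on id).
        dst.execute(
            "INSERT INTO jams_historic (id, pub_utc_date, raw_json) "
            "VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
            row,
        )

historic.commit()
realtime.close()
historic.close()
```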

Use this issue as a place to discuss and collaborate.

A "serverless database" could be nice. Depending on the volume of data and number of tables, there are a few ways you could do this with S3 – either storing the data in columnar a format like Parquet or simply storing it as gzipped CSVs. Depending on the route you take, you can use S3 Select or Athena to interrogate the data.

Going even further, if this is data that you know you can reliably recreate if needed, you could cut costs further by storing it in S3 One Zone-Infrequent Access.
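That transition can be automated with a bucket lifecycle rule; a rough sketch, where the bucket name, prefix, and 30-day cutoff are placeholders:

```python
# Sketch: move archived objects to One Zone-IA after 30 days.
# Only appropriate for data you can recreate, since One Zone-IA
# keeps the data in a single availability zone.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-waze-archive",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-onezone-ia",
            "Filter": {"Prefix": "jams/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "ONEZONE_IA"}],
        }]
    },
)
```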

Thanks @jedsundwall for hopping in with suggestions. I'm not that familiar with Parquet and the kinds of data it can handle. The JSON files that come from Waze every 2 minutes have a hierarchy and redundant data, and to use the data for analysis you have to query across multiple files. There is also a need to backload old files, and these have to be checked for duplicate data before loading. I have used Athena to test and create a schema around them, but it didn't seem efficient enough.
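For illustration, the duplicate check during a backload might look something like this rough Python sketch. It keys each jam on `uuid` and `pubMillis` (fields the Waze feed provides); in practice the in-memory set would be a lookup against the DB:

```python
# Rough sketch of deduplicating jams while backloading old feed files.
# Overlapping 2-minute snapshots repeat the same jam observations, so
# we skip records whose identifying fields we've already seen.
import json

seen = set()

def load_file(path):
    with open(path) as f:
        feed = json.load(f)
    fresh = []
    for jam in feed.get("jams", []):
        key = (jam["uuid"], jam["pubMillis"])  # identifies one observation
        if key in seen:
            continue  # duplicate across overlapping files; skip it
        seen.add(key)
        fresh.append(jam)
    return fresh
```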

Maybe I can post a query we are running now from Power BI against an internal DB (we'll be migrating it to the cloud soon) so you can see the types of things we need to run. And it seems like with an RDS you can more easily build integrations like APIs, add enriched fields and foreign keys to other data tables, and query using standard DB desktop tools – but maybe that's me speaking from my RDS world view. @jrstill might know more.

The main cost right now is the DB (the S3 storage cost is minimal) since it is always doing something every 1-2 minutes. A 1-day real-time DB plus a multi-year historic serverless DB makes the most sense to me for improving this setup and reducing cost, if we stick with an RDS setup.

We are pushing this off to later in the year, because Aurora Serverless Postgres is not available until later in 2018. https://aws.amazon.com/blogs/aws/in-the-works-amazon-aurora-serverless/

Also, right now we are running a large instance, which is as small as Aurora Postgres goes at the moment. There is talk of adding a medium and/or small option, but until that happens, running a large real-time RDS won't reduce the costs at all. It may be possible to switch to a MySQL instance for the real-time DB, which does support medium and small, but I think we'd have to change how some of our data is stored (like the JSON chunks).
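For what it's worth, MySQL 5.7+ has a native JSON column type, so storing the raw chunks could look roughly like this sketch (pymysql, with a hypothetical `raw_feed` table, not our actual schema):

```python
# Sketch: land raw Waze JSON chunks in a MySQL JSON column.
# Host, credentials, and the raw_feed table are all placeholders.
import json
import os
import pymysql

conn = pymysql.connect(host="rt-db.example.com", user="etl",
                       password=os.environ["DB_PASSWORD"], database="waze")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_feed (
            id BIGINT AUTO_INCREMENT PRIMARY KEY,
            retrieved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            payload JSON NOT NULL
        )
    """)
    # Store one 2-minute snapshot as a single JSON chunk.
    cur.execute("INSERT INTO raw_feed (payload) VALUES (%s)",
                (json.dumps({"jams": [], "alerts": []}),))
conn.commit()
```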

Do you need AWS Aurora? I manually switched to RDS PostgreSQL (db.t2.medium) and it works, and it's much cheaper (reduced from $10 to $2 per day).

@sinacek nice, thanks for trying and the suggestion. How does everything work? Can it process the data coming in alright? Get any DB errors?

The nice thing about non-Aurora is that you can shrink the DB down to save costs, then scale it up when you are doing bigger analysis/extracts – see the rough sketch below. @jrstill what do you think about this?
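Something like this via the RDS API, where the instance identifier and classes are assumptions; note that each resize causes a short outage while RDS applies it:

```python
# Sketch: scale a non-Aurora RDS instance down for day-to-day ingestion
# and back up around a heavy analysis window.
import boto3

rds = boto3.client("rds")

def resize(instance_id, instance_class):
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,  # don't wait for the maintenance window
    )

resize("waze-processor", "db.t2.medium")  # cheap day-to-day ingestion
# ... before a big extract ...
resize("waze-processor", "db.m4.large")   # temporarily scale up
```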

It works fine without any errors. For now I have just updated the Lambda functions manually, because I don't have experience with Terraform.

@sinacek can you specify what changes you made to your lambda functions to get this to work? Can you provide instructions for those who want to try this?

@schnuerle Sorry, I'm not able to find what I changed a year ago :/
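For anyone who wants to try the same switch, manually repointing the Lambda functions at a new instance would look roughly like this via the Lambda API. The function name and environment variable names here are guesses, not necessarily what @sinacek changed, and this call replaces the function's entire variable set:

```python
# Sketch: repoint a processor Lambda at a new PostgreSQL endpoint by
# overwriting its environment variables. Names are placeholders.
import boto3

lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="waze-data-processor",  # assumed function name
    Environment={"Variables": {
        "DB_HOST": "my-postgres.abc123.us-east-1.rds.amazonaws.com",
        "DB_PORT": "5432",
    }},
)
```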