cal-itp/data-infra

Move gtfs-data and gtfs-data-test to cold storage

We store a ton of old data in gtfs-data (duplicated in gtfs-data-test) that isn't accessed anymore. We've confirmed we do not plan to backfill the RT data from H1 of 2022, so we can go ahead and move these to a colder storage class and save some storage money.

  1. Confirm there is no traffic to these buckets
  2. Consider limiting read access to these buckets
  3. Change the default storage class
  4. Write a script/notebook to convert all the existing objects to a colder storage class (see the sketch after this list)
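
For steps 3 and 4, a minimal sketch using the google-cloud-storage Python client, assuming Coldline as the target class (the class choice, and rewriting objects one by one rather than via a lifecycle rule, are assumptions to be validated):

# Sketch for steps 3-4; assumes google-cloud-storage and Coldline as the target class.
from google.cloud import storage

TARGET_CLASS = "COLDLINE"  # assumption; could also be "ARCHIVE"

client = storage.Client()

for bucket_name in ["gtfs-data", "gtfs-data-test"]:
    bucket = client.get_bucket(bucket_name)

    # Step 3: change the default storage class so new objects land in the colder class.
    bucket.storage_class = TARGET_CLASS
    bucket.patch()

    # Step 4: rewrite existing objects into the colder class. This loops object by
    # object, so it will be slow for ~12.5M objects; each rewrite is a Class A operation.
    for blob in client.list_blobs(bucket_name):
        if blob.storage_class != TARGET_CLASS:
            blob.update_storage_class(TARGET_CLASS)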

This may replace #859

After some investigation, we should probably consider configuring Autoclass for these buckets (I don't believe that was an option at the time this ticket was created). Thanks for the flag @mjumbewu, cc @SorenSpicknall
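
If we do go the Autoclass route, enabling it should just be a bucket-metadata update; here's a sketch with the google-cloud-storage client, assuming a client version new enough to expose the autoclass_enabled property:

# Sketch: enable Autoclass on both buckets (assumes a recent google-cloud-storage
# release that supports the Autoclass bucket properties).
from google.cloud import storage

client = storage.Client()

for bucket_name in ["gtfs-data", "gtfs-data-test"]:
    bucket = client.get_bucket(bucket_name)
    bucket.autoclass_enabled = True
    bucket.patch()
    print(f"{bucket_name}: Autoclass enabled at {bucket.autoclass_toggle_time}")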

From what I can tell, pricing implications are:

Pricing
Cloud Storage pricing remains the same for Autoclass-enabled buckets, with the following exceptions:

  • Retrieval fees are never charged.
  • Early deletion fees are never charged.
  • All operations are charged at the Standard storage rate.
  • There is no operation charge when Autoclass transitions an object to a colder storage class.
  • There is no Class A operation charge when Autoclass transitions an object from Nearline storage to Standard storage.
  • When Autoclass transitions an object from Coldline storage or Archive storage to Standard storage or Nearline storage, each such transition incurs a Class A operation charge.
  • A management fee and enablement charge apply when using Autoclass.

More information related to the Autoclass management fee and Autoclass enablement charge (from the Cloud Storage pricing page):

Autoclass charges

The following additional charges are associated with buckets that use the Autoclass feature:

Autoclass management fee: Buckets that have Autoclass enabled incur a monthly fee of $0.0025 for every 1000 objects stored within them.

  • Objects smaller than 128 kibibytes are not counted when determining the fee.
  • The fee is prorated to the millisecond for each object that isn't stored for the full month.
  • The fee is also prorated to the millisecond when disabling Autoclass.

Autoclass enablement charge: Buckets that enable Autoclass have a one-time charge for configuring existing objects to use Autoclass. This charge applies even if you immediately disable Autoclass and includes the following, as applicable:

  • Early delete charges for objects that haven't met their minimum storage duration
  • Retrieval fees for objects not currently in Standard storage
  • A Class A operation charge for each object in the bucket, in order to transition them to Autoclass pricing and Standard storage
    • Objects that are smaller than 128 kibibytes and already stored in Standard storage at the time Autoclass is enabled are excluded from this operation charge

Okay, I generated a single-day inventory report as Parquet files for each bucket, gtfs-data and gtfs-data-test, and loaded them into BigQuery as tables (staging.gtfs_data_reports and staging.gtfs_data_test_reports).

The report tables include object size in bytes. Autoclass charges $0.0025 per 1,000 objects, and objects smaller than 128 kibibytes don't count toward the fee. 128 kibibytes = 131,072 bytes, so I queried each new table:

SELECT
  COUNT(*)
FROM `staging.gtfs_data_reports`  -- and again for `staging.gtfs_data_test_reports`
WHERE size > 131072

gtfs_data_reports: 12,532,519 objects / 1,000 × $0.0025 ≈ $31.33
gtfs_data_test_reports: 12,149,779 objects / 1,000 × $0.0025 ≈ $30.37

So roughly $61.70 to convert both of these buckets to Autoclass? Want to check my work @SorenSpicknall?
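
For anyone re-running this later, here's a sketch of the same check from Python with the BigQuery client; the table names are the ones above, and the fee math just applies the $0.0025 per 1,000 objects rate quoted earlier:

# Sketch: count objects over 128 KiB (131,072 bytes), as in the query above, and
# estimate the monthly Autoclass management fee at $0.0025 per 1,000 objects.
from google.cloud import bigquery

client = bigquery.Client()

for table in ["staging.gtfs_data_reports", "staging.gtfs_data_test_reports"]:
    sql = f"SELECT COUNT(*) AS n FROM `{table}` WHERE size > 131072"
    n = next(iter(client.query(sql).result()))["n"]
    monthly_fee = n / 1000 * 0.0025
    print(f"{table}: {n:,} eligible objects -> ~${monthly_fee:,.2f}/month management fee")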

That is mostly correct, but you're missing a one-time Class A operation charge per object at the time Autoclass is enabled for that object. Class A operations are currently priced at $0.0050 per 1,000 operations. So you're talking about:

gtfs_data_reports
  12,532,519 objects / 1,000 × $0.0050 ≈ $62.66 one-time startup charge for Autoclass
  12,532,519 objects / 1,000 × $0.0025 ≈ $31.33 monthly charge for Autoclass

gtfs_data_test_reports
  12,149,779 objects / 1,000 × $0.0050 ≈ $60.74 one-time startup charge for Autoclass
  12,149,779 objects / 1,000 × $0.0025 ≈ $30.37 monthly charge for Autoclass

What you'll need to do next here is calculate the cost difference between keeping these objects at Standard and keeping them at Archive. That pricing is based on total byte count, not object count, so you'll need to sum size over the same subset of objects in each bucket that meets the minimum Autoclass size threshold. Roughly:

[sum of size for each eligible object] / [2^30 bytes per gibibyte] * [$0.0230 per GiB-month in Standard] = monthly Standard cost
[sum of size for each eligible object] / [2^30 bytes per gibibyte] * [$0.0025 per GiB-month in Archive] = monthly Archive cost

These cost differences wouldn't be fully realized until after a year, since Autoclass slowly moves objects through storage class types before eventually reaching Archive, but we can use an annualized cost as a baseline for determining whether eventual savings in storage costs will be worth the startup costs for using Autoclass on these buckets.
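
As a sketch of that comparison in one place (the byte totals are placeholders still to be pulled from the inventory tables with SUM(size); the per-GiB and per-operation rates are the ones quoted in this thread):

# Sketch: annualized Standard-vs-Archive comparison using the formulas above.
# Rates are the ones quoted in this thread; total_bytes is a placeholder to be
# filled in with SUM(size) over the eligible objects in each inventory table.
STANDARD_RATE = 0.0230  # $ per GiB-month, Standard (quoted above)
ARCHIVE_RATE = 0.0025   # $ per GiB-month, Archive (quoted above)
CLASS_A_RATE = 0.0050   # $ per 1,000 Class A operations
MGMT_RATE = 0.0025      # $ per 1,000 objects per month (Autoclass management fee)
GIB = 2 ** 30

def annualized_picture(total_bytes: int, eligible_objects: int) -> None:
    gib = total_bytes / GIB
    standard_monthly = gib * STANDARD_RATE
    archive_monthly = gib * ARCHIVE_RATE
    startup = eligible_objects / 1000 * CLASS_A_RATE
    mgmt_monthly = eligible_objects / 1000 * MGMT_RATE
    # Baseline only: objects take roughly a year to reach Archive under Autoclass,
    # so actual first-year savings will be lower than this figure.
    net_annual = 12 * (standard_monthly - archive_monthly - mgmt_monthly) - startup
    print(f"Standard ~${standard_monthly:,.2f}/mo vs Archive ~${archive_monthly:,.2f}/mo; "
          f"startup ~${startup:,.2f}; annualized net savings baseline ~${net_annual:,.2f}")

# Example (byte total is a placeholder, object count is from the query above):
# annualized_picture(total_bytes=<SUM(size) for gtfs-data>, eligible_objects=12_532_519)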

Let me know if you need any other context for these calculations!