lsst-epo/rubin-obs-api

Troubleshoot what's taking up ephemeral storage and causing containers to enter a bad state

Closed · 1 comment

The regression that dropped the API container from 4Gi to 2Gi of ephemeral storage revealed that far more space is being used than expected. I need to perform a root cause analysis to make sure this problem does not get worse as traffic grows.

My suspicion is that the transformations that are happening on-the-fly via GQL queries are eating up storage space - these transformations are saved to disk before being sent along to GCS. I have set up an alert to go off once a container reaches 1Gi of ephemeral storage. The only way to determine for certain what's going on is to use the following commands once the alert goes off:

Disk Usage:

du -h --max-depth=1

- or -

Disk Free Space:

df

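The best place to start is the ./storage folder in the container: run du there, then cd into the largest subfolder and repeat until I find the culprit. df only reports usage per filesystem, so du is the command that narrows it down to a folder.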
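
A minimal sketch of that drill-down, assuming the container image ships a POSIX shell with du, sort, and head. The helper name `biggest` is hypothetical and not part of the repo. It uses `-d 1` in place of `--max-depth=1` because the short form is also accepted by BusyBox and BSD du, and `-k` in place of `-h` so the sizes sort numerically.

```shell
# Hypothetical helper: list the largest entries directly under a directory.
# Usage: biggest [dir] [count]
biggest() {
  dir="${1:-.}"      # directory to inspect, defaults to the current one
  count="${2:-10}"   # how many lines to show, defaults to 10
  # -k    report sizes in KiB so sort -n can order them
  # -x    stay on the same filesystem (skip other mounts)
  # -d 1  only descend one level, like --max-depth=1
  du -k -x -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Running `biggest ./storage` prints the total for ./storage first, followed by its subfolders from largest to smallest. From there, calling `biggest` on the top subfolder and repeating walks down to whatever is filling the disk.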