alestic/ec2-expire-snapshots

Snapshot deletes fail without errors after ec2 session times out

waynerobinson opened this issue · 16 comments

The program claims to delete a specific snapshot id.
That snapshot id is not deleted when the program completes.
This appears to be happening after deleting many (hundreds) of snapshots.

This may or may not be related to a rate limit that sometimes shows up in other large snapshot delete operations.

Thanks for reporting this. Are you saying that the program is not reporting any errors trying to delete the snapshot? How do you know it is a rate limit error?

The program definitely isn't reporting any errors.

I'm assuming it is a rate limit error (my Perl skills are pretty non-existent, so I can't confirm it from the code). I'm currently trying to clean up a snapshot archive of more than 20k snapshots, deleting many hundreds per volume at a time, and while ec2-expire-snapshots runs I've been getting the odd rate limit error in the EC2 Console and in my automated snapshotting script.

In situations where I have been deleting 300+ snapshots, only the first couple of hundred actually get deleted, yet the script runs to completion. Thankfully, given the implementation of the script, I can run it again to remove any remaining snapshots.

I would prefer that rate-limited deletes be retried after waiting for the rate limit counters to recharge (which is basically how they work). It would also be useful if ec2-expire-snapshots could work in a more "off peak" manner, for example deleting no more than one snapshot per second (or similar), so it doesn't interfere with other production scripts (snapshotting, instance creation/control, etc.).
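Something along these lines is the kind of retry behaviour I mean (a rough sketch only, since my Perl is weak; it assumes the script keeps its connection in $ec2 and that a failed delete comes back as a Net::Amazon::EC2::Errors object):

use strict;
use warnings;

# Rough sketch: retry a snapshot delete with an increasing wait so the
# rate limit counters have time to recharge before the next attempt.
# Assumption: a failed delete returns a Net::Amazon::EC2::Errors object.
sub delete_snapshot_with_retry {
    my ($ec2, $snapshot_id) = @_;
    my $delay = 1;    # seconds; doubles after each failed attempt
    for my $attempt (1 .. 8) {
        my $result = $ec2->delete_snapshot(SnapshotId => $snapshot_id);
        return 1 unless ref $result && $result->isa('Net::Amazon::EC2::Errors');
        warn "delete of $snapshot_id failed (attempt $attempt), retrying in ${delay}s\n";
        sleep $delay;
        $delay *= 2;
    }
    return 0;    # give up after the last attempt
}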

Does this describe what you are actually seeing?

  • The program claims to delete a specific snapshot id.
  • That snapshot id is not deleted when the program completes.

That is correct.

The code does appear to check for an error return from the EBS snapshot delete call, so there might be an issue with the underlying package.

I'd be comfortable adding an option to reduce the rate of EBS snapshot deletes by adding an N second delay between deletes, say:

ec2-expire-snapshots --delete-delay N [...]

I'm not sure when I'll get around to this, but if somebody submits a patch or merge request, I'd be happy to review.
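Untested sketch of what I have in mind, just to show the shape of the change (@snapshots_to_delete and $opt{'delete-delay'} are placeholders here, not the script's actual variable names):

# $ec2 is the existing Net::Amazon::EC2 object. A delay of 0 keeps the
# current behavior; anything else sleeps between delete calls.
for my $snapshot_id (@snapshots_to_delete) {
    my $result = $ec2->delete_snapshot(SnapshotId => $snapshot_id);
    if ( ref $result && $result->isa('Net::Amazon::EC2::Errors') ) {
        warn "error deleting $snapshot_id: ",
            join(', ', map { $_->code . ': ' . $_->message } @{ $result->errors }),
            "\n";
    }
    sleep $opt{'delete-delay'} if $opt{'delete-delay'};
}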

Maybe this isn't a rate limit issue. I am running the same expiration scripts with a 2-second delay and the script still fails in a similar way (saying it's deleting snapshots but not actually deleting them).

Is there some way I can turn on full logging for all HTTP messages?

Try uncommenting this line:
# ($Debug ? (debug => 1) : ()),
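That line passes a debug flag through to the Net::Amazon::EC2 constructor. With it uncommented (and $Debug true), the constructor call ends up roughly like the following, and the module then prints the raw requests and responses it sends and receives (variable names here are illustrative, not copied from the script):

my $ec2 = Net::Amazon::EC2->new(
    AWSAccessKeyId  => $aws_access_key_id,
    SecretAccessKey => $aws_secret_access_key,
    ($Debug ? (debug => 1) : ()),    # debug => 1 turns on request/response logging
);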

I'm wondering if it is a timing issue or if we need to re-create the Net::Amazon::EC2 object after a while.

Does increasing the --delete-delay significantly (60 seconds?) cause fewer snapshots to be deleted successfully or more?

OK, at some point after starting deletions the API starts returning:

$VAR1 = {
          'RequestID' => 'N/A',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'HTTP POST FAILURE',
                                   'Message' => '503 Service Unavailable'
                                 }
                      }
                    ]
        };


ERROR CODE: HTTP POST FAILURE MESSAGE: 503 Service Unavailable FOR REQUEST: N/A

I've tested this with delete delays of 5 seconds; I will try a 30-second version next (although it may take some time for it to start failing).

Hm, sure would be nice if Net::Amazon::EC2 detected and returned that error.

OK, it appears that the 503 may be the rate limit message (the above message was the result of no delay).

After running for some time, the 30-second-delay version started generating the messages below:

$VAR1 = {
          'RequestID' => 'b856ee33-69b6-4d78-b806-a4e4c8d6c91c',
          'Errors' => [
                      {
                        'Error' => {
                                   'Code' => 'RequestExpired',
                                   'Message' => 'Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z'
                                 }
                      }
                    ]
        };


ERROR CODE: RequestExpired MESSAGE: Request has expired. Timestamp date is 2013-07-12T03:07:00.000Z FOR REQUEST: b856ee33-69b6-4d78-b806-a4e4c8d6c91c

So you might be right: there may be a session timeout or similar after which the `Net::Amazon::EC2` object needs to be regenerated. Maybe regenerate and retry when the timeout message is received?
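Something like this is what I'm picturing (rough sketch only; make_ec2() is a made-up helper standing in for however the script currently builds its connection):

# Rebuild the connection and retry once when the request signature has
# gone stale (RequestExpired), rather than silently moving on.
my $result = $ec2->delete_snapshot(SnapshotId => $snapshot_id);
if (   ref $result
    && $result->isa('Net::Amazon::EC2::Errors')
    && grep { $_->code eq 'RequestExpired' } @{ $result->errors } ) {
    $ec2    = make_ec2();    # fresh Net::Amazon::EC2 object, fresh timestamps
    $result = $ec2->delete_snapshot(SnapshotId => $snapshot_id);
}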

Thanks for running these tests. Am I correct in understanding that you see these errors only in the Net::Amazon::EC2 debug output and that none of them show up as being reported by ec2-expire-snapshots?

Yes, these are only available in the library's debug output.

ec2-expire-snapshots just ignores both these errors.

Looks like this thread got lost for a while. After re-reading, I think the next step would be to test having the program re-open a new Net::Amazon::EC2 object every N minutes. If this improves the situation, a pull request can be submitted.
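The test could be as simple as tracking the object's age and rebuilding it when it gets too old. A sketch of the idea (make_ec2() again being a stand-in for the existing constructor call, and 15 minutes an arbitrary starting point):

my $max_age  = 15 * 60;        # seconds; the "N minutes" to be tuned by testing
my $ec2      = make_ec2();
my $ec2_born = time;

for my $snapshot_id (@snapshots_to_delete) {
    if ( time - $ec2_born > $max_age ) {
        $ec2      = make_ec2();    # re-open a new Net::Amazon::EC2 object
        $ec2_born = time;
    }
    $ec2->delete_snapshot(SnapshotId => $snapshot_id);
}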