Reference: https://d1.awsstatic.com/whitepapers/aws-infrastructure-event-readiness.pdf
Enter event details here.
Success criteria:
- Success criteria 1
- Success criteria 2
- Success criteria 3
Start and End Time (include time zone)
- Start: MMM DD, YYYY HH:MM AM/PM GMT +8 (Day)
- End: MMM DD, YYYY HH:MM AM/PM GMT +8 (Day)
Please list all contacts that will be involved in the event:
- Group
- Name
- Phone
- Role
Enter conference details here.
Please provide URL and credentials for any websites, portals or monitoring systems that we will need to access during the event:
- CloudFront distribution (ID or CNAME)
- Route 53 hosted zone(s)
- Simple Email Service (SES) Domain
- Other
- AWS Account
- AWS Regions
- Instance IDs
- ELB Name
- Auto Scaling Group
- S3 Bucket Names
- VPC ID
- DynamoDB Hostnames
- RDS Hostnames
- ElastiCache
- CloudFormation Stacks
- AWS Support Case ID: 1234567890
- Use Case
- Region
- ELB
- ELB AZs
- Backend Instances > Is it currently scaled?
- Backend Instances > If no, when would it be scaled?
- Backend Instances > Are they running persistent connections? (Keep alive?)
- SSL > % of Traffic that is SSL: 100%
- Traffic Increase > Start Date
- Traffic Increase > End Date
- Traffic Increase > Duration of Increase
- Traffic Increase > If several days, is there an exact time for the load increase?
- Traffic Increase > If several days, is there an exact time for the load decrease?
- Traffic Increase > GET Requests/Sec (XXX/sec on avg, peak XXX rps)
- Rate of Increase > Traffic is expected to increase: 22 times (22x) normal traffic
- Data > Average size (KB) of each GET request: X KB
- Data > Average size (KB) of each GET response: X KB
- Type of Event:
- Key dates:
- Support Requirements of the event
- List of Business-Critical Workloads and AWS Service dependencies:
- Potential Capacity requirements
- Subjects requiring expert level review:
- Last Well Architected Review
- Partner: AWS Account ID of highest utilization involved account, then select Account Name
- Subject: IEM Alert - - Please use this naming convention exactly
- Case Body: Requesting proactive support for IEM.
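The traffic figures requested above (average GET rate, expected multiplier, and average request/response size) can be turned into a rough peak estimate before they go into the support case. The following is a minimal sketch, not an AWS sizing method: the function name `estimate_peak_load` and the sample numbers are hypothetical, and it assumes the multiplier applies uniformly to the average request rate.

```python
def estimate_peak_load(avg_rps, multiplier, avg_request_kb, avg_response_kb):
    """Project peak request rate and bandwidth from the form's traffic fields.

    Assumption: the multiplier applies uniformly to the average request rate.
    """
    peak_rps = avg_rps * multiplier
    # 1 KB = 1024 bytes, 8 bits per byte, 1 Mbit = 1,000,000 bits.
    bits_per_kb = 1024 * 8
    inbound_mbps = peak_rps * avg_request_kb * bits_per_kb / 1_000_000
    outbound_mbps = peak_rps * avg_response_kb * bits_per_kb / 1_000_000
    return {
        "peak_rps": peak_rps,
        "inbound_mbps": inbound_mbps,
        "outbound_mbps": outbound_mbps,
    }


# Hypothetical inputs: 500 rps on average, a 22x increase,
# 2 KB average GET request, 50 KB average GET response.
estimate = estimate_peak_load(500, 22, 2, 50)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Real traffic is rarely uniform, so treat the result as a starting point for the capacity conversation rather than a guaranteed peak.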
Step 13: Determine Event Readiness - Use the Event Readiness Scorecard to guide this conversation.
- AWS Service Readiness (Availability and Capacity)
- Operational Readiness (Stability)
- Security Readiness (DDoS Prep, Trusted Advisor checks)
- Drive open action items to completion
The questions below are intended to be answered by technical staff responsible for the successful outcome of the IEM event.
- What internal tools are being used for monitoring during the event that AWS needs access to?
- What metrics are crucial to be informed about during the event?
- ELB: Latency, ELB 5XX/4XX, Backend 5XX/4XX, Backend connection errors, Surge Queue, Healthy host count
- EC2: CPU, Network In, Network Packets In, IOPS, Disk Latencies
- ElastiCache Memcached: GetHits, GetMisses, BytesUsedForCacheItems, Evictions, FreeableMemory, CPUUtilization, NetworkBytesIn, NetworkBytesOut
- RDS: Read and Write query throughput and performance, Replication and reliability, Resource utilization, Connections
- When was the last rotation of credentials performed on resources specific to this event?
- How will the environment be accessed during the event?
- How will the environment be accessed during component/system failure?
- What risks exist to the stability of the application during the event?
- Is there a single point of failure? (e.g., single AZ)
- Where is content stored and is it resilient?
- What is the impact of any lost content/data/systems?
- What service restoration plans are in place?
- Has RDS/Auto Scaling/ELB been deployed across multiple AZs?
- In the case of outage or performance degradation is failover an automatic or manual process?
- EBS Snapshots performed?
- RDS Instances and other storage services recently backed up?
- Is there an infrastructure stress point in the design? (e.g., network bandwidth on an instance)
- Is there an application stress point in the design? (e.g., a single hot row in the database)
- What load testing will be performed on the platform?
- Is there a CDN being utilized for this event?
- Has logging been enabled on key resources to identify potential areas of performance improvement for a repeat event? (Is the output going to S3 or elsewhere?)
- Have alarms been set for performance threshold breaches? (e.g., CloudWatch or another customer tool; where is the output going?)
- Which AWS services will be used that may need increased limits for this event?
- Are there any Auto Scaling groups with a scaling policy that will start more instances than the limit for the Region?
- Will EBS storage needs throughout the event exceed the maximum allowable volume capacity limit per Region?
- Will emails be sent from EC2 instances and at what rate?
- Has the application been sized for the correct amount of Provisioned IOPS and is there a chance this may need to be raised throughout the event?
- Will spot instances be used and if so could this service limit be triggered?
- Will additional S3 buckets be created throughout the event, and if so, will the per-account bucket limit (historically 100 by default) be breached?
- If SES is being used for the sending of email, what is the maximum number of emails expected to be sent over a 24 hour period?
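The limit questions above all reduce to comparing a projected requirement against the current limit. A minimal sketch of that comparison follows. The function name `find_limit_breaches`, the dictionary keys, and every number are hypothetical placeholders; actual limits have to be read from the account (for example, the Service Quotas console or Trusted Advisor) rather than from this example.

```python
def find_limit_breaches(projected, limits):
    """Return {name: (projected, limit)} for each item that needs attention.

    An item is reported when its projected value exceeds the known limit,
    or when no limit was supplied (limit is reported as None for review).
    """
    breaches = {}
    for name, needed in projected.items():
        limit = limits.get(name)
        if limit is None:
            # Limit not recorded yet: flag it so someone looks it up.
            breaches[name] = (needed, None)
        elif needed > limit:
            breaches[name] = (needed, limit)
    return breaches


# Hypothetical projections for the event (all values are placeholders).
projected = {
    "asg_max_instances": 120,     # sum of Auto Scaling group max sizes
    "ses_emails_per_24h": 40000,  # expected SES sends in 24 hours
    "s3_buckets": 30,             # buckets after event-time creation
}
# Hypothetical current limits; s3_buckets is deliberately missing.
limits = {"asg_max_instances": 100, "ses_emails_per_24h": 50000}

for name, (needed, limit) in find_limit_breaches(projected, limits).items():
    print(f"{name}: need {needed}, limit {limit}")
```

Any item reported here is a candidate for a limit increase request or for an open action item on the scorecard.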
Enter Load Testing Results here.
- Description of primary plan
- What events will trigger a move to the secondary plan?
- Who needs to be notified about moving between plans?
- How will these resources be notified?
- How can we verify success?
- Description of secondary plan
- What events will trigger a complete rollback?
- Who needs to be notified about this?
- How will these resources be notified?
- How can we verify success?
- GREEN: Single Points of Failure eliminated at all tiers
- YELLOW: At risk for failure in one or more tiers
- RED: No failover components or plans
- GREEN: No languishing cases or open tickets to service teams that would impact system stability
- YELLOW: Risk - Customer has opened several recurring cases recently that indicate a trend that could impact system stability during the event
- RED: Issue - Customer has one or more open cases that are not resolved; open tickets with the Service Team are not progressing; this will impact system stability during the event if not resolved.
- GREEN: Service limits increased; Service team validated capacity is available in desired Region/AZ or Capacity is provisioned
- YELLOW: Risk - Capacity signals unknown or Capacity requested but not verified by Service Team
- RED: Issue - Capacity will not be available; Service limits cannot be increased
- GREEN: Pre-warms requested or completed; Pre-warm check performed by ELB Service Team
- YELLOW: Risk - Pre-warms not requested; ELB configuration not verified; risk of not scaling to expected levels, or delays in scaling imminent
- RED: Issue - ELBs undersized or not correctly configured, back-ends not configured; system will not scale to expected levels during event
- GREEN: Completed load testing - all systems capable of expected load during event
- YELLOW: Risk - Load testing not completed or levels not satisfactory; risk of not scaling to expected levels during event
- RED: Issue - Load testing failed; system will not scale to expected levels during event
- GREEN: TA Security Checks all Green or determined will not impact event (Attention on MFA on Root, CloudTrail Logging, Exposed Access Keys, Security Groups - Ports Unrestricted)
- YELLOW: Risk - One or more Security Checks appear as Yellow: Investigation recommended
- RED: Issue - One or more Security Checks appear as Red: Action recommended (Capture Action for Reporting)
- GREEN: Customer is following minimum DDoS protection best practices (BP1, BP3, BP6, and BP7; ref: https://d0.awsstatic.com/whitepapers/Security/DDoS_White_Paper.pdf)
- YELLOW: Risk - Customer is not substantially following the best practices for DDoS-resiliency and is operating with an elevated degree of risk
- RED: Issue - Application is poorly designed and deployed and is NOT prepared to withstand common DoS/DDoS attacks
- Description
- Status: Red | Yellow | Green
- Reason:
- Owner:
- Remarks:
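Once each scorecard entry has a status, the entries can be rolled up into a single readiness signal. The sketch below assumes a "worst status wins" rule (Red over Yellow over Green), which is a common convention but is not stated in this template. The function name `summarize_scorecard` and the sample entries are hypothetical.

```python
# Higher number means worse status.
SEVERITY = {"Green": 0, "Yellow": 1, "Red": 2}


def summarize_scorecard(items):
    """Return (overall_status, open_items) for a list of scorecard entries.

    Each entry is a dict with at least "description" and "status".
    Assumption: the overall status is the worst individual status.
    """
    if not items:
        raise ValueError("scorecard is empty")
    normalized = []
    for item in items:
        # Accept "GREEN", "green ", "Green", and so on.
        status = item["status"].strip().capitalize()
        if status not in SEVERITY:
            raise ValueError(f"unknown status: {item['status']!r}")
        normalized.append({**item, "status": status})
    overall = max((i["status"] for i in normalized), key=SEVERITY.__getitem__)
    # Anything not Green still needs an owner to drive it to completion.
    open_items = [i for i in normalized if i["status"] != "Green"]
    return overall, open_items


# Hypothetical entries using the fields listed above.
scorecard = [
    {"description": "Load testing", "status": "GREEN", "owner": "Team A"},
    {"description": "ELB pre-warm", "status": "Yellow", "owner": "Team B"},
    {"description": "DDoS readiness", "status": "Red", "owner": "Team C"},
]
overall, open_items = summarize_scorecard(scorecard)
print("Overall:", overall)
for entry in open_items:
    print(f"- {entry['description']}: {entry['status']} (owner: {entry['owner']})")
```

The list of non-Green entries doubles as the open action items to track before the event.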
Enter recommendations here.
Enter Post IEM Analysis here.
Enter References here.
Enter AWS Service Usage for the past 12 months.