Using an architectural review for improving site reliability
I stumbled across another AWS Blogger, Eric Hammond who blogs at https://alestic.com
One of the recent things which Eric has done is his Unreliable Town Clock (UTC) which you can use to schedule triggering of AWS Lambda functions. Its a cool idea.
Eric certainly knows what he is doing, he not only launched a service he sat down and ensured "this service is as reliable as I can reasonably make it". No wonder he is a AWS Community Hero!
Of course reliability is only one of the elements of an architectural review of an AWS environment. You should cover off such things as Security, Availability, Scalability and Cost Efficiency. Eric has covered some of this. Check out what he has done to ensure UTC is always up and running, there are some great tips in there.
What if you wanted to do a architectural review of your AWS environment. How would you go about that? What questions would you ask? What things require focus? Maybe post in the comments. Saying I will call my friendly AWS Solution Architect is cheating, although its a great idea.
Two items that will really help you get started with a review are these whitepapers.
What would you do beyond this? Here is some very small things I would investigate.
- Auditing. Is CloudTrail, Config and VPC Flows all turned on? Its hard to do debugging or forensics on something in the past when you were not capturing the data. Is all the activity from the instance logged to CloudWatch Logs?
- What dependancies are there that might stop a failed employment? That autoscaling group may relaunch an instance if it fails. What AMI is it using? Is it your own AMI sitting in the account or are you launching from a public one? What if the public ones goes away because a new one is released? How is the code deployed into that AMI? Is it baked in, coming from S3, does it need to download software from github, what if it can't?
- Monitoring. There are 4 metrics in CloudWatch for SNS. Are there any alarms that could be created to provide alert of failure? What if the number of published messages dropped below a certain rate? An alarm like that could replace what Eric is using Cronitor.io for. You can even create those alarms with CloudFormation!
- Turning on MFA is always a great idea.
This is the simplest of examples. For your typical system there are hundreds of review items to assess. But you get the idea.