Using an architectural review for improving site reliability

Posted on Tuesday, June 16, 2015 | 2 Comments

I stumbled across another AWS blogger, Eric Hammond, who blogs at https://alestic.com

One of the recent things Eric has done is his Unreliable Town Clock (UTC), which you can use to schedule the triggering of AWS Lambda functions. It's a cool idea.

Eric certainly knows what he is doing. He not only launched a service, he sat down and ensured "this service is as reliable as I can reasonably make it". No wonder he is an AWS Community Hero!

Of course, reliability is only one of the elements of an architectural review of an AWS environment. You should also cover Security, Availability, Scalability and Cost Efficiency. Eric has covered some of this. Check out what he has done to ensure UTC is always up and running; there are some great tips in there.

What if you wanted to do an architectural review of your AWS environment? How would you go about that? What questions would you ask? What things require focus? Maybe post in the comments. Saying "I will call my friendly AWS Solution Architect" is cheating, although it's a great idea.

Two items that will really help you get started with a review are these whitepapers.

What would you do beyond this? Here are some small things I would investigate.

Auditing. Are CloudTrail, Config and VPC Flow Logs all turned on? It's hard to do debugging or forensics on something in the past when you were not capturing the data. Is all the activity from the instance logged to CloudWatch Logs?
Dependencies. What dependencies are there that might cause a failed deployment? That Auto Scaling group may relaunch an instance if it fails. What AMI is it using? Is it your own AMI sitting in the account, or are you launching from a public one? What if the public one goes away because a new one is released? How is the code deployed into that AMI? Is it baked in, or coming from S3? Does it need to download software from GitHub, and what if it can't?
Monitoring. There are 4 metrics in CloudWatch for SNS. Are there any alarms that could be created to alert you to a failure? What if the number of published messages dropped below a certain rate? An alarm like that could replace what Eric is using Cronitor.io for. You can even create those alarms with CloudFormation!
Turning on MFA is always a great idea.
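
To make the monitoring item a little more concrete, here is a rough sketch of what such an alarm might look like as a CloudFormation template, built as a Python dictionary and printed as JSON. It is not what Eric runs. The function name, the topic name example-town-clock, the AlertTopicArn parameter, the 15 minute period and the threshold of one message are all assumptions chosen for illustration. TreatMissingData is set to breaching on the assumption that a period with no data points at all should count as a failure; that is worth testing against a real topic before relying on it.

```python
import json


def sns_publish_alarm_template(topic_name, period_seconds=900, min_messages=1):
    """Build a CloudFormation template (as a dict) containing one CloudWatch
    alarm that fires when fewer than `min_messages` messages are published
    to the SNS topic `topic_name` within one period.

    All names and default values here are illustrative assumptions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Hypothetical parameter: a separate SNS topic that receives the alert.
            "AlertTopicArn": {
                "Type": "String",
                "Description": "ARN of the SNS topic to notify when the alarm fires",
            }
        },
        "Resources": {
            "MissedPublishAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Too few messages published to " + topic_name,
                    "Namespace": "AWS/SNS",
                    "MetricName": "NumberOfMessagesPublished",
                    "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
                    "Statistic": "Sum",
                    "Period": period_seconds,
                    "EvaluationPeriods": 1,
                    "Threshold": min_messages,
                    "ComparisonOperator": "LessThanThreshold",
                    # Assumption: no data points in a period means a missed publish.
                    "TreatMissingData": "breaching",
                    "AlarmActions": [{"Ref": "AlertTopicArn"}],
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the template so it can be saved to a file and reviewed.
    print(json.dumps(sns_publish_alarm_template("example-town-clock"), indent=2))
```

Building the template in code like this just keeps the period and threshold in one place; writing the same JSON by hand would work equally well.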

This is the simplest of examples. For your typical system there are hundreds of review items to assess. But you get the idea.

Doing an architectural review is something you should do periodically in your AWS environment. As AWS keeps releasing new features, there are frequently new things you can do to improve your setup.

If only everyone was like Eric! Also, anyone who builds everything in CloudFormation is a winner in my book!

Rodos

Comments:2

Eric 7:19 am
Rodos:

Though I like having an external party monitoring the Unreliable Town Clock, one can never have enough eyes watching critical services and I'd love to add a CloudWatch Alarm. Unfortunately, I was unable to figure out how to configure CloudWatch Alarms to alert me when a quarter hour SNS message was missed. If you have a way to set this up, I'd love to hear it.

I understand (and have experienced in other projects) the risk associated with having an EC2 instance failure right when GitHub has an outage, but that's a risk I'm willing to take, as one of the primary goals of the Unreliable Town Clock is to be dead simple to set up. Similarly, I am going to trust Canonical's history of keeping public Ubuntu AMIs available pretty much forever (contrary to the Amazon Linux AMI policy).

Some of your other ideas have been implemented, some are already on my todo list, and I've added a couple others to the todo list.

Thanks!
-- Eric Hammond

Musings of Rodos

Rodney Haywood