Operational Excellence in the Cloud

Amazon Web Services insists on five principles: Security, Performance, Reliability, Cost Optimization, and Operational Excellence. Most are self-explanatory. Security: ensure that your system cannot be hacked. Reliability: ensure that your system always works. Performance: ensure that your system responds quickly. Cost Optimization: ensure that your system does not break the bank. So what is Operational Excellence? Cloud environments help companies set up web applications, store data and distribute information. To explain Operational Excellence, we need to understand how cloud environments provide that value, and how they continue to provide it over their lifespan.

To explain how to create Operationally Excellent cloud environments, we can look at a cloud environment as a manufactured product: an assembled set of components. Each component is replaceable, mass-produced, and tested to ensure it performs a specific function. A component can be anything from a door handle on a car to a spring in a hinge. If a door handle breaks, another handle can be fitted to the door easily and effectively. When a spring is made, the manufacturer guarantees it will last a certain amount of time if used within its service capacity (i.e. it is not overstretched or tampered with). Operational Excellence is the practice of keeping storage, computation and other cloud components just as easy to rebuild or fix.

This concept of assembling components into a larger system appears in cloud computing as Infrastructure-as-Code (IaC). Amazon, Google, Microsoft and HashiCorp (with Terraform) all offer IaC tools that allow a cloud environment to be written down as a configuration file, with each component listed and detailed. The file is run against the cloud provider, and the environment becomes available for use. The IaC file example below is written for Amazon’s IaC product, CloudFormation. It creates an alarm that will send a notification if costs exceed a certain amount.

SpendingAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription:
      'Fn::Join':
        - ''
        - - Alarm if AWS spending is over $
          - Ref: AlarmThreshold
    Namespace: AWS/Billing
    MetricName: EstimatedCharges
    Dimensions:
      - Name: Currency
        Value: USD
    Statistic: Maximum
    Period: '21600'
    EvaluationPeriods: '1'
    # Same AlarmThreshold parameter as in the description (for example, 6000)
    Threshold:
      Ref: AlarmThreshold
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - Ref: BillingAlarmNotification
    InsufficientDataActions:
      - Ref: BillingAlarmNotification

Although this file may look complicated, it lists the attributes the spending alarm needs, such as where notifications should go, under “AlarmActions”, and the spending level at which the alarm should be triggered, under “Threshold”. Since this IaC file acts as the alarm component for all cloud environments in a company, each attribute can be easily modified, and new alarms can be tested and deployed quickly, allowing for greater flexibility and control over cloud components. The IaC file has the added benefit of acting as documentation, since engineers can simply look at the file if they want to understand how a cloud environment is built.
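The alarm refers to two names it does not define itself: AlarmThreshold and BillingAlarmNotification. As a rough sketch of how the rest of the template might declare them, the fragment below assumes the threshold is a numeric template parameter and the notification target is an SNS topic with an email subscription. The Email parameter name and the descriptions are illustrative assumptions, not part of the example above.

```yaml
Parameters:
  # Dollar amount at which the alarm fires; 6000 is only an example default
  AlarmThreshold:
    Type: Number
    Default: 6000
    Description: Estimated monthly charges (USD) that trigger the alarm
  # Hypothetical parameter: address that receives the notifications
  Email:
    Type: String
    Description: Email address to notify when the alarm fires

Resources:
  # SNS topic referenced by AlarmActions and InsufficientDataActions
  BillingAlarmNotification:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Endpoint:
            Ref: Email
          Protocol: email
```

With the threshold exposed as a parameter, changing the spending limit for a given environment means changing one value rather than editing the alarm itself.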

Operational Excellence means holding every component in a cloud environment to the same standard in IaC files, whether that component handles logging, authentication, storage or computation. Everything is templatized and modular, so new technologies can be built, tested, and deployed, and new environments can be stood up quickly and with little effort.
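One way to picture this modularity is a parent template that pulls in component templates as nested stacks. The sketch below assumes the spending alarm has been saved as its own template file; the bucket URL, the logical name BillingAlarms and the threshold value are placeholders rather than details of any real environment, and a real component template may require further parameters.

```yaml
Resources:
  # Hypothetical nested stack that reuses the spending alarm component
  BillingAlarms:
    Type: AWS::CloudFormation::Stack
    Properties:
      # Placeholder location of the component template
      TemplateURL: https://example-bucket.s3.amazonaws.com/templates/spending-alarm.yaml
      Parameters:
        # Passed through to the component's AlarmThreshold parameter
        AlarmThreshold: '6000'
```

Each new environment can then reuse the same tested component and vary only the parameters it passes in.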

Although Operational Excellence provides a great deal of value, it raises the question: why is this not common practice? The answer is time. At Caserta, we often recommend creating IaC files for system components. However, few companies see the value in taking apart their existing cloud environment and rebuilding it with IaC files, as doing so does not provide immediate value to the end user. The payoff comes with time, as engineers are free to move quickly without having to worry about destroying the system. Systems are documented, and all engineers can easily understand the inner workings of environments. That saves the time engineers would otherwise spend defining system requirements or working out how components fit together, and thus we highly recommend considering Operational Excellence in your cloud environments.