Securing Extreme Availability of a Large-Scale Enterprise System

Telecommunications / Operational Support Systems


A client assignment to secure extreme availability (greater than five nines, i.e. above 99.999%) of a large-scale enterprise system for both physical and cloud deployments. The approach was to aggressively break the system by introducing random failures, monitor for compromises in system availability, and design in resilience functionality.
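To put the availability target in concrete terms, the following minimal sketch converts an availability figure into a yearly downtime budget. The function name and the 365-day year are illustrative assumptions, not part of the client's tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the maximum downtime per year, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Five nines expressed as a fraction.
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes per year")
```

Under these assumptions, five nines leaves a budget of roughly 5.26 minutes of downtime per year, which is why even short unplanned outages matter at this level.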

Purpose of Project


Ensure that the system is fully resilient to internal and external failures and that critical functionality is always available in a production environment.

A secondary goal was to develop and maintain an automated regression test to ensure that system availability is not degraded during system development.

Solution


The overall solution was modelled on 'Chaos Monkey' practices pioneered by Netflix and other companies deploying on Amazon Web Services.

Creation of random system failures and stress conditions through invasive software-based agents.
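The idea of random failure introduction can be sketched as follows. This is a simulation only: the `SimulatedNode` class, the three failure types and the function names are hypothetical and stand in for whatever the real agents did on live hosts.

```python
import random
from dataclasses import dataclass


@dataclass
class SimulatedNode:
    """Hypothetical stand-in for a host or service instance under test."""
    name: str
    process_up: bool = True
    network_up: bool = True
    cpu_load: float = 0.1


def kill_process(node: SimulatedNode) -> None:
    node.process_up = False


def drop_network(node: SimulatedNode) -> None:
    node.network_up = False


def saturate_cpu(node: SimulatedNode) -> None:
    node.cpu_load = 1.0


# Illustrative failure catalogue; a real agent would act on the host itself.
FAILURES = {
    "kill_process": kill_process,
    "drop_network": drop_network,
    "saturate_cpu": saturate_cpu,
}


def inject_random_failure(nodes, rng):
    """Pick one node and one failure type at random, apply it, and report both."""
    node = rng.choice(nodes)
    failure_name = rng.choice(sorted(FAILURES))
    FAILURES[failure_name](node)
    return node.name, failure_name


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that a failing run can be replayed
    cluster = [SimulatedNode(f"node-{i}") for i in range(3)]
    print(inject_random_failure(cluster, rng))
```

Seeding the random generator is a deliberate choice in this sketch: when a run exposes an availability problem, the same failure sequence can be replayed while the fix is developed.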

Measurement and monitoring of system behaviour, reaction and availability during failure and stress testing. Measurement of availability during system upgrades, to prevent any degradation.
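One simple way to quantify availability during a test run is to probe a critical function at a fixed interval and record whether each probe succeeded. The sketch below assumes that model; the function names and the probe interval are illustrative, and the source does not describe the actual measurement method.

```python
from typing import Sequence


def measured_availability(probe_results: Sequence[bool]) -> float:
    """Fraction of probes that succeeded during the test window."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    return sum(probe_results) / len(probe_results)


def longest_outage_seconds(probe_results: Sequence[bool], probe_interval_s: float) -> float:
    """Length of the longest run of consecutive failed probes, in seconds."""
    longest = current = 0
    for ok in probe_results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * probe_interval_s


if __name__ == "__main__":
    # Hypothetical probe log: True means the critical function responded.
    results = [True, True, False, False, True, True, True, False, True, True]
    print(measured_availability(results))
    print(longest_outage_seconds(results, probe_interval_s=1.0))
```

Tracking the longest outage alongside the overall ratio helps distinguish one long interruption from many brief ones, which usually point to different resilience gaps.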

Designing in resilience and recoverability in response to any observed lack of availability during failure and stress testing. Continual evolution of new failure scenarios based on analysis of issues that slipped through.

Building the tests into Test Automation and Continuous Integration frameworks to ensure no degradation in system availability as new functionality is added during the development phase.
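A regression gate of this kind can be sketched as a comparison between a stored baseline and the availability measured in the current build, per failure scenario. The scenario names, figures and function name below are hypothetical; the actual frameworks used are not named in the source.

```python
from typing import Dict, Optional, Tuple


def find_availability_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return scenarios whose availability fell below the baseline.

    A scenario missing from the current run is also reported, since an
    unexecuted test gives no evidence that availability was preserved.
    """
    regressions = {}
    for scenario, expected in baseline.items():
        observed = current.get(scenario)
        if observed is None or observed < expected - tolerance:
            regressions[scenario] = (expected, observed)
    return regressions


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    baseline = {"kill_database": 0.9999, "drop_network": 0.999}
    current = {"kill_database": 0.9999, "drop_network": 0.99}
    for scenario, (expected, observed) in find_availability_regressions(baseline, current).items():
        print(f"{scenario}: expected at least {expected}, observed {observed}")
```

In a Continuous Integration job, a non-empty result would typically fail the build, so that a change which weakens resilience is caught before release.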

The availability tests were originally designed for a Physical Deployment and subsequently modified to work in an OpenStack Private Cloud Deployment.

Benefits to Organisation


The client has enjoyed reduced costs for maintenance and support of its customer-deployed systems, increased customer satisfaction, and an enhanced reputation from meeting customer expectations on system availability and resilience.