- 1107511.pdf (391k)
IEEE International Conference on Cloud Computing Technology and Science;
IEEE Computer Society
This paper demonstrates a bottom-up approach to developing autonomic fault tolerance and disaster recovery on cloud-based deployments. We avoid lock-in to specific recovery features provided by the cloud itself, and instead show that tools used in system administration today can provide the foundation for recovery processes with few additions. The resulting system, Hydra, is capable of detecting failures in instances and can redeploy any instance at a new location without human intervention. The layered design of configuration management tools enables separation of recovery processes and the actual service, making the Hydra applicable to a wide range of scenarios. The implementation was tested and an analysis of the recovery time is provided, demonstrating that the Hydra is capable of completely rebuilding a new site in 15 minutes.