Abstract:
Cloud computing provides access to a large pool of data, applications, and computational resources. Researchers have proposed several open-source private cloud management frameworks (e.g., Eucalyptus, Nimbus, and OpenNebula). However, private cloud development still lacks fully automatic fault-tolerance support. In this paper, we propose a new fault-tolerant checkpoint/restart system for a hierarchical private cloud. Checkpoint/restart is the simplest way to implement fault tolerance in large High Performance Computing (HPC) systems. A checkpoint saves an application's state, and a restart resumes the application's execution from the last saved state, on the same machine or on another machine. We also use Reed-Solomon erasure coding to achieve high availability and durability of the checkpoint/restart system.
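
To illustrate the checkpoint/restart idea in general terms, the following is a minimal sketch and not the system proposed in this paper. It assumes application-level checkpointing of a small Python dictionary to a local file. The function names (`save_checkpoint`, `load_checkpoint`, `run`) and the `fail_at` parameter are hypothetical and exist only for this illustration. Reed-Solomon coding of the checkpoint data is not shown.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` via a temp file so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic when source and destination are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    `fail_at` is a hypothetical hook that simulates a crash at the given step.
    """
    # Restart: resume from the last saved state if one exists.
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["acc"]
```

If `run` is interrupted and then called again with the same `path`, it continues from the last completed step instead of starting over. Because the checkpoint is an ordinary file, copying it to another machine allows the restart to happen there, provided the same code is available.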