Practical Single Node Failure Recovery Using Fractional Repetition Codes in Data Centers
Abstract
Node failures in distributed storage systems are becoming a critical issue, and many
erasure codes are designed to handle such failures. The purpose of this paper is to
evaluate Fractional Repetition (FR) codes, a class of regenerating codes for distributed
storage systems, as a practical solution. FR codes consist of a concatenation of an outer
Maximum Distance Separable (MDS) code and an inner fractional repetition code that
splits the data into several blocks and stores multiple replicas of each on different nodes
in the system. We model the problem as an integer linear programming problem that uses
modified versions of the fractional repetition code by allowing different block sizes, and
minimizes the recovery cost of all single node failure scenarios. The contribution of this
work is three fold: We generate an optimized block distribution schema that minimizes
the total system repair cost in a data center and we present a full recovery plan for the
system. In addition, we account for new-comer blocks and allocate them to nodes with
minimal computations and without changing the original optimal schema. This makes
our work practical to apply. Hence, a practical solution for node failures is presented by
using a self-designed genetic algorithm that searches within the feasible solution space.
We show that our results are close to optimal.
Author(s)
Itani M., Sharafeddine S., El kabbani I.
Journal/Conference Information
IEEE 30th International Conference on Advanced Information Networking and Applications (AINA - 2016),