The reliability of a large distributed system is analyzed via a new mathematical model. A typical framework is a system in which a set of files is duplicated on several data servers. When one of these servers breaks down, all copies of the files stored on it are lost. In this way, repeated failures may lead to the loss of files. The efficiency of such a network is directly related to the performance of the mechanism used to duplicate files on servers. In this paper, we study the evolution of the network under a natural duplication policy that gives priority to the files with the fewest copies.
We investigate the asymptotic behavior of the network when the number $N$ of servers is large. The analysis is complicated by the large dimension of the state space of the empirical distribution of the state of the network. We introduce a stochastic model of the evolution of the network that takes values in a state space whose dimension does not depend on $N$. Although this description does not have the Markov property, it converges in distribution, as the number of nodes goes to infinity, to a nonlinear Markov process. The rate of decay of the network, which is the key characteristic of interest for these systems, can be expressed in terms of this asymptotic process. The corresponding mean-field convergence results are established. A lower bound on the exponential decay, with respect to time, of the fraction of initial files with at least one copy is obtained.
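To make the setting concrete, the following is a minimal toy sketch of a network in which servers fail and the file with the fewest remaining copies is duplicated first. It is not the model analyzed in the paper: the discrete time steps, the single duplication per step, and all names and parameters (`simulate`, `failure_prob`, `max_copies`, and so on) are assumptions made purely for illustration.

```python
import random


def simulate(num_servers=50, num_files=200, max_copies=2, steps=2000,
             failure_prob=0.05, seed=0):
    """Toy discrete-time sketch (hypothetical, not the paper's model).

    Returns the fraction of initial files that still have at least one
    copy after each step.
    """
    rng = random.Random(seed)
    # copies[f] is the set of servers currently holding a copy of file f.
    copies = [set(rng.sample(range(num_servers), max_copies))
              for _ in range(num_files)]
    surviving = []
    for _ in range(steps):
        # Failure: with probability failure_prob, one random server breaks
        # down and every copy stored on it is lost (assumed failure rule).
        if rng.random() < failure_prob:
            dead = rng.randrange(num_servers)
            for held in copies:
                held.discard(dead)
        # Duplication: at most one new copy per step, given to a file with
        # the fewest copies among files that are neither lost nor full.
        candidates = [f for f, held in enumerate(copies)
                      if 0 < len(held) < max_copies]
        if candidates:
            f = min(candidates, key=lambda i: len(copies[i]))
            free = [s for s in range(num_servers) if s not in copies[f]]
            copies[f].add(rng.choice(free))
        # A file with zero copies can never be restored, so this fraction
        # can only decrease over time.
        surviving.append(sum(1 for held in copies if held) / num_files)
    return surviving
```

Calling `simulate()` and plotting the returned list against the step index gives a rough picture of how the fraction of surviving files decays; the parameter values above are arbitrary and not taken from the paper.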
"A large scale analysis of unreliable stochastic networks." Ann. Appl. Probab. 28 (2) 851 - 887, April 2018. https://doi.org/10.1214/17-AAP1318