Sensitivity analysis for stochastic chemical reaction networks with multiple time-scales

Stochastic models for chemical reaction networks have become very popular in recent years. For such models, the estimation of parameter sensitivities is an important and challenging problem. Sensitivity values help in analyzing the network, understanding its robustness properties and also in identifying the key reactions for a given outcome. Most of the methods that exist in the literature for the estimation of parameter sensitivities rely on Monte Carlo simulations using Gillespie's stochastic simulation algorithm or its variants. It is well-known that such simulation methods can be prohibitively expensive when the network contains reactions firing at different time-scales, which is a feature of many important biochemical networks. For such networks, it is often possible to exploit the time-scale separation and approximately capture the original dynamics by simulating a "reduced" model, which is obtained by eliminating the fast reactions in a certain way. The aim of this paper is to tie these model reduction techniques with sensitivity analysis. We prove that under some conditions, the sensitivity values of the reduced model can be used to approximately recover the sensitivity values for the original model. Through an example we illustrate how our result can help in sharply reducing the computational costs for the estimation of parameter sensitivities for reaction networks with multiple time-scales. To prove our result, we use coupling arguments based on the random time change representation of Kurtz. We also exploit certain connections between the distributions of the occupation times of Markov chains and multi-dimensional wave equations.


Introduction
Chemical reaction networks have traditionally been studied using deterministic models that express the dynamics as a set of ordinary differential equations. Such models ignore the randomness in the dynamics which is caused by the discrete nature of molecular interactions. It is now widely accepted that this randomness can have a significant impact on the macroscopic properties of the system [15,26,24], when the molecules are present in low copy numbers. To account for this randomness and study its effects, a stochastic formulation of the dynamics is necessary, and the most common choice is to model the dynamics as a continuous time Markov process. Such stochastic models have been extensively used in many recent articles [8,3,23,25,27,19] to understand the biological implications of random dynamics. For a detailed survey of Markov models for chemical reaction networks we refer the readers to [2].
Typically, a chemical reaction network depends on various kinetic parameters whose values are uncertain or suffer from measurement error. To determine the effects of inaccuracies in the parameter values, one needs to estimate the sensitivities of a given output with respect to the parameter values. If an output is highly sensitive to a specific parameter value, then greater time and effort may be invested in determining that parameter precisely. Such sensitivity values can also be useful in fine-tuning a certain output (see [11]) or understanding the robustness properties of a system (see [35]).
Estimation of parameter sensitivities is fairly straightforward for deterministic models, but it poses a major challenge for stochastic models. Many methods have been proposed in the literature for tackling this problem [16,31,34,1,17]. However, all these methods rely on extensive simulations of the stochastic model, which is usually carried out using Gillespie's Stochastic Simulation Algorithm [14] or its variants [12,13]. These simulation methods account for each and every reaction event, which makes them prohibitively expensive when the network consists of reactions firing at different time-scales. In such a scenario, the "fast" reactions take up most of the computational time, causing the simulation method to become very inefficient. Since time-scale separation is a feature of many important biochemical networks [29], a new class of methods has been designed to exploit this feature and efficiently simulate the stochastic model [5,37,6]. These methods simulate a "reduced" model which is obtained by eliminating the fast components of the dynamics through a quasi-steady state approximation [18,30]. Such reduced models capture the original dynamics in an approximate sense, and the error in approximation disappears as the time-scale separation gets larger and larger. In [22], Kang and Kurtz develop a systematic theoretical framework for constructing these reduced models. As discussed in [5] and elsewhere, simulations of reduced models are generally much faster than those of the original model. Since most sensitivity estimation algorithms are simulation-based, it is of interest to determine if the parameter sensitivities for the original model can be approximated by the parameter sensitivities for the reduced model. Our aim in this paper is to present a theoretical result which shows that this can indeed be done under certain conditions. Therefore one can obtain enormous savings in the computational costs required for the estimation of parameter sensitivities for stochastic models of multiscale reaction networks. From now on, the term "multiscale network" refers to a chemical reaction network which consists of reactions firing at different time-scales.
It is observed in [22] that variations in the reaction time-scales could be due both to variation in species numbers and to variation in rate constants. However, in this paper we will only consider the latter source of variation. We now describe our stochastic model of a multiscale chemical reaction network. Suppose we have a well-stirred system consisting of $d$ chemical species. Its state at any time can be described by a vector in $\mathbb{N}_0^d$ whose $i$-th component is the non-negative integer corresponding to the number of molecules of the $i$-th species. These chemical species interact through $K$ predefined reaction channels, and every time the $k$-th reaction fires, the state of the system is displaced by the $d$-dimensional stoichiometric vector $\zeta_k \in \mathbb{Z}^d$. If the state of the system is $x$, the rate at which the $k$-th reaction fires is given by $N_0^{\beta_k} \lambda_k(x)$, where $N_0$ is assumed to be a "large" normalization parameter and $\lambda_k : \mathbb{N}_0^d \to [0,\infty)$ is the propensity function for the $k$-th reaction. The powers of $N_0$ in front of the propensity functions determine the various time-scales at which different reactions act. In a stochastic setting, such a chemical reaction network can be modeled as a continuous time Markov process $\{X^{N_0}(t) : t \geq 0\}$ over $\mathbb{N}_0^d$. Given such a reaction network we have the flexibility of selecting our reference time-scale as $\gamma$. This means that we observe the reaction dynamics at times that are scaled by the factor $N_0^{\gamma}$. In other words, we observe the process $\{X^{N_0}_{\gamma}(t) : t \geq 0\}$ defined by $X^{N_0}_{\gamma}(t) = X^{N_0}(N_0^{\gamma} t)$. Note that in the process $X^{N_0}_{\gamma}$, each reaction $k$ fires at a rate of order $N_0^{\beta_k+\gamma}$. Hence reactions can be termed "fast", "slow" or "natural" according to whether $\beta_k+\gamma > 0$, $\beta_k+\gamma < 0$ or $\beta_k+\gamma = 0$ respectively. Note that as the value of $N_0$ increases, the slow reactions get slower and the fast reactions get faster. On the other hand, the natural reactions remain unaffected by the increase in $N_0$. If we simulate the process $X^{N_0}_{\gamma}$ using Gillespie's Stochastic Simulation Algorithm, then the fast reactions take up most of the computational time, making the simulation procedure extremely cumbersome.
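To make this bottleneck concrete, the following sketch implements the Stochastic Simulation Algorithm for a hypothetical two-species network in which a fast interconversion pair carries the factor $N_0$ while production and degradation are "natural". The network, rate constants and function names are ours, chosen purely for illustration.

```python
import random

def gillespie_ssa(x0, stoich, rates, t_end, rng):
    """Simulate one SSA path; return the final state and the event count."""
    t, x, events = 0.0, list(x0), 0
    while True:
        a = [r(x) for r in rates]           # propensities at the current state
        a0 = sum(a)
        if a0 == 0.0:
            return x, events                # no reaction can fire any more
        t += rng.expovariate(a0)            # exponential waiting time
        if t > t_end:
            return x, events
        u, k, acc = rng.random() * a0, 0, a[0]
        while acc < u:                      # choose reaction k with prob a[k]/a0
            k += 1
            acc += a[k]
        x = [xi + s for xi, s in zip(x, stoich[k])]
        events += 1

# Hypothetical two-species network: fast interconversion A <-> B (rates
# carrying the factor N0) plus "natural" production and degradation of A.
N0 = 100.0
stoich = [(-1, 1), (1, -1), (1, 0), (-1, 0)]
rates = [lambda x: N0 * 1.0 * x[0],   # A -> B   (fast)
         lambda x: N0 * 2.0 * x[1],   # B -> A   (fast)
         lambda x: 5.0,               # 0 -> A   (natural)
         lambda x: 0.1 * x[0]]        # A -> 0   (natural)

x_end, events = gillespie_ssa((30, 10), stoich, rates, t_end=1.0,
                              rng=random.Random(0))
```

With $N_0 = 100$ the path on $[0,1]$ already contains thousands of events, almost all of them fast interconversions, while the natural reactions fire only a handful of times; increasing $N_0$ inflates the event count without changing the slow statistics.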
Fortunately in certain situations, we can obtain a fairly good approximation of the dynamics by simulating a reduced model which does not contain any fast reactions. The state variables in this reduced model correspond to linear combinations of species numbers that are unaffected by the fast reactions (see [5,37]). As described in [22], such model reductions can be derived by replacing $N_0$ by $N$ and showing that for a certain projection map $\Pi$ on $\mathbb{R}^d$, the sequence of processes $\{\Pi X^N_\gamma : N\in\mathbb{N}\}$ has a well-defined limit as $N\to\infty$. The limiting process $\hat{X}$ corresponds to the stochastic model of a reduced reaction network made up of only those reactions that are "natural" for the reference time-scale $\gamma$, making its simulation far less computationally demanding than the original model. In Section 2 we present these model reduction results in greater detail. Now suppose that the output of interest is given by a real-valued function $f$ and we would like to estimate the expectation $\mathbb{E}\left(f\left(X^{N_0}_\gamma(t)\right)\right)$ for some observation time $t\geq 0$. If $f$ is invariant under the projection $\Pi$ (that is, $f(x) = f(\Pi x)$ for all $x\in\mathbb{N}_0^d$) then we would expect that
$$\lim_{N_0\to\infty}\mathbb{E}\left(f\left(X^{N_0}_\gamma(t)\right)\right) = \mathbb{E}\left(f\left(\hat{X}(t)\right)\right). \qquad (1.1)$$
This limit implies that for large values of $N_0$, the quantity $\mathbb{E}(f(X^{N_0}_\gamma(t)))$ is "close" to $\mathbb{E}(f(\hat{X}(t)))$. Hence instead of estimating the former quantity directly, we can estimate the latter quantity through simulations of the reduced model, and save a significant amount of computational effort.
As stated before, our aim in this paper is to tie these model reduction results with sensitivity analysis. Suppose that the propensity functions $\lambda_1,\ldots,\lambda_K$ depend on a scalar parameter $\theta$. Now when the state is $x$, the $k$-th reaction fires at rate $N_0^{\beta_k}\lambda_k(x,\theta)$. With these propensity functions, we can define the processes $X^{N_0}_{\gamma,\theta}$ and $X^N_{\gamma,\theta}$ as before, where the subscript $\theta$ is introduced to make the parameter dependence explicit. For an output function $f$ chosen as above, we would like to estimate the sensitivity of the expectation $\mathbb{E}(f(X^{N_0}_{\gamma,\theta}(t)))$ with respect to $\theta$. In other words, we are interested in estimating
$$S^{N_0}_{\gamma,\theta}(f,t) = \frac{d}{d\theta}\,\mathbb{E}\left(f\left(X^{N_0}_{\gamma,\theta}(t)\right)\right). \qquad (1.2)$$
We remarked before that most direct methods to estimate this quantity are simulation-based. Since simulations of the process $X^{N_0}_{\gamma,\theta}$ are very expensive, it is worthwhile to explore the possibility of using reduced models to obtain a close approximation for $S^{N_0}_{\gamma,\theta}(f,t)$. Suppose that for each $\theta$ we have a process $\hat{X}_\theta$ which corresponds to the reduced model. Moreover there exists a projection $\Pi$ (independent of $\theta$) such that $\Pi X^N_{\gamma,\theta}$ converges in distribution to $\hat{X}_\theta$ as $N\to\infty$. Then similar to (1.1) we would get
$$\lim_{N_0\to\infty}\mathbb{E}\left(f\left(X^{N_0}_{\gamma,\theta}(t)\right)\right) = \mathbb{E}\left(f\left(\hat{X}_\theta(t)\right)\right).$$
However this relation does not ensure that
$$\lim_{N_0\to\infty} S^{N_0}_{\gamma,\theta}(f,t) = \frac{d}{d\theta}\,\mathbb{E}\left(f\left(\hat{X}_\theta(t)\right)\right), \qquad (1.3)$$
because in general, limits and derivatives do not commute. Note that if (1.3) holds then for large values of $N_0$, the quantity $S^{N_0}_{\gamma,\theta}(f,t)$ is close to the value $\frac{d}{d\theta}\,\mathbb{E}(f(\hat{X}_\theta(t)))$, which can be easily estimated using any of the sensitivity estimation methods [16,31,34,1,17], since simulation of the reduced model is computationally much easier than that of the original model. This motivates the main result of the paper, which is essentially to show that (1.3) holds under certain conditions. In the above discussion we had assumed that the output function $f$ is invariant under the projection $\Pi$, which is a highly restrictive assumption. Therefore we will prove a relation analogous to (1.3) for a general function $f$. Even though our result is easy to state, its proof is quite technical. The main complication comes from the fact that the dynamics at different time-scales may interact with each other in non-linear ways. Due to this problem, the proof of our main result involves several steps which are loosely described below. We mentioned above that for a certain projection $\Pi$, the process $\Pi X^N_{\gamma,\theta}$ may have a well-defined limit as $N\to\infty$. In such a situation, the left-over part of the process, $(I-\Pi)X^N_{\gamma,\theta}$, does not converge in the functional sense but it converges in the sense of occupation measures (see [22] or Section 2). As reported in [32], the distribution of occupation measures of Markov processes is related to the evolution of a system of multi-dimensional wave equations. Using this relation we construct another process $W^N_\theta$ whose distribution has some regularity properties with respect to $\theta$. The process $W^N_\theta$ captures the one-dimensional distribution of the process $X^N_{\gamma,\theta}$, which means that for any function $f$ and time $t$, we can find a function $g$ such that $\mathbb{E}\left(f\left(X^N_{\gamma,\theta}(t)\right)\right) = \mathbb{E}\left(g\left(W^N_\theta(t)\right)\right)$. Furthermore, the fast components of the dynamics are averaged out in the process $W^N_\theta$, making it simpler to analyze than the original process $X^N_{\gamma,\theta}$. Next we couple the processes $W^N_\theta$ and $W^N_{\theta+h}$ (for a small $h$) in such a way that it allows us to take the limits $h\to 0$ and $N\to\infty$ (in this order) of an appropriate quantity and prove our main result. This coupling is constructed using the random time change representation of Kurtz (see Chapter 7 in [9]).
As a corollary of our main result we obtain an important relationship which can be useful in estimating steady-state parameter sensitivities. Let $X_\theta$ be a stochastic process which models the dynamics of the reaction network described above, with $\beta_k = 0$ for each $k$ and $\gamma = 0$. Assume that this process is ergodic with stationary distribution $\pi_\theta$, and that this distribution is difficult to compute analytically. Ergodicity implies that for any output function $f$ we have
$$\lim_{t\to\infty}\mathbb{E}\left(f(X_\theta(t))\right) = \int f(x)\,\pi_\theta(dx),$$
where the integral is taken over the state space of $X_\theta$. Suppose we are interested in computing the steady-state parameter sensitivity given by
$$\frac{d}{d\theta}\int f(x)\,\pi_\theta(dx).$$
Since $\pi_\theta$ is unknown, this quantity cannot be computed directly and one has to estimate it using simulations. This can be problematic because simulations can only be performed until a finite time, and in general one is not sure if the sensitivity value estimated at a finite (but large) time $t$ is close to the steady-state value. However, using our main result we can conclude that under certain conditions we have
$$\lim_{t\to\infty}\frac{d}{d\theta}\,\mathbb{E}\left(f(X_\theta(t))\right) = \frac{d}{d\theta}\int f(x)\,\pi_\theta(dx). \qquad (1.4)$$
The details are given in Section 3.1. Relation (1.4) proves that for a large (but finite) $t$, the steady-state parameter sensitivity is well-approximated by $\frac{d}{d\theta}\,\mathbb{E}(f(X_\theta(t)))$, which can be estimated using known simulation-based methods [16,31,34,1,17]. Note that (1.4) is sometimes implicitly assumed (see [36] for example) without proof.
All the results in the paper are stated for a scalar parameter $\theta$, but the extension of these results to vector-valued parameters is relatively straightforward. Finally we would like to mention that even though our paper is written in the context of chemical reaction networks, our main result can be applied to any continuous time Markov process over a discrete lattice with time-scale separation in the transition rates. Other than reaction networks, such processes arise naturally in queuing theory and population modeling.
This paper is organized as follows. In Section 2 we discuss the model reduction results for multiscale networks. The results stated there are simple adaptations of the results in [22]. Our main result is presented in Section 3 and its proof is given in Section 4. In Section 5 we provide an illustrative example to show how our result can be useful.

Notation
We now introduce some notation that we will use throughout this paper. Let $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{Z}$, $\mathbb{N}$ and $\mathbb{N}_0$ denote the sets of all reals, nonnegative reals, integers, positive integers and nonnegative integers respectively. For any $a,b\in\mathbb{R}$, their minimum is given by $a\wedge b$. The positive and negative parts of $a$ are indicated by $a^+$ and $a^-$ respectively. The number of elements in any finite set $E$ is denoted by $|E|$. By $\mathrm{Unif}(0,1)$ we refer to the uniform distribution on $(0,1)$. If $\Pi$ is a projection map on $\mathbb{R}^n$ then we write $\Pi x$ instead of $\Pi(x)$ for any $x\in\mathbb{R}^n$, and for any $S\subset\mathbb{R}^n$ the set $\Pi S$ is given by $\Pi S = \{\Pi x : x\in S\}$.
The vectors of all zeros and all ones in $\mathbb{R}^n$ are denoted by $0_n$ and $1_n$ respectively. Let $\mathbb{M}(n,n)$ be the space of all $n\times n$ matrices with real entries. For any $M\in\mathbb{M}(n,n)$, the entry at the $i$-th row and the $j$-th column is indicated by $M_{ij}$. The transpose and inverse of $M$ are indicated by $M^T$ and $M^{-1}$ respectively. The symbol $I_n$ refers to the identity matrix in $\mathbb{M}(n,n)$. For any $v = (v_1,\ldots,v_n)\in\mathbb{R}^n$, $\mathrm{Diag}(v)$ refers to the matrix in $\mathbb{M}(n,n)$ whose non-diagonal entries are all $0$ and whose diagonal entries are $v_1,\ldots,v_n$. A matrix in $\mathbb{M}(n,n)$ is called stable if all its eigenvalues have strictly negative real parts. While multiplying a matrix with a vector, we always regard the vector as a column vector.
Let $(S,d)$ be a metric space. Then by $B(S)$ we refer to the set of all bounded real-valued Borel measurable functions on $S$. By $\mathcal{P}(S)$ we denote the space of all Borel probability measures on $S$. This space is equipped with the weak topology. The space of càdlàg functions (that is, right continuous functions with left limits) from $[0,\infty)$ to $S$ is denoted by $D_S[0,\infty)$ and it is endowed with the Skorohod topology (for details see Chapter 3, Ethier and Kurtz [9]). For any $f\in D_S[0,\infty)$ and $t>0$, $f(t-)$ refers to the left-limit $\lim_{s\to t^-} f(s)$.
An operator $\mathbb{A}$ on $B(S)$ is a linear mapping that maps any function in its domain $\mathcal{D}(\mathbb{A})\subset B(S)$ to a function in $B(S)$. The notion of the martingale problem associated to an operator $\mathbb{A}$ is introduced and developed in Chapter 4, Ethier and Kurtz [9]. In this paper, by a solution of the martingale problem for $\mathbb{A}$ we mean a measurable stochastic process $X$ with paths in $D_S[0,\infty)$ such that for any $f\in\mathcal{D}(\mathbb{A})$,
$$f(X(t)) - \int_0^t \mathbb{A}f(X(s))\,ds$$
is a martingale with respect to the filtration generated by $X$. For a given initial distribution $\pi\in\mathcal{P}(S)$, a solution $X$ of the martingale problem for $\mathbb{A}$ is a solution of the martingale problem for $(\mathbb{A},\pi)$ if $\pi = \mathbb{P}X(0)^{-1}$. If such a solution $X$ exists uniquely for all $\pi\in\mathcal{P}(S)$, then we say that the martingale problem for $\mathbb{A}$ is well-posed. Additionally, we say that $\mathbb{A}$ is the generator of the process $X$.
Throughout the paper ⇒ denotes convergence in distribution.

Model Reduction results for multiscale networks
In this section we present the model reduction results for multiscale networks. Recall the definition of the process $X^N_\gamma$ from Section 1. We shall soon see that this process is well-defined under some assumptions on the propensity functions. Our primary goal in this section is to find the values of the reference time-scale $\gamma$ such that the process $X^N_\gamma$ has a well-behaved limit as $N\to\infty$. This limit may not exist for the whole process but only for a suitable projection of the process. When the limit exists, the limiting process can be viewed as the stochastic model of a reduced reaction network, which only has reactions firing at a single time-scale. The results mentioned in this section are derived from the more general results in [22]. Before we proceed we define a property of real-valued functions.

Definition 2.1 Let $U$ be a subset of $\mathbb{R}^m$, $f$ be a real-valued function on $U$ and $\Pi$ be a projection map on $\mathbb{R}^m$. We say that the function $f$ is polynomially growing with respect to projection $\Pi$ if there exist constants $C, r > 0$ such that
$$|f(x)| \leq C\left(1 + \|\Pi x\|^r\right) \quad \text{for all } x\in U. \qquad (2.5)$$
We say that a function $f$ is linearly growing with respect to projection $\Pi$ if (2.5) is satisfied for $r = 1$.

A sequence of real-valued functions $\{f_N : N\in\mathbb{N}\}$ on $U$ is said to be polynomially (linearly) growing with respect to projection $\Pi$ if for some $C > 0$ and $r > 0$ ($r = 1$), the relation (2.5) holds for each $f_N$. A function (or a sequence of functions) is called polynomially (linearly) growing if it is polynomially (linearly) growing with respect to the identity projection $I$.
Our first task is to ensure that there is a well-defined process which describes the stochastic dynamics of our multiscale reaction network.For this purpose we make certain assumptions.
Assumption 2.2 (A) For any $k$ and $x\in\mathbb{N}_0^d$, if $\lambda_k(x) > 0$ then $(x+\zeta_k)$ has all non-negative components.
(B) Let $P$ be the set of those reactions which have a net positive effect on the total population, that is,
$$P = \left\{k\in\{1,\ldots,K\} : \langle 1_d,\zeta_k\rangle > 0\right\}.$$
Then the function $\lambda_P : \mathbb{N}_0^d\to\mathbb{R}_+$ defined by $\lambda_P(x) = \sum_{k\in P}\lambda_k(x)$ is linearly growing.
Part (A) of this assumption prevents the reaction dynamics from leaving the state space $\mathbb{N}_0^d$. The significance of part (B) will become clear in the next paragraph. Informally, part (B) says that all the reactions that add molecules into the system have orders 0 or 1. If there is a compact set $S$ such that for each $k$, $\lambda_k(x) = 0$ for all $x\notin S$, then part (B) is trivially satisfied. Let $x_0$ be a vector in $\mathbb{N}_0^d$. Throughout the paper, the initial state of the reaction dynamics is fixed to be $x_0\in\mathbb{N}_0^d$, and the corresponding stoichiometric compatibility class is given by
$$\mathcal{S} = \left\{x_0 + \sum_{k=1}^K \eta_k\zeta_k : \eta_1,\ldots,\eta_K\in\mathbb{N}_0\right\}\cap\mathbb{N}_0^d.$$
Part (A) of Assumption 2.2 ensures that the reaction dynamics is always inside $\mathcal{S}$. From the description of the multiscale network with reference time-scale $\gamma$ (see Section 1), it is clear that the generator of the reaction dynamics should be given by the operator $\mathbb{A}^N_\gamma$ whose domain is $\mathcal{D}(\mathbb{A}^N_\gamma) = B(\mathcal{S})$ and whose action on any $f\in B(\mathcal{S})$ is given by

$$\mathbb{A}^N_\gamma f(x) = \sum_{k=1}^K N^{\beta_k+\gamma}\,\lambda_k(x)\left(f(x+\zeta_k) - f(x)\right). \qquad (2.7)$$
From Lemma A.1 we can argue that under Assumption 2.2, the martingale problem for $\mathbb{A}^N_\gamma$ is well-posed. Hence we can define $X^N_\gamma$ as the Markov process with generator $\mathbb{A}^N_\gamma$ and initial state $x_0$. The random time change representation (see Chapter 7 in [9]) of this process is given by
$$X^N_\gamma(t) = x_0 + \sum_{k=1}^K Y_k\left(N^{\beta_k+\gamma}\int_0^t \lambda_k\left(X^N_\gamma(s)\right)ds\right)\zeta_k, \qquad (2.8)$$
where $\{Y_k : k=1,\ldots,K\}$ is a family of independent unit rate Poisson processes.
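The representation (2.8) also suggests a simulation scheme: track the internal time fed to each unit-rate Poisson clock $Y_k$ and fire whichever clock reaches its next jump first, as in Anderson's modified next reaction method. The sketch below is a minimal, illustrative version of this idea; the function names and the birth-death example are ours.

```python
import math, random

def rtc_simulate(x0, stoich, prop, t_end, rng):
    """Simulate X(t) = x0 + sum_k Y_k( int_0^t lambda_k(X(s)) ds ) zeta_k by
    tracking each unit-rate Poisson clock Y_k separately (a minimal version
    of the modified next reaction method)."""
    K = len(stoich)
    t, x = 0.0, list(x0)
    Tk = [0.0] * K                                  # internal time fed to Y_k
    Pk = [rng.expovariate(1.0) for _ in range(K)]   # next jump level of Y_k
    while True:
        lam = [prop[k](x) for k in range(K)]
        # external time until clock k next jumps
        dts = [(Pk[k] - Tk[k]) / lam[k] if lam[k] > 0 else math.inf
               for k in range(K)]
        k_min = min(range(K), key=lambda k: dts[k])
        dt = dts[k_min]
        if t + dt > t_end:
            return x
        t += dt
        for k in range(K):                          # advance all internal clocks
            if lam[k] > 0:
                Tk[k] += lam[k] * dt
        Pk[k_min] += rng.expovariate(1.0)           # schedule Y_{k_min}'s next jump
        x = [xi + s for xi, s in zip(x, stoich[k_min])]

# Hypothetical birth-death example: 0 -> X at rate 2, X -> 0 at rate 0.5*x.
x_end = rtc_simulate((5,), [(1,), (-1,)],
                     [lambda x: 2.0, lambda x: 0.5 * x[0]],
                     t_end=10.0, rng=random.Random(3))
```

This scheme samples the same Markov process as the SSA; keeping an explicit clock $Y_k$ per reaction is also the structure used later for the coupling arguments.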

Convergence at the first time-scale
From (2.8), it is immediate that if the reference time-scale $\gamma$ is such that $\beta_k+\gamma\leq 0$ for each $k$, then all the reactions are either "slow" or "natural" at this time-scale. Therefore we would expect the dynamics to converge as $N\to\infty$, and the limiting dynamics will only consist of the natural reactions.
To make this precise, define
$$\gamma_1 = -\max\{\beta_k : k=1,\ldots,K\} \quad \text{and} \quad \Gamma_1 = \{k : \beta_k+\gamma_1 = 0\}.$$
Then $\gamma_1$ is the first time-scale for which the process $X^N_{\gamma_1}$ has a non-trivial limit as $N\to\infty$, and $\Gamma_1$ is the set of natural reactions for this time-scale. Note that $\beta_k+\gamma_1\leq 0$ for each $k$, and hence using (2.8) we can show that $X^N_{\gamma_1}\Rightarrow\hat{X}$ as $N\to\infty$, where the process $\hat{X}$ satisfies
$$\hat{X}(t) = x_0 + \sum_{k\in\Gamma_1} Y_k\left(\int_0^t\lambda_k\left(\hat{X}(s)\right)ds\right)\zeta_k.$$
In other words, $\hat{X}$ is the process with initial state $x_0$ and generator $\mathbb{C}_0$ given by
$$\mathbb{C}_0 f(x) = \sum_{k\in\Gamma_1}\lambda_k(x)\left(f(x+\zeta_k) - f(x)\right). \qquad (2.11)$$
The well-posedness of the martingale problem for $\mathbb{C}_0$ can be verified from Lemma A.1, and therefore the process $\hat{X}$ is well-defined. The precise statement of this convergence result is given below.

Proposition 2.3 Suppose that Assumption 2.2 holds. Then $X^N_{\gamma_1}\Rightarrow\hat{X}$ as $N\to\infty$, where $\hat{X}$ is the process with initial state $x_0$ and generator $\mathbb{C}_0$ given by (2.11).
Observe that this proposition can be viewed as a model reduction result, which says that at the time-scale $\gamma_1$, the dynamics of the original model (given by $X^{N_0}_{\gamma_1}$) is well-approximated by the dynamics of a reduced model (given by $\hat{X}$) for large values of $N_0$. This reduced model is obtained by simply dropping the "slow" reactions from the network. Such a model reduction result is trivial because one can easily see from the reaction time-scales that the slow reactions will not participate in the limiting dynamics. In the next section we describe a non-trivial model reduction result which is more useful from the point of view of applications.

Convergence at the second time-scale
As discussed in several recent papers [4,22], there may be a second time-scale $\gamma_2\,(>\gamma_1)$ so that a certain projection $\Pi_2$ of the process $X^N_{\gamma_2}$ has a well-behaved limit as $N\to\infty$. At this second time-scale, the network has "fast" reactions in addition to the "slow" and "natural" reactions. The projection $\Pi_2$ is such that the fast reactions do not affect the projected process $\Pi_2 X^N_{\gamma_2}$. Assuming quasi-stationarity for the fast subnetwork [18,30], we obtain a well-defined limit $\hat{X}$ for the process $\Pi_2 X^N_{\gamma_2}$. Moreover, the limiting process $\hat{X}$ corresponds to the stochastic model of a reduced reaction network which only contains those reactions that are natural for the time-scale $\gamma_2$.
We now describe this convergence result formally. Suppose that the set
$$S_2 = \left\{v\in\mathbb{R}^d\setminus\{0_d\} : \langle v,\zeta_k\rangle = 0 \ \text{for all } k\in\Gamma_1\right\}$$
is non-empty. Then for any $v\in S_2$, the process $\{\langle v, X^N_{\gamma_2}(t)\rangle : t\geq 0\}$ is unaffected by the reactions in $\Gamma_1$. Let
$$\gamma_2 = \min\{-\beta_k : k\notin\Gamma_1\} \quad \text{and} \quad \Gamma_2 = \{k : \beta_k+\gamma_2 = 0\}. \qquad (2.12)$$
Then $\gamma_2 > \gamma_1$ by definition, and note that the reactions in $\Gamma_1$ are fast at the time-scale $\gamma_2$. Let $L_2$ be the subspace spanned by the vectors in $S_2$ and let $\Pi_2$ be the projection map from $\mathbb{R}^d$ onto $L_2$. Then
$$\Pi_2\zeta_k = 0_d \quad \text{for all } k\in\Gamma_1, \qquad (2.13)$$
which means that the fast reactions would leave the process $\Pi_2 X^N_{\gamma_2}$ unchanged. Let $L_1$ be the space spanned by the vectors in $(I-\Pi_2)\mathcal{S} = \{(I-\Pi_2)x : x\in\mathcal{S}\}$, where $I$ is the identity map. For any $v\in\Pi_2\mathcal{S}$ let
$$H_v = \left\{(I-\Pi_2)x : x\in\mathcal{S} \ \text{and} \ \Pi_2 x = v\right\} \qquad (2.14)$$
and define the operator $\mathbb{C}_v$ by
$$\mathbb{C}_v f(z) = \sum_{k\in\Gamma_1}\lambda_k(v+z)\left(f(z+\zeta_k) - f(z)\right). \qquad (2.15)$$
The operator $\mathbb{C}_v$ can be seen as the generator of a Markov process with state space $H_v$. We now define the occupation measure of the process $(I-\Pi_2)X^N_{\gamma_2}$. This is a random measure on $L_1\times[0,\infty)$ given by
$$V^N(C\times[0,t]) = \int_0^t \mathbf{1}_C\left((I-\Pi_2)X^N_{\gamma_2}(s)\right)ds,$$
where $C$ is any Borel measurable subset of $L_1$. Note that for any $s\geq 0$ we can write $X^N_{\gamma_2}(s) = \Pi_2 X^N_{\gamma_2}(s) + (I-\Pi_2)X^N_{\gamma_2}(s)$. Therefore using (2.8) and (2.13), we can write the random time change representation for the process $\Pi_2 X^N_{\gamma_2}$ as
$$\Pi_2 X^N_{\gamma_2}(t) = \Pi_2 x_0 + \sum_{k\notin\Gamma_1} Y_k\left(N^{\beta_k+\gamma_2}\int_{L_1\times[0,t]}\lambda_k\left(\Pi_2 X^N_{\gamma_2}(s)+y\right)V^N(dy\times ds)\right)\Pi_2\zeta_k. \qquad (2.16)$$
In other words, for any $f\in B(\mathcal{S})$ and $t>0$, time integrals of the form $\int_0^t f\left((I-\Pi_2)X^N_{\gamma_2}(s)\right)ds$ can be expressed as integrals of $f$ against the occupation measure $V^N$. Since $\beta_k+\gamma_2 < 0$ for every $k\notin\Gamma_1\cup\Gamma_2$, we can expect from (2.16) that $\Pi_2 X^N_{\gamma_2}\Rightarrow\hat{X}$ as $N\to\infty$, where the process $\hat{X}$ satisfies
$$\hat{X}(t) = \Pi_2 x_0 + \sum_{k\in\Gamma_2} Y_k\left(\int_{L_1\times[0,t]}\lambda_k\left(\hat{X}(s)+y\right)V(dy\times ds)\right)\Pi_2\zeta_k.$$
It can be seen that between consecutive jump times of the process $\Pi_2 X^N_{\gamma_2}$, if the state of the process $\Pi_2 X^N_{\gamma_2}$ is $v$, then the process $(I-\Pi_2)X^N_{\gamma_2}$ evolves like a Markov process with generator $\mathbb{C}_v$. If the generator $\mathbb{C}_v$ corresponds to an ergodic Markov process with the unique stationary distribution $\pi_v\in\mathcal{P}(H_v)$, then the limiting measure $V$ has the form
$$V(dy\times ds) = \pi_{\hat{X}(s)}(dy)\,ds. \qquad (2.17)$$
Therefore the random time change representation of the process $\hat{X}$ becomes
$$\hat{X}(t) = \Pi_2 x_0 + \sum_{k\in\Gamma_2} Y_k\left(\int_0^t \hat{\lambda}_k\left(\hat{X}(s)\right)ds\right)\Pi_2\zeta_k, \qquad (2.18)$$
where $\hat{\lambda}_k(v) = \int_{H_v}\lambda_k(v+z)\,\pi_v(dz)$. Before we state the convergence result, we need to make some assumptions.
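The averaging in (2.17) can be seen numerically: for a sped-up two-state chain, the fraction of time spent in each state over a fixed window is close to the stationary distribution. A toy sketch, with rates and names that are ours:

```python
import random

def occupation_fractions(q01, q10, T, rng):
    """Fraction of time a two-state chain (rates q01: 0->1, q10: 1->0)
    spends in each state over [0, T] -- an empirical occupation measure."""
    t, state = 0.0, 0
    occ = [0.0, 0.0]
    while t < T:
        rate = q01 if state == 0 else q10
        dt = rng.expovariate(rate)          # exponential holding time
        occ[state] += min(dt, T - t)        # clip the last sojourn at T
        t += dt
        state = 1 - state
    return [o / T for o in occ]

# Fast chain: both rates carry a factor N = 100; the stationary law is
# pi = (q10, q01) / (q01 + q10) = (0.25, 0.75) for the rates below.
rng = random.Random(1)
frac = occupation_fractions(q01=100 * 3.0, q10=100 * 1.0, T=50.0, rng=rng)
```

With the speed-up factor 100, the empirical fractions are already within a few percent of $(1/4, 3/4)$; this is precisely the averaging that replaces $V^N(dy\times ds)$ by $\pi_{\hat X(s)}(dy)\,ds$ in the limit.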
Assumption 2.4 (A) For any $v\in\Pi_2\mathcal{S}$, the space $H_v$ (given by (2.14)) is finite.
(B) The Markov process with generator $\mathbb{C}_v$ is ergodic and its unique stationary distribution is $\pi_v\in\mathcal{P}(H_v)$.
(C) Let $P$ be the set of reactions given by
$$P = \left\{k\in\Gamma_2 : \langle 1_d,\Pi_2\zeta_k\rangle > 0\right\}.$$
Then the function $\lambda_P : \mathbb{N}_0^d\to\mathbb{R}_+$ defined by $\lambda_P(x) = \sum_{k\in P}\lambda_k(x)$ is linearly growing with respect to projection $\Pi_2$ (see Definition 2.1).
Observe that part (C) implies that the functions $\{\hat{\lambda}_k : k\in\Gamma_2\}$ satisfy part (B) of Assumption 2.2. Therefore the process $\hat{X}$ satisfying (2.18) is well-defined due to Lemma A.1. Note that the set $H_v$ can either be finite or countably infinite. Our main result (Theorem 3.2) should hold in both cases, but to simplify the proof we assume that $H_v$ is finite (part (A) of Assumption 2.4). We later discuss how the proof changes when this is not the case (see Remark 4.18). In many important biochemical multiscale networks, the fast reactions conserve some quantity that only depends on the natural dynamics (see [5,37,29]). In such a scenario, the set $H_v$ will be finite. We now state the convergence result at the second time-scale.
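When $H_v$ is finite, the stationary distribution $\pi_v$ appearing in part (B) can be computed directly from the generator matrix by solving $\pi Q = 0$ together with $\sum_i\pi_i = 1$. A sketch, where the 3-state generator is a made-up example rather than one from the paper:

```python
import numpy as np

def stationary_distribution(Q):
    """Stationary distribution pi of a finite CTMC with generator matrix Q
    (off-diagonal entries are jump rates, rows sum to zero): solve the
    overdetermined system pi Q = 0, sum(pi) = 1 by least squares."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])   # stack normalization below pi Q = 0
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical fast subnetwork whose state space H_v has 3 states:
# a birth-death chain with up rates (2, 2) and down rates (1, 1).
Q = np.array([[-2.0,  2.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  1.0, -1.0]])
pi = stationary_distribution(Q)
```

For this birth-death generator, detailed balance gives $\pi = (1, 2, 4)/7$, which the least-squares solve reproduces exactly since the linear system is consistent.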
Proposition 2.5 Suppose that Assumptions 2.2 and 2.4 hold. Then $(\Pi_2 X^N_{\gamma_2}, V^N)\Rightarrow(\hat{X}, V)$ as $N\to\infty$, where the process $\hat{X}$ satisfies (2.18) and $V$ satisfies (2.17).
Proof.The proof follows from Theorem 5.1 in [22].

Convergence at higher time-scales
In Section 2.2 we outlined a systematic procedure to obtain a single-step model reduction for a multiscale reaction network. The main idea was to assume ergodicity for the "fast" sub-network and incorporate its steady-state information in the propensities of the "natural" reactions. Moreover, the "slow" reactions can be ignored completely. This single-step reduction process can be carried over multiple steps to construct a hierarchy of reduced models. This is useful because many biochemical networks have reactions spanning several time-scales (see [21], for example). Hence for a given reference time-scale, many steps of model reduction may be required to obtain a model which is simple enough to be amenable to the extensive simulations that are required for sensitivity estimation.

For our main result, we will assume that we are in the situation of Proposition 2.5, which describes a single-step model reduction. In Section 3.2, we shall discuss how our result can be used to estimate parameter sensitivity using reduced models that are obtained after many steps of model reduction.

The Main Result
In this section we present our main result on sensitivity analysis of multiscale networks. Suppose that the propensity functions $\lambda_1,\ldots,\lambda_K$ depend on a real-valued parameter $\theta$ and Assumption 2.2 is satisfied for each value of $\theta$. If the reference time-scale is $\gamma$, then the reaction dynamics will be captured by the generator

$$\mathbb{A}^N_{\gamma,\theta} f(x) = \sum_{k=1}^K N^{\beta_k+\gamma}\,\lambda_k(x,\theta)\left(f(x+\zeta_k) - f(x)\right). \qquad (3.20)$$
Using Lemma A.1 we can argue that the martingale problem corresponding to $\mathbb{A}^N_{\gamma,\theta}$ is well-posed. Let $X^N_{\gamma,\theta}$ be the process with generator $\mathbb{A}^N_{\gamma,\theta}$ and initial state $x_0$. We use the same notation as in Section 2.2. Note that the definitions of $\gamma_i$, $\Gamma_i$, $S_i$ and $L_i$, for $i=1$ and $2$, only depend on the stoichiometry of the reaction network and are hence independent of $\theta$. Similarly, the projection map $\Pi_2$ and the space $H_v$ (see (2.14)) do not depend on $\theta$. The definition of the operator $\mathbb{C}_v$ (see (2.15)) changes to
$$\mathbb{C}^v_\theta f(z) = \sum_{k\in\Gamma_1}\lambda_k(v+z,\theta)\left(f(z+\zeta_k) - f(z)\right). \qquad (3.21)$$
For our main result we require the following assumptions.

Assumption 3.1 (A) For any $v\in\Pi_2\mathcal{S}$, the space $H_v$ (given by (2.14)) is finite. (B) A Markov process with generator $\mathbb{C}^v_\theta$ is ergodic and its unique stationary distribution is $\pi^v_\theta\in\mathcal{P}(H_v)$. (C) Let $x\in\mathcal{S}$ be fixed. Then for any $k=1,\ldots,K$, the function $\lambda_k(x,\cdot)$ is twice-continuously differentiable in a neighbourhood of $\theta$.
(E) The functions

Note that if Assumption 3.1 holds then Assumption 2.4 will also hold. Hence Proposition 2.5 ensures that $\Pi_2 X^N_{\gamma_2,\theta}\Rightarrow\hat{X}_\theta$ as $N\to\infty$. The process $\hat{X}_\theta$ has initial state $\Pi_2 x_0$ and generator $\hat{\mathbb{A}}_\theta$ given by
$$\hat{\mathbb{A}}_\theta f(v) = \sum_{k\in\Gamma_2}\hat{\lambda}_k(v,\theta)\left(f(v+\Pi_2\zeta_k) - f(v)\right), \qquad (3.22)$$
where the function $\hat{\lambda}_k$ is defined by
$$\hat{\lambda}_k(v,\theta) = \int_{H_v}\lambda_k(v+z,\theta)\,\pi^v_\theta(dz). \qquad (3.23)$$
We now state our main result, whose proof is given in Section 4.3.
Theorem 3.2 Suppose that Assumption 3.1 holds and the function $f : \mathcal{S}\to\mathbb{R}$ is polynomially growing with respect to projection $\Pi_2$. Then for any $t>0$ we have
$$\lim_{N_0\to\infty} S^{N_0}_{\gamma_2,\theta}(f,t) = \frac{d}{d\theta}\,\mathbb{E}\left(\hat{f}_\theta\left(\hat{X}_\theta(t)\right)\right), \qquad (3.24)$$
where $\hat{f}_\theta : \Pi_2\mathcal{S}\to\mathbb{R}$ is given by
$$\hat{f}_\theta(v) = \int_{H_v} f(v+z)\,\pi^v_\theta(dz). \qquad (3.25)$$

Remark 3.3 This theorem will also hold if the function $f$ depends on the parameter $\theta$, as long as the dependence is continuously differentiable. This will be evident from the proof of the theorem.
Recall that the reaction dynamics for the original model in the reference time-scale $\gamma_2$ is given by $X^{N_0}_{\gamma_2,\theta}$. If the output of interest is captured by a function $f$, then we are interested in estimating the parameter sensitivity $S^{N_0}_{\gamma_2,\theta}(f,t)$ defined by (1.2). As explained in Section 1, direct estimation of $S^{N_0}_{\gamma_2,\theta}(f,t)$ is often infeasible because simulations of the process $X^{N_0}_{\gamma_2,\theta}$ are prohibitively expensive. However, simulations of the reduced model dynamics $\hat{X}_\theta$ are much cheaper, allowing us to easily estimate the right side of (3.24) using known methods [16,31,34,1,17]. The main message of Theorem 3.2 is that for large values of $N_0$
$$S^{N_0}_{\gamma_2,\theta}(f,t) \approx \frac{d}{d\theta}\,\mathbb{E}\left(\hat{f}_\theta\left(\hat{X}_\theta(t)\right)\right), \qquad (3.26)$$
which allows us to approximately estimate $S^{N_0}_{\gamma_2,\theta}(f,t)$ in a computationally efficient way. Observe that in (3.24), the function $\hat{f}_\theta$ may depend on $\theta$ even if the function $f$ does not. If the stationary distribution $\pi^x_\theta$ is known for each $x\in\Pi_2\mathcal{S}$, then the function $\hat{f}_\theta$ and the propensities $\hat{\lambda}_k$ can be computed analytically. In this case, the simulations of the process $\hat{X}_\theta$ that are needed for estimating the right side of (3.24) can be carried out using the slow-scale Stochastic Simulation Algorithm [5]. If $\pi^x_\theta$ is unknown, then one can use nested schemes [37,6] to estimate $\hat{f}_\theta$ and $\hat{\lambda}_k$ during the simulation runs. In many applications, the "fast" reactions are uninteresting [29,30,18] and they do not alter the output function $f$. In such a scenario we can expect $f$ to be invariant under the projection $\Pi_2$ (that is, $f(x) = f(\Pi_2 x)$ for all $x\in\mathcal{S}$), which would imply that the functions $\hat{f}_\theta$ and $f$ are the same on the space $\Pi_2\mathcal{S}$. Hence we recover (1.3) from Theorem 3.2.
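When $H_v$ is finite and the stationary law is known, the averaged output $\hat f_\theta$ of (3.25) (and likewise the averaged propensities of (3.23)) reduces to a finite sum. A minimal sketch with made-up states, weights and output function:

```python
import numpy as np

def averaged_function(f, v, H_v, pi_v):
    """Average f over the fast states: f_hat(v) = sum_z f(v + z) pi_v(z),
    the finite-H_v analogue of the integral in (3.25)."""
    return sum(f(v + z) * p for z, p in zip(H_v, pi_v))

# Hypothetical setting: the fast component z lives on H_v = {0, 1, 2} with a
# known stationary law pi_v, and the output of interest is f(x) = x^2.
H_v = np.array([0.0, 1.0, 2.0])
pi_v = np.array([1/7, 2/7, 4/7])
f = lambda x: x ** 2
f_hat = averaged_function(f, v=3.0, H_v=H_v, pi_v=pi_v)
```

Here $\hat f(3) = (9 + 2\cdot 16 + 4\cdot 25)/7 = 141/7$; in a simulation of the reduced model, this one-line averaging is all that distinguishes evaluating $\hat f_\theta$ from evaluating $f$ itself.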

Estimation of steady-state parameter sensitivities
We now discuss how relation (1.4) can be derived using our main result. In Section 1 we mentioned the importance of this relation in the context of estimating steady-state parameter sensitivities. Let $\{X_\theta(t) : t\geq 0\}$ be an ergodic $\mathcal{S}$-valued Markov process with generator
$$\mathbb{A}_\theta f(x) = \sum_{k=1}^K \lambda_k(x,\theta)\left(f(x+\zeta_k) - f(x)\right)$$
and stationary distribution $\pi_\theta$. If we define another process $X^N_\theta$ by $X^N_\theta(t) = X_\theta(Nt)$, then $X^N_\theta$ represents the dynamics of a multiscale network with $\beta_k = 1$ for each $k=1,\ldots,K$. For this network, clearly $\gamma_2 = 0$, $\Gamma_2 = \emptyset$ and $\Pi_2\mathcal{S} = \{0\}$. From Theorem 3.2 we obtain
$$\lim_{N\to\infty}\frac{d}{d\theta}\,\mathbb{E}\left(f\left(X_\theta(Nt)\right)\right) = \frac{d}{d\theta}\int f(x)\,\pi_\theta(dx) \qquad (3.27)$$
for any $t>0$. Hence (1.4) immediately follows from (3.27).
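To see why estimating the sensitivity at a large but finite $t$ is sound, consider a toy linear birth-death network ($\emptyset\to X$ at rate $a$, $X\to\emptyset$ at rate $\theta x$, $X(0)=0$). For this linear network the mean $m(t)=\mathbb{E}(X_\theta(t))$ happens to solve the closed ODE $\dot m = a - \theta m$, so both the finite-time sensitivity and the steady-state sensitivity $-a/\theta^2$ can be computed exactly; the network and all constants are ours:

```python
import math

def mean_at_t(a, theta, t):
    """E[X_theta(t)] for the linear birth-death network 0 -> X (rate a),
    X -> 0 (rate theta*x), started at X(0) = 0: the mean solves the
    closed ODE m' = a - theta*m, so m(t) = (a/theta)(1 - exp(-theta*t))."""
    return (a / theta) * (1.0 - math.exp(-theta * t))

a, theta, h, t = 2.0, 0.5, 1e-6, 40.0
# finite-difference sensitivity of the mean at a large but finite time t ...
S_t = (mean_at_t(a, theta + h, t) - mean_at_t(a, theta - h, t)) / (2 * h)
# ... versus the exact steady-state sensitivity d/dtheta (a/theta) = -a/theta^2
S_inf = -a / theta ** 2
```

At $t=40$ the finite-time value already matches the steady-state sensitivity to several digits, which is the practical content of (1.4): simulate (or, here, solve) until a large finite time and differentiate there.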

Sensitivity estimation with multiple reduction steps
We have presented Theorem 3.2 in the setting of Section 2.2, where a single-step reduction procedure was described to obtain a "reduced" model (with dynamics $\hat{X}_\theta$) from the original model (with dynamics $X^{N_0}_{\gamma,\theta}$), in the reference time-scale $\gamma = \gamma_2$. As mentioned in Section 2.3, there are examples of multiscale networks where many steps of model reduction may be required to arrive at a sufficiently simple model. Even in such cases, the main approximation relationship (3.26) that falls out of Theorem 3.2 will continue to hold. To illustrate this point, we now consider an example where two steps of model reduction are needed for sensitivity estimation.
Recall the description of a multiscale network from Section 1. Let $\gamma_1$, $\gamma_2$ and $\gamma_3$ be real numbers such that $\gamma_3 > \gamma_2 > \gamma_1$. Suppose that the sets $\Gamma_1$, $\Gamma_2$ and $\Gamma_3$ form a partition of the reaction set $\{1,\ldots,K\}$, and for each $k\in\Gamma_i$ we have $\beta_k = -\gamma_i$ for $i=1,2,3$. The dynamics of the model in the reference time-scale $\gamma$ is given by the process $X^{N_0}_{\gamma,\theta}$ whose random time change representation is
$$X^{N_0}_{\gamma,\theta}(t) = x_0 + \sum_{i=1}^3\sum_{k\in\Gamma_i} Y_k\left(N_0^{\gamma-\gamma_i}\int_0^t\lambda_k\left(X^{N_0}_{\gamma,\theta}(s),\theta\right)ds\right)\zeta_k, \qquad (3.28)$$
where $\{Y_k : k=1,\ldots,K\}$ is a family of independent unit rate Poisson processes. Clearly this multiscale network has three time-scales $\gamma_1$, $\gamma_2$ and $\gamma_3$. Suppose we want to estimate the sensitivity value $S^{N_0}_{\gamma,\theta}(f,t)$ (given by (1.2)) at the reference time-scale $\gamma = \gamma_3$. Observe that for this time-scale, the reactions in both the sets $\Gamma_1$ and $\Gamma_2$ are "fast", but the reactions in $\Gamma_1$ are "faster" than those in $\Gamma_2$. Ideally we would like to estimate $S^{N_0}_{\gamma_3,\theta}(f,t)$ using a reduced model which only involves reactions in $\Gamma_3$. It is possible to obtain such a reduced model by applying the reduction procedure twice. We now demonstrate that even with this second-order reduced model, the main approximation relationship (3.26) will still hold.
Replacing … in (3.28), we get another process X N0,N γ,θ defined by Certainly for large values of N 0 we have Observe that the process X N0,N γ3,θ can be treated in the same way as the process X N γ2,θ in Theorem 3.2. Suppose that the conditions of this theorem are satisfied. We can construct a projection Π 2 satisfying (2.13) such that the process Π 2 X N0,N γ3,θ has a well-behaved limit as N → ∞. For any v ∈ Π 2 S let π v θ be the stationary distribution for the Markov process with generator C v θ (see (3.21)). Define f θ by (3.25) and for each k ∈ Γ 2 ∪ Γ 3 let λ k be given by (3.23). Using Theorem 3.2 we can conclude that lim where X N0 γ3,θ is the Π 2 S-valued process given by Replacing N 0 by N , we get another process X N γ3,θ which can again be dealt with in the same way as the process Assuming that the conditions of Theorem 3.2 hold, we can construct a projection Π 3 , such that Π 3 Π 2 ζ k = 0 d for all k ∈ Γ 2 , and the process Π 3 X N γ3,θ has a well-behaved limit as N → ∞. For any w ∈ Π 3 Π 2 S, let µ w θ be the stationary distribution for the Markov process with generator where the definition of H w is similar to (2.14). Define for each k ∈ Γ 3 . From Theorem 3.2 we get lim where X θ is the process given by Combining (3.30), (3.31), (3.32) and (3.33), we get that for large values of N 0 This shows that the main approximation relationship (3.26) that arises from Theorem 3.2 holds even with a reduced model obtained after two steps of model reduction. Observe that the reactions in Γ 3 are "natural" for the time-scale γ 3 , and the reduced model corresponding to X θ only consists of these reactions. Hence the process X θ is easy to simulate and S N0 γ3,θ (f, t) can be easily estimated using (3.34).

Proofs
We mentioned in Section 1 that the proof of our main result, Theorem 3.2, requires many steps. We now describe these steps in detail. In Section 4.1 we show some regularity properties of the distributions of weighted occupation times for finite Markov chains with fast parameter-dependent rates. For this, we exploit certain connections between the distribution of weighted occupation times and multi-dimensional wave equations (see [32]). These regularity properties allow us to later argue that the distribution of the weighted occupation times for the "fast" sub-network of our multiscale network is differentiable with respect to θ, and that the derivative operation commutes with the limit N → ∞. In Section 4.2, we construct a "new" process W N θ , which captures the one-dimensional distribution of the process X N γ2,θ , in the sense described in Section 1. The main difference between X N γ2,θ and W N θ is that the dynamics of the fast sub-network is averaged out in the process W N θ , making it easier to work with. In particular, the process W N θ has a well-behaved limit as N → ∞ (see Proposition 4.16), unlike the process X N γ2,θ . The proof of Theorem 3.2 is given in Section 4.3. The main idea of the proof is to couple the processes W N θ and W N θ+h in such a way that we can compute a double-limit of the form lim for some functions f N θ and f N θ+h that depend on our output function f . The results from Section 4.2 will imply that this quantity is equal to the left-hand side of (3.24). On the other hand, using Dynkin's formula (see Lemma 19.21 in [20]) and some coupling arguments, we will show that this quantity is also equal to the right-hand side of (3.24), thereby proving Theorem 3.2.

Weighted occupation times of finite Markov chains
Let {Z(t) : t ≥ 0} be a continuous time Markov chain on a finite state space E = {e 1 , . . ., e m } with generator Here λ 1 , . . ., λ K are positive functions on E. For this Markov chain the Q-matrix (matrix of transition rates) is given by then V (t) is essentially the weighted occupation time of the process Z, where the weight is given by the function Λ. For each i = 1, . . ., m define p i , β i : R + → [0, 1] by Note that β i (t) can be seen as the Laplace transform of the distribution of V (t) on the event Z(t) = e i . Let p(t) and β(t) denote the vectors The definition of matrix Q implies that The next proposition describes the dynamics of β.
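As a concrete illustration of the objects just defined, the weighted occupation time V (t) = ∫ 0 t Λ(Z(s)) ds can also be estimated by direct simulation. The Python sketch below is our own illustration (not part of the paper's method): it simulates a finite-state chain from its Q-matrix and accumulates the weighted occupation time along the path.

```python
import numpy as np

def simulate_weighted_occupation(Q, Lam, t_end, z0=0, rng=None):
    """Simulate a CTMC with Q-matrix Q up to time t_end, starting at state z0.

    Returns (final_state, V) where V = integral of Lam[Z(s)] ds over [0, t_end].
    """
    if rng is None:
        rng = np.random.default_rng()
    z, t, V = z0, 0.0, 0.0
    while True:
        rate = -Q[z, z]                      # total exit rate from state z
        # holding time in state z; infinite if the state is absorbing
        tau = rng.exponential(1.0 / rate) if rate > 0 else np.inf
        if t + tau >= t_end:                 # no further jumps before t_end
            V += Lam[z] * (t_end - t)
            return z, V
        V += Lam[z] * tau
        t += tau
        # choose the next state proportionally to the off-diagonal rates
        p = Q[z].astype(float).copy()
        p[z] = 0.0
        p /= rate
        z = rng.choice(len(p), p=p)
```

Averaging e −V (t) 1{Z(t) = e i } over many such paths gives a Monte Carlo estimate of β i (t); the proposition below instead characterizes β exactly through an ordinary differential equation.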
Proof. Let r 1 , . . ., r l be l distinct values in the set {Λ(e 1 ), . . ., Λ(e m )}, arranged in ascending order. For each i = 1, . . ., (l − 1) let The random variable V (t) (given by (4.35)) can only take values between r 1 t and r l t. Hence It has been shown in [32] that the distribution of the real-valued random variable V (t) is continuous in the interval [r 1 t, r l t], except at the points r 1 t, . . ., r l t. Whenever x = r j t for some j = 1, . . ., l, the function F i has a discontinuity of size Moreover, the event for any g : R + → R + . It is shown in [32] that on the set R = {(t, x) : t > 0 and x ∈ (r j−1 t, r j t), j = 2, . . ., l}, each F i is continuously differentiable and the family of functions {F i : i = 1, . . ., m} satisfies the following system of multi-dimensional wave equations For each i = 1, . . ., m we can write β i (t) as Substituting the above expression in (4.40) we obtain where p i (t) = P(Z(t) = e i ).
For i = 1, . . ., m, the functions p i and F i (•, x) are differentiable (see (4.36) and (4.39)). Hence the function β i is also differentiable. Taking the derivative with respect to t in (4.41) yields From (4.37) and (4.38) it follows that l j=2 r j e −rj t F i (t, r j t) − r j−1 e −rj−1t F i (t, r j−1 t) Therefore From (4.39) we get Due to (4.36), the last term is 0 and hence Substituting this expression in (4.43) yields This completes the proof of the proposition.
Using the above proposition, we now establish some regularity properties of the distributions of weighted occupation times for finite Markov chains with fast parameter-dependent rates. Let {Z N θ (t) : t ≥ 0} be a continuous time Markov chain on E = {e 1 , . . ., e m } with generator given by where the function θ → λ k (z, θ) is continuously differentiable for each k and z ∈ E. For this Markov chain, the matrix of transition rates is given by N Q θ where We assume that this Markov chain is ergodic. Then its unique stationary distribution π θ is a left eigenvector for Q θ corresponding to the eigenvalue 0. Hence Remark 4.2 Due to the ergodicity assumption, the matrix Q θ has 0 as a simple eigenvalue and all its other eigenvalues have strictly negative real parts.
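Numerically, the stationary distribution π θ can be obtained as the normalized left null vector of Q θ . A minimal sketch, assuming Q θ is available as a NumPy array (our illustration, not the paper's code):

```python
import numpy as np

def stationary_distribution(Q):
    """Return pi solving pi Q = 0 with sum(pi) = 1, for an ergodic Q-matrix.

    Since Q has rank m - 1 for an ergodic chain, one balance equation is
    redundant; we replace it by the normalization constraint.
    """
    m = Q.shape[0]
    A = Q.T.astype(float).copy()
    A[-1, :] = 1.0              # replace the last equation with sum(pi) = 1
    b = np.zeros(m)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

For a two-state chain with rates a (from e 1 to e 2 ) and b (from e 2 to e 1 ), this returns the familiar (b/(a + b), a/(a + b)).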
For a function Λ : and let for each i = 1, . . ., m. From Proposition 4.1 it follows that the function where D θ is the m × m diagonal matrix with entries Λ(e 1 , θ), . . ., Λ(e m , θ). We now define a condition on sequences of functions on R + . The main result of this section is given as the next proposition.
where Proof. We start by defining some notation that will be useful in the proof. We say that an R m -valued sequence {a N : N ∈ N} belongs to the class O(N −m ) for some m ∈ N 0 if and only if sup For two such sequences {a N : N ∈ N} and {b N : N ∈ N}, we will say that a For the proof, we can assume without loss of generality that for each N , Z N θ (0) = e i0 for some i 0 = 1, . . ., m. This implies that β N θ (0) = (0, . . ., 0, 1, 0, . . ., 0), where the 1 is in place i 0 . Hence To prove the proposition it is sufficient to show that both h N θ and ∂h N θ /∂θ satisfy Condition 4.3.
From (4.46) we obtain where I m is the m × m identity matrix. Consider the matrix B N θ = Q T θ − N −1 D θ , which can be seen as a small perturbation of Q T θ for large values of N . The eigenvalues of B N θ are slightly perturbed with respect to the eigenvalues of Q T θ (see [33]). We know that the matrix Q T θ has 0 as a simple eigenvalue (see Remark 4.2) and the corresponding left eigenvector is 1 m . From Theorem 2.7 in [33], we can conclude that B N θ has an eigenvalue λ N θ with corresponding left eigenvector v N θ , where λ N θ and v N θ have the form Taking the inner product with v N θ in (4.50) we get ). Therefore we can write where {a N θ }, {b N θ } are sequences in O(N −1 ). Using (4.48) we obtain which also implies that sup This allows us to write Let C θ be the (m − 1) × (m − 1) matrix whose ij-th entry is given by If we define where v is some vector in R m−1 . The matrix Q θ has a simple eigenvalue at 0 and all its other eigenvalues have strictly negative real parts (see Remark 4.2). This shows that the matrix C θ is stable. Let h N θ (t) and π θ be the vectors containing the first (m − 1) components of h N (t) and π θ . Also let D θ be the (m − 1) × (m − 1) diagonal matrix with entries λ(e 2 , θ), . . ., λ(e m , θ). From (4.50) we get Let C N θ be the matrix given by The stability of the matrix C θ implies that there exists an α > 0 such that for any t ≥ 0 and N The exact solution of (4.56) is This along with (4.54) shows that the function h N θ satisfies Condition 4.3, where ǫ N = 1/ √ N . Let H N θ : R + → R m be defined by Differentiating (4.50) with respect to θ we get Note that where the last equality is true because . Then G N θ satisfies an ordinary differential equation of the form where the sequences {e N θ }, {g N θ } are in O(N −1 ) and the sequence f N θ is in O(1). Gronwall's inequality, along with (4.59) and (4.51), implies that sup If C N θ is the matrix given by (4.57), then we can solve for H N θ as Proof. The proof is immediate from (4.54) and (4.60).
We end this section with an important observation.
Remark 4.7 To prove Proposition 4.4 we used results from the theory of perturbation of finite matrices. Consider the situation where the state space E of the Markov chain is countably infinite. Now the matrix of transition rates Q θ is infinite and it can be seen as a linear operator on E. Proposition 4.1 will still hold in this case and, assuming the existence of a suitable Lyapunov function (see [28]) for the Markov chain, one can use results from the perturbation theory of linear operators (see [10]) to prove Proposition 4.4 in a similar way.

Construction of a new process
In this section we construct a new process W N θ and study some of its properties. As mentioned before, this process captures the one-dimensional distribution of X N γ2,θ (see Section 1) and its dynamics does not involve any "fast" transitions. We begin by making a remark which will simplify the proof of Theorem 3.2.
Remark 4.8 Recall the description of the limiting process X θ from the statement of Theorem 3.2. Note that this process corresponds to a reduced model which does not contain any reactions in the set This suggests that we can prove Theorem 3.2 under the assumption that (Γ 1 ∪ Γ 2 ) c is empty. If this is not the case, then our proof can be adjusted easily to account for the reactions in (Γ 1 ∪ Γ 2 ) c . We will also set γ 2 = γ 1 + 1, which can be ensured by redefining N if necessary.
From now on we will always assume that γ 2 = γ 1 + 1 and (Γ 1 ∪ Γ 2 ) c = ∅. Under these assumptions the random time change representation of {X N γ2,θ (t) : t ≥ 0} is given by From (2.13) we know that ζ s k = 0 d for each k ∈ Γ 1 . If we define two processes X N S,θ and X N F,θ by then their random time change representations are given by Remark 4.9 These representations show that between the successive jump times of X N S,θ , if the state of this process is v, then the process X N F,θ evolves like a Markov process with state space H v and generator N C v θ , where C v θ is given by (3.21). The above remark motivates the construction of the process W N θ . Before we describe this construction we need to define certain quantities. Let λ 0 (x, θ) = k∈Γ2 λ k (x, θ) and for any k where {Z N θ (t) : t ≥ 0} is an independent Markov process with initial state z and generator N C v θ . For any e ∈ H v define where Proof. Observe that Integrating both sides with respect to t and then exponentiating proves part (A). From (4.64) we get and this proves part (B).
Part (B) of Lemma 4.10 shows that for any k ∈ Γ 2 , v ∈ Π 2 S, z ∈ H v and t ≥ 0, we can regard Θ N k,θ (t, v, z, •) as a probability measure on H v . We know that H v is a finite set. From now on, whenever we write H v = {e 1 , . . ., e m }, we will assume that the elements are arranged in the lexicographical order on R d . For any u ∈ (0, 1) define Then an H v -valued random variable with distribution Θ N k,θ (t, v, z, •) can be generated by transforming a Unif(0, 1) random variable u with the function ̥ N k,θ (t, v, z, •). The next lemma will be useful in proving the main result. Lemma 4.11 Fix a v ∈ Π 2 S, z ∈ H v and t > 0. Let H v = {e 1 , . . ., e m } and let u be a Unif(0, 1) random variable. Pick i, j ∈ {1, . . ., m} such that i = j. Then Proof. To prove this lemma we can assume that Θ N k,θ (t, v, z, e) > 0 for each e ∈ H v . Let H v = {e 1 , . . ., e m } and for any l = 1, . . ., m define Note that A m (θ) = 1 for any θ due to part (B) of Lemma 4.10. For convenience let A 0 (θ) = 0 for any θ. For small values of h we can write Since Θ N k,θ (t, v, z, e l ) > 0 for each l = 1, . . ., m, this probability is 0 if j > i + 1 or j < i − 1. Assume that j = i + 1 for i < m. Then for small values of h we can write Therefore Similarly for j = i − 1 and i > 1 we can show that Combining the last two relations proves the lemma.
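The map ̥ N k,θ (t, v, z, •) is the standard inverse-CDF transform for a finite distribution: u is mapped to the state e l whose cumulative probability interval contains u. A small generic sketch of this transform (our names, not the paper's notation):

```python
def inverse_cdf_sample(u, probs):
    """Map u in (0, 1) to the index l with A_{l-1} < u <= A_l,
    where A_l = probs[0] + ... + probs[l] (and A_{-1} = 0)."""
    cum = 0.0
    for l, p in enumerate(probs):
        cum += p
        if u <= cum:
            return l
    return len(probs) - 1   # guard against floating-point round-off
```

With u ~ Unif(0, 1) the returned index is distributed according to probs, which is exactly how an H v -valued variable with law Θ N k,θ (t, v, z, •) is generated from a single uniform; feeding the same u into the transforms for two nearby parameter values is what makes the coupling in Section 4.3 work.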
The new process W N θ will be a Markov process on the state space S given by Let Π S be the projection map from S to Π 2 S defined by We now define a class C of bounded real-valued functions over S by Let {W N θ (t) : t ≥ 0} be the S-valued Markov process with initial state (0, v 0 , z 0 ) = (0, Π 2 x 0 , (I − Π 2 )x 0 ) and generator given by The existence and uniqueness of the process W N θ is a direct consequence of the well-posedness of the martingale problem for B N θ , which is verified in Lemma A.2. In the rest of this section we study some properties of the process W N θ . Observe that the definition of S (see (4.70)) allows us to write where τ N θ , V N θ and Z N θ are processes with state spaces R + , Π 2 S and ∪ v∈Π2S H v respectively. Let σ N i denote the i-th jump time of the process W N θ for i = 1, . . . . We define σ N 0 = 0 for convenience. From the form of the generator B N θ it is immediate that between the jump times, τ N θ increases linearly at rate 1 while V N θ and Z N θ remain constant. Hence

(4.75)
Let η i be the Γ 2 -valued random variable that denotes the direction of the jump at time σ N i and let ξ i be the random variable given by Z N θ (σ N i −). The form of B N θ allows us to compute the distributions of the random variables (σ N i − σ N i−1 ), η i and ξ i from the values of V N θ (σ N i−1 ) and Z N θ (σ N i−1 ). Let E i (v, z) denote the event Then given for any i = 1, 2, . . . .

Remark 4.12
The preceding discussion suggests a simple scheme to construct the process with generator B N θ and initial state (0, v 0 , z 0 ). Consider the random time change representation where {Y k : k ∈ Γ 2 } is a family of independent unit rate Poisson processes. The processes τ N θ , V N θ and Z N θ can be constructed as follows. For each i ∈ N 0 let σ N i be the i-th jump time of the process V N θ , where σ N 0 = 0. Defining (τ N θ (0), V N θ (0), Z N θ (0)) = (0, v 0 , z 0 ) constructs the process W N θ until time σ N 0 . Assume that this process has been constructed until time σ N i−1 for some i = 1, 2, . . . . Then the next jump time σ N i can be evaluated from (4.79) and the process W N θ can be defined on the time interval [σ N i−1 , σ N i ). We then choose random variables η i and ξ i according to the distributions (4.77) and Θ N ηi,θ (t, v, z, •) respectively and define This completes the construction of the process until the next jump time σ N i . Proceeding this way we can define (τ N θ (t), V N θ (t), Z N θ (t)) for all t ≥ 0. The relation (4.78) ensures that the process W N θ has generator B N θ . In the next proposition we show that the one-dimensional distribution of the process X N γ2,θ can be captured with the process W N θ .
Proposition 4.13 For i ∈ N, let δ N i and σ N i denote the i-th jump times of the processes Π 2 X N γ2,θ and W N θ respectively. We define δ N 0 = σ N 0 = 0 for convenience. Then we have the following.
(A) Let the processes V N θ and Z N θ be related to the process W N θ by (4.74). For each i = 0, 1, 2, . . ., Then for any t ≥ 0 where f θ : S → R is the function given by Proof. We prove part (A) by induction on i. Relation (4.80) certainly holds for i = 0. Suppose it holds for (i − 1) for some i ∈ N. Then where the processes X N S,θ and X N F,θ are given by (4.61). For any v ∈ Π 2 S and z ∈ H v let E i−1 (v, z) denote the event Let η i be the Γ 2 -valued random variable that gives the jump direction of the process X N S,θ at time δ N i . For any t > 0, k ∈ Γ 2 and e ∈ H v we can write

(4.84)
Let {Z N θ (t) : t ≥ 0} be an independent Markov process with initial state z and generator N C v θ . For each k ∈ Γ 2 let u k be an independent Unif(0, 1) random variable. Using the observation made in Remark 4.9 and the random time change representation (4.62), we can write where o(h) denotes any quantity which, upon division by h, goes to 0 as h → 0. To obtain (4.85) we integrated with respect to the joint density of {u k : k ∈ Γ 2 }. Note that due to (4.65) and (4.66) we get Hence relations (4.84) and (4.85) yield From (4.78) it follows that for all v ∈ Π 2 S and z ∈ H v This relation and (4.82) imply that which completes the proof of part (A). We now prove part (B). From Remark 4.14 and Lemma A.2 we can conclude that for any t ≥ 0 sup Moreover, one can rework the proof of part (C) of Lemma A.1 to show that sup Let {F t } be the filtration generated by the process {X N γ2,θ (t) : t ≥ 0}. Then we can write For any v ∈ Π 2 S and z ∈ H v , let E i−1 (v, z) be the event given by (4.83). Suppose H v = {e 1 , . . ., e m } and {Z N θ (t) : t ≥ 0} is an independent Markov process with initial state z and generator N C v θ . For each k ∈ Γ 2 let u k be an independent Unif(0, 1) random variable. Using the observation made in Remark 4.9 and the random time change representation (4.62), for any s < t we can write The last inequality is obtained by integrating with respect to the joint density of {u k : k ∈ Γ 2 }. Due to (4.65) and (4.81) we obtain Substituting this relation in (4.86) and using part (A) gives us However from (4.75) and (4.76) we can conclude that This proves part (B) of the proposition.
Part (B) of Assumption 3.1 says that a Markov process with generator C v θ is ergodic and its unique stationary distribution is π v θ ∈ P(H v ). Since H v is finite, we can view π v θ as a vector in R n where n = |H v |. The differentiability of π v θ with respect to θ follows from the arguments given in Section 4.1. Let {f N : N ∈ N} be a sequence of real-valued functions on R + and let c be a constant. In the next lemma we will use the notation f N → c to denote that the sequence of functions {( f N − c) : N ∈ N} satisfies Condition 4.3.
where λ k is defined by (3.23).
(C) Fix a function f : S → R. Let f N θ and f θ be given by (4.81) and (3.25) respectively. Then Proof. Assume that H v = {e 1 , . . ., e m }. For each l = 1, . . ., m, let β N θ,l : R + → R be given by where From Corollary 4.6 we get that for any T > 0 lim and lim Using part (A) of Lemma 4.10 we can write
From Proposition 4.4 we can see that each β N θ,l satisfies Condition 4.3. This fact along with (4.87) and (4.88) proves part (A).
The proof of part (B) is immediate from the definition of Θ N k,θ (see (4.66)), part (A), (4.87) and (4.88). Note that f N θ can be written as , which enables us to prove part (C) in the same way as part (A).
For the next proposition, recall the definition of the projection map Π S from (4.71) and the definition of A θ from (3.22). Proposition 4.16 Fix (t 0 , v 0 , z 0 ) ∈ S and let W N θ be the Markov process with generator B N θ and initial state (t 0 , v 0 , z 0 ). Then the sequence of processes {W N θ : N ∈ N} is tight in the space D S [0, ∞). Let W θ be a limit point of this sequence and let X θ be the process with generator A θ and initial state v 0 . Then the process Π S W θ has the same distribution as the process X θ . Remark 4.17 Note that this proposition proves that Π S W N θ ⇒ X θ as N → ∞.
Proof. The tightness of the sequence of processes {W N θ : N ∈ N} is argued in Lemma A.2. Let the process W θ be a limit point of this sequence. For any function g ∈ B(Π 2 S), define another function f : S → R by Then the function f is in the class C (see (4.72)) and the action of B N θ (see (4.73)) on f is given by This shows that the following is a martingale Since g is bounded, Lemma 4.15, the continuous mapping theorem and Lemma A.2 imply that as N → ∞, we have m N g ⇒ m g where is also a martingale. This shows that {Π S W θ (t) : t ≥ 0} satisfies the martingale problem for the operator A θ (given by (3.22)). Moreover Π S W θ (0) = X θ (0) = v 0 . Since the martingale problem for A θ is well-posed, the process Π S W θ has the same distribution as the process X θ and this proves the proposition.

Proof of Theorem 3.2
We now have all the tools to prove our main result. But first we need to define some quantities and provide some preliminary results. For any function f : S → R, (t 0 , v 0 , z 0 ) ∈ S and t ≥ 0 define where {W N θ (t) : t ≥ 0} is the process with generator B N θ (see (4.73)) and initial state (t 0 , v 0 , z 0 ). Similarly for any function g : where { X θ (t) : t ≥ 0} is the process with generator A θ (see (3.22)) and initial state v 0 . Now consider a function f : S → R which is polynomially growing with respect to the projection Π 2 . Corresponding to this function define f N θ : S → R by (4.81) and f θ : Π 2 S → R by (3.25). Remark 4.14 and Lemma A.2 imply that for any T > 0 sup If σ is a stopping time with respect to the filtration generated by W N θ , then due to part (E) of Lemma A.2 we have Proposition 4.16 shows that the sequence of processes {W N θ : N ∈ N} is tight and Π S W N θ ⇒ X θ as N → ∞ (see Remark 4.17). This fact along with part (C) of Lemma 4.15 proves that for any T > 0 lim where ǫ N = 1/ √ N . Observe that the right-hand side of (3.24) can be written as where X θ and X θ+h are processes with initial state v 0 = Π 2 x 0 and generators A θ and A θ+h respectively. This shows that we can write S θ (f θ , t) as provided that the two limits exist. If ∂f θ /∂θ is the partial derivative of f θ with respect to θ, then for any This shows that the first limit in (4.94) is just Using coupling arguments, we proved in [17] that the second limit in (4.94) is given by where

(4.98)
We now come to the proof of our main result, where we establish (4.98). The arguments used in the proof are motivated by the analysis in [17].
Proof. [Proof of Theorem 3.2] For the initial state x 0 let v 0 = Π 2 x 0 and z 0 = (I − Π 2 )x 0 . Let X N θ and X N θ+h be Markov processes with initial state x 0 and generators A N γ2,θ and A N γ2,θ+h respectively. Similarly let W N θ and W N θ+h be Markov processes with initial state (0, v 0 , z 0 ) and generators B N θ and B N θ+h respectively. From part (B) of Proposition 4.13 we know that For any (t, v, z) ∈ S, f N θ (t, v, z) is a continuously differentiable function of θ. Hence we can write This expansion along with (4.99) gives us where Proposition 4.16 shows that the sequence of processes {W N θ : N ∈ N} is tight and if W θ is a limit point then the process Π S W θ has the same distribution as the process X θ . This fact along with part (C) of Lemma 4.15 shows that for t > 0 lim In order to compute the limit of S N,2 θ (f, t) as N → ∞, we will couple the processes W N θ and W N θ+h in a special way. We need to define certain quantities to describe the coupling. We define the processes V N θ and V N θ+h by the following random time change representations where {Y k , Y k , Y : k ∈ Γ 2 } is a family of independent unit rate Poisson processes. To V N θ (V N θ+h ) we associate processes τ N θ (τ N θ+h ) and Z N θ (Z N θ+h ) as in Remark 4.12. The above representations couple the processes V N θ and V N θ+h . For each i ∈ N, let σ 1 i (σ 2 i ) be the i-th jump time of the process V N θ (V N θ+h ) and let η 1 i (η 2 i ) be the jump direction of the process V N θ (V N θ+h ) at time σ 1 i (σ 2 i ). Define σ 1 0 = σ 2 0 = 0.
Fix a sequence {u i : i ∈ N} of independent Unif(0, 1) random numbers. We couple the processes Z N θ and Z N θ+h by applying the function ̥ N (defined by (4.69)) with the same u i for each i. Note that we are using the same u i in the definition of Z N θ (σ 1 i ) and Z N θ+h (σ 2 i ). Define W N θ and W N θ+h by One can verify that the processes W N θ and W N θ+h have initial state (0, v 0 , z 0 ) and generators B N θ and B N θ+h respectively. Let γ N h be the stopping time given by Then the coupling of the processes W N θ and W N θ+h ensures that γ N h → ∞ a.s. as h → 0. Define and Note that f N θ (0, v 0 , z 0 ) = f (x 0 ). Using (4.92) we can write Using Taylor's expansion, for any f ∈ C and (t, v, z) ∈ S we get
Proposition 4.16 shows that the sequence of processes {W N θ : N ∈ N} is tight and if W θ is a limit point then the process Π S W θ has the same distribution as the process X θ . This fact along with Lemma 4.15 implies that lim Our next goal is to compute lim N →∞ B N θ . Recall the definitions of Ψ N f,θ and Ψ f,θ from (4.89) and (4.90) respectively. For i = 1, 2, let (t i , v i , z i ) ∈ S. Define an event and let

(4.113)
Let ǫ N = 1/ √ N . From (4.92) and the strong Markov property, we can deduce that for any 0 < s < t where the last equality is due to (4.93).
Recall the random time change representations (4.104) and (4.105). For each i ∈ N, let σ N i be the i-th jump time of the process C N θ defined by

(4.115)
We now show that the term B N,1 θ,i converges to 0 as N → ∞. Note that the event {σ i be the Γ 2 -valued random variable which gives the direction of the jump in Assumption 3.1 implies that the right-hand side is a polynomially growing function with respect to the projection Π S (see Definition 2.1). Given the events L i (δ, v, z, k) and Recall the definition of R N θ,h from (4.113). For any δ < s < t we can write lim k ) for some k ∈ Γ 2 . Let η be the Γ 2 -valued random variable which gives the direction of the jump in V N θ or V N θ+h at time γ N h . Define a random variable and an event where From (4.114) one can verify that lim where λ 0 (v, θ) = k∈Γ2 λ k (v, θ). Due to our coupling, as h → 0, the process W N θ+h converges a.s. to the process W N θ and hence γ N h → ∞ a.s. Proposition 4.16 and Remark 4.17 show that as N → ∞ we have V N θ ⇒ X θ , where X θ is the limiting process in Theorem 3.2. This convergence and (4.122) yield the following lim where σ i is the i-th jump time of the process X θ (with σ 0 = 0) and the function R k,θ is given by (4.97).
Note that the quantity on the right-hand side is 0 if σ i−1 ≥ t. Using (4.115) and (4.118) we get lim This relation along with (4.100), (4.103), (4.109) and (4.111) gives us lim which is the same as (4.98), and this completes the proof of the theorem.

An Illustrative Example
In this section we present a simple example to illustrate how our main result, Theorem 3.2, can be useful for the estimation of parameter sensitivities for multiscale networks. Consider a chemical reaction network with three species S 1 , S 2 and S 3 , and three reactions given by The rate constant of the i-th reaction is c i , for i = 1, 2, 3. Such a network is used to model the cellular heat-shock response in [7], where S 1 , S 2 and S 3 correspond to the σ 32 −DnaK complex, the σ 32 heat shock regulator and the σ 32 -RNAP complex, respectively. In this example, the first and second reactions are much faster than the third reaction. We assume that the rate constants are given by c 1 = 1, c 2 = 2 and c 3 = 5 × 10 −4 .
We choose our sensitive parameter to be θ = c 1 = 1 and the large normalization parameter to be N 0 = 10 4 .The three reactions along with their scaling factors (β k 's), propensity functions (λ k 's) and their stoichiometric vectors (ζ k 's) are presented in Table 1.
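Since Table 1 did not survive extraction, the sketch below assumes the stoichiometry suggested by the description above: reactions 1 and 2 interconvert S 1 and S 2 , and reaction 3 converts S 2 to S 3 (this is consistent with the conservation of x 1 + x 2 + x 3 and with Π 2 ζ 3 = (−1, 1) stated below, but it is our assumption, not a quotation of Table 1). With that assumption, Gillespie's direct method for this network reads:

```python
import numpy as np

# Assumed (hypothetical) stoichiometry and mass-action propensities:
#   1: S1 -> S2  (rate c1 * x1),  2: S2 -> S1  (rate c2 * x2),
#   3: S2 -> S3  (rate c3 * x2).
STOICH = np.array([[-1,  1, 0],   # reaction 1
                   [ 1, -1, 0],   # reaction 2
                   [ 0, -1, 1]])  # reaction 3

def propensities(x, c):
    return np.array([c[0] * x[0], c[1] * x[1], c[2] * x[1]], dtype=float)

def ssa(x0, c, t_end, rng=None):
    """Gillespie's direct method: return the state at time t_end."""
    if rng is None:
        rng = np.random.default_rng()
    x, t = np.array(x0, dtype=int), 0.0
    while True:
        a = propensities(x, c)
        a0 = a.sum()
        if a0 == 0.0:            # no reaction can fire any more
            return x
        t += rng.exponential(1.0 / a0)
        if t >= t_end:
            return x
        k = rng.choice(3, p=a / a0)
        x = x + STOICH[k]
```

With c = (1, 2, 5 × 10 −4 ) and x0 = (20, 0, 0), every path conserves x 1 + x 2 + x 3 = 20; note that the fast reactions 1 and 2 dominate the simulation cost, which is precisely the problem the reduced model avoids.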
Let X N0 θ (t) = (X N0 θ,1 (t), X N0 θ,2 (t), X N0 θ,3 (t)) : t ≥ 0 be the stochastic process representing the dynamics of this multiscale reaction network. Hence for any time t ≥ 0 and i = 1, 2, 3, X N0 θ,i (t) denotes the number of molecules of S i . Suppose that the initial state of the system is X N0 θ (0) = (v 0 , 0, 0) for v 0 = 20. Note that the sum of the three species numbers is preserved by all the reactions. Hence the state space for the process X N0 θ is S = (x 1 , x 2 , x 3 ) ∈ N d 0 : x 1 + x 2 + x 3 = v 0 . Clearly for this multiscale network, the first time-scale is γ 1 = 0 (see Section 2.1) and the corresponding set of "natural" reactions is Γ 1 = {1, 2}. Similarly the second time-scale is γ 2 = −1 (see Section 2.2) and the corresponding set of "natural" reactions is Γ 2 = {3}. If the time-scale of reference is γ 2 then the dynamics is given by the Markov process X N γ2,θ with generator A N γ2,θ (see (3.20)) with N = N 0 . As described in Section 2.2, under certain conditions we can construct a projection Π 2 for which the process Π 2 X N γ2,θ has a well-behaved limit as N → ∞. In this example, this projection is given by Π 2 (x 1 , x 2 , x 3 ) = (x 1 + x 2 , x 3 ). Note that Π 2 ζ k = (0, 0) for each k ∈ Γ 1 and Π 2 ζ 3 = (−1, 1). For any v = (v 1 , v 2 ) ∈ Π 2 S, define the space H v (see (2.14)) by Let { X θ (t) = ( X θ,1 (t), X θ,2 (t)) : t ≥ 0} be the Π 2 S-valued process with the following random time change representation where Y is a unit rate Poisson process. Then due to Proposition 2.5 we have Π 2 X N γ2,θ ⇒ X θ as N → ∞. Let f : R 3 → R be the function given by f (x 1 , x 2 , x 3 ) = x 3 , and suppose we want to estimate Note that f (x) = f (Π 2 x) for all x ∈ S, and hence the function f θ (given by (3.25)) coincides with the function f on the set Π 2 S.
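Under the same assumed stoichiometry as above, the averaged propensity of the slow reaction works out to λ 3 (v) = c 3 v 1 θ/(θ + c 2 ): the fast S 1 ⇌ S 2 exchange has a binomial stationary law on H v whose mean number of S 2 molecules is v 1 θ/(θ + c 2 ). This formula is our own assumption-dependent computation, not a quotation of the paper; with it, the reduced process X θ can be simulated with one exponential draw per slow jump:

```python
import numpy as np

def simulate_reduced(v0, theta, c2, c3, t_end, rng=None):
    """Simulate the reduced process (v1, v2) = (X1 + X2, X3) up to t_end,
    firing only the slow reaction at the averaged rate
    lam3_hat(v) = c3 * v1 * theta / (theta + c2)   # assumed stoichiometry
    """
    if rng is None:
        rng = np.random.default_rng()
    v1, v2, t = v0, 0, 0.0
    while v1 > 0:
        rate = c3 * v1 * theta / (theta + c2)
        t += rng.exponential(1.0 / rate)
        if t >= t_end:
            break
        v1 -= 1          # one molecule leaves the fast pool {S1, S2}
        v2 += 1          # and becomes S3
    return v1, v2
```

Each simulated jump now costs a single exponential draw instead of the thousands of fast jumps of the full model, which is the source of the computational savings reported in Tables 2 and 3.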
Therefore from Theorem 3.2 we obtain for large values of N 0 . We now demonstrate the usefulness of (5.123) in estimating S N0 γ2,θ (f, t). We will numerically show that S N0 γ2,θ (f, t) and S θ (f, t) are "close" to each other and that the estimation of S θ (f, t) is far less computationally demanding than the estimation of S N0 γ2,θ (f, t). To estimate parameter sensitivities we will use the coupled finite difference (CFD) scheme developed in [1]. In this method, the sensitivity value S N0 γ2,θ (f, t) is estimated by a finite difference of the form for a small h, and the processes X N0 γ2,θ+h and X N0 γ2,θ are coupled together in a special way to reduce the variance of the associated estimator. Replacing the derivative by a finite difference introduces a bias in the sensitivity estimate, but we will ignore this issue here. Using CFD, we estimate S N0 γ2,θ (f, t) and S θ (f, t), with h = 0.01, t = 1, N 0 = 10 4 , θ = 1 and v 0 = 20. The results are reported in Table 2. The sensitivity values are written in the form s ± l, which means that the 95% confidence interval of the estimated value is [s − l, s + l].
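The CFD scheme of [1] couples the two parameter values through a split random time change; reproducing it faithfully is beyond a short sketch, so the code below shows only the cruder common-random-numbers variant of the finite-difference estimator, to convey the overall shape of the computation. Here `simulate(theta, t, rng)` is a placeholder for any simulator of the model (e.g. an SSA routine), not an API from [1].

```python
import numpy as np

def fd_sensitivity(simulate, f, theta, h, t, n_samples, seed=0):
    """Finite-difference sensitivity estimate of
        (E[f(X_{theta+h}(t))] - E[f(X_theta(t))]) / h,
    reusing the same random stream for both parameter values (common
    random numbers -- a crude coupling; the CFD scheme of [1] couples
    the two paths more tightly to reduce variance further)."""
    diffs = np.empty(n_samples)
    for i in range(n_samples):
        rng_plus = np.random.default_rng(seed + i)   # identical seeds:
        rng_base = np.random.default_rng(seed + i)   # coupled randomness
        diffs[i] = (f(simulate(theta + h, t, rng_plus))
                    - f(simulate(theta, t, rng_base))) / h
    est = diffs.mean()
    half_width = 1.96 * diffs.std(ddof=1) / np.sqrt(n_samples)
    return est, half_width   # estimate and 95% confidence half-width
```

The returned pair (est, half_width) corresponds to the s ± l format used in Tables 2 and 3; in practice one keeps drawing samples until half_width ≤ 0.05 |est|, as described below.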
For each estimation we use the minimum number of samples needed to ensure that l ≤ 0.05|s|, where | • | is the absolute value function. In the table, we also indicate the CPU time (in seconds) that was needed for the estimation. The CPU time can be taken as a measure of the computational effort that was required to estimate the sensitivity value. Table 2 shows that relation (5.123) holds, but the time needed to estimate S θ (f, t) is far smaller than the time needed to estimate S N0 γ2,θ (f, t). We next estimate S N0 γ2,θ (f, t) for f : R 3 → R given by f (x 1 , x 2 , x 3 ) = x 1 .
In this case, f θ : Π 2 S → R can be computed as Hence Theorem 3.2 implies that As before we estimate S N0 γ2,θ (f, t) and S θ (f θ , t) using CFD, with h = 0.01, t = 1, N 0 = 10 4 , θ = 1 and v 0 = 20. The results are reported in Table 3. As before, Table 3 shows that S N0 γ2,θ (f, t) ≈ S θ (f θ , t), but the estimation of S θ (f θ , t) requires far less computational effort. This example clearly illustrates that our main result, Theorem 3.2, can be used to obtain enormous savings in the computational effort that is required for the estimation of parameter sensitivities for multiscale networks.
Lemma A.1 Fix a w 0 = (x 0 , y 0 ) ∈ S 1 × S 2 and let δ w0 ∈ P(S 1 × S 2 ) be the distribution that puts all the mass at w 0 .For any M ∈ N, let U M be the open set U M = {(x, y) ∈ S 1 × S 2 : x < M }.
Assume that the stopped martingale problem for (A, δ w0 , U M ) has a unique solution W M for each M . Let τ M be the stopping time defined by (A.2) with U replaced by U M . Then we have the following. (D) The martingale problem for A is well-posed.
Proof. Suppose that W_M(t) = (X_M(t), Y_M(t)) for all t ≥ 0, where X_M and Y_M are processes with state spaces S_1 and S_2 respectively. Let q = max{1, ‖ζ_k‖ : k = 1, …, K}. For a large M and a positive integer p, define g ∈ B(S_1) by g(x) = ‖x‖^p ∧ (M + q)^p.
Assume that ‖x_0‖ < M and note that the definition of g implies that g(X_M(t)) = ‖X_M(t)‖^p for t ≤ τ_M. Let f : S_1 × S_2 → R be the function given by f(x, y) = g(x). Then f ∈ D(A), and hence

Assumption 3.1 (A) Parts (A) and (C) of Assumption 2.4 are satisfied. In addition, the mapping v → |H_v| is polynomially growing (see Definition 2.1).

Proposition 4.1 The function β satisfies the following ordinary differential equation:

Condition 4.3 For each N ∈ N, let f_N be a function from R_+ to R^m and let ε_N = 1/√N. The sequence of functions {f_N : N ∈ N} satisfies this condition if for any T > 0, lim_{N→∞} sup_{t∈[ε_N, T]} ‖f_N(t)‖ = 0.
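As an illustration (the sequence is our own toy example, and we read the truncated display as requiring sup_{t∈[ε_N, T]} ‖f_N(t)‖ → 0), take f_N(t) = e^{−Nt}: on [ε_N, T] the supremum is attained at t = ε_N = 1/√N and equals e^{−√N} → 0, even though the supremum over all of [0, T] stays equal to 1. The point of excluding the initial layer [0, ε_N] is precisely to discard such boundary-layer behavior near t = 0.

```python
import math

# f_N(t) = exp(-N t); on [eps_N, T] with eps_N = 1/sqrt(N) the supremum
# is exp(-N * eps_N) = exp(-sqrt(N)), which tends to 0 as N grows,
# while the supremum over the whole interval [0, T] is always 1 (at t = 0).
sups = [math.exp(-math.sqrt(N)) for N in (100, 10_000, 1_000_000)]
```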

where "=_d" denotes equality in distribution. (B) Let f : S → R be a polynomially growing function with respect to the projection Π_2 (see Definition 2.1).

Lemma 4.15 Fix a v ∈ Π_2 S and z ∈ H_v. Then we have the following.
Recall that the set H_v is finite due to part (A) of Assumption 3.1. Proposition 4.1 shows that the mapping t → β^N_θ(t, v, z, e) is continuously differentiable, and hence the mappings t → ρ^N_{k,θ}(t, v, z) and t → Θ^N_{k,θ}(t, v, z, e) are also continuously differentiable. To define Θ^N_{k,θ}(t, v, z, ·) by (4.66) we do the following: we set Θ^N_{k,θ}(t, v, z, z) = 1 and Θ^N_{k,θ}(t, v, z, e) = 0 for all e ∈ H_v − {z}. Lemma 4.10 Fix a v ∈ Π_2 S, z ∈ H_v and t ≥ 0.
Remark 4.14 Note that for any v ∈ Π_2 S and z ∈ H_v, the mapping t → f^N_θ(t, v, z) is continuously differentiable with respect to t. Let ∂f^N_θ(t, v, z)/∂t denote the derivative of this map. Since f is polynomially growing with respect to the projection Π_2, the sequences of functions {f^N_θ : N ∈ N}, {∂f^N_θ/∂t : N ∈ N} and {B^N_θ f^N_θ : N ∈ N} are also polynomially growing with respect to the projection Π_2.
Remark 4.18 In proving Theorem 4.3, we assumed that the set H_v is finite for any v ∈ Π_2 S (see part (A) of Assumption 2.4). This means that if the state of the "natural" dynamics is v, then the "fast" dynamics is constrained to a compact set H_v. This assumption can be relaxed at the expense of making the proof more technical. The only place where finiteness of H_v is crucial is the proof of Proposition 4.4. As explained in Remark 4.7, this proposition can be extended to Markov chains with countable state spaces. Assuming the existence of a suitable Lyapunov function for the fast dynamics, the proof of Theorem 4.3 then goes through with minor modifications.

Table 1: Example of the Heat Shock Response Model (columns: Reaction Number, Reaction, Scaling Factor, Propensity Function, Vector; the first listed reaction is S_1 −→ S_2).

Table 2: Estimation of the sensitivity value for f(x_1, x_2, x_3) = x_3 (columns: Sensitivity Value, Number of Samples, CPU time (s)).

Table 3: Estimation of the sensitivity value for f(x_1, x_2, x_3) = x_1 (columns: Sensitivity Value, Number of Samples, CPU time (s)).