Strongly universally consistent nonparametric regression and classification with privatised data

In this paper we revisit the classical problem of nonparametric regression, but impose local differential privacy constraints. Under such constraints, the raw data $(X_1,Y_1),\ldots,(X_n,Y_n)$, taking values in $\mathbb{R}^d \times \mathbb{R}$, cannot be directly observed, and all estimators are functions of the randomised output from a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and here we add Laplace distributed noise to a discretisation of the location of a feature vector $X_i$ and to the value of its response variable $Y_i$. Based on this randomised data, we design a novel estimator of the regression function, which can be viewed as a privatised version of the well-studied partitioning regression estimator. The main result is that the estimator is strongly universally consistent. Our methods and analysis also give rise to a strongly universally consistent binary classification rule for locally differentially private data.


Introduction
In recent years there has been a surge of interest in data analysis methodology that is able to achieve strong statistical performance without compromising the privacy and security of individual data holders. This has often been driven by applications in modern technology, for example by Google [16], Apple [30], and Microsoft [11], but the study goes at least as far back as [35], and privacy-preserving methods are often used in the more traditional fields of clinical trials [32,8] and census data [25,14]. While there has long been an awareness that sensitive data must be anonymised, it has become apparent only relatively recently that simply removing names and addresses is insufficient in many cases [e.g. 29,26]. The concept of differential privacy [15] was introduced to provide a rigorous measure of the amount of private information about individuals that published statistics contain. Statistical treatments of this framework include [36,23,2,6].
Although it is a suitable constraint for many problems, procedures that are differentially private often require the presence of a third party, who may be trusted to handle the raw data before statistics are published. To address this shortcoming, the local differential privacy constraint [see, for example, 21, 12, and the references therein] was introduced to provide a setting where analysis must be carried out in such a way that each raw data point is only ever seen by the original data holder. The simplest example of a locally differentially private mechanism is the randomised response [35] used with binary data, but mechanisms have also been developed for tasks such as classification [3], generalised linear modelling [12], empirical risk minimisation [33], density estimation [5], functional estimation [27] and goodness-of-fit testing [4].
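To make the randomised response mechanism mentioned above concrete, the following is a minimal sketch (the function names and the debiasing helper are ours, not from [35]): a bit is reported truthfully with probability $e^\alpha/(1+e^\alpha)$, so the two conditional output distributions differ by a factor of at most $e^\alpha$, which is exactly the $\alpha$-LDP guarantee discussed below.

```python
import math
import random

def randomised_response(bit, alpha):
    """Warner's randomised response for a bit in {0, 1}: report the true
    bit with probability e^alpha / (1 + e^alpha), otherwise flip it.
    The two conditional output distributions differ by a factor of at
    most e^alpha, so the report is alpha-LDP."""
    p_true = math.exp(alpha) / (1.0 + math.exp(alpha))
    return bit if random.random() < p_true else 1 - bit

def debias(reports, alpha):
    """Unbiased estimate of the true proportion of ones from the noisy
    reports: if theta is the true proportion, the mean report has
    expectation (2p - 1) * theta + (1 - p), which we invert."""
    p = math.exp(alpha) / (1.0 + math.exp(alpha))
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)
```

The aggregated estimate remains consistent despite each individual report being noisy, which is the template for the mechanism introduced in Section 3.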
Regression is a cornerstone of modern statistical analysis, routinely used across the sciences and beyond. We recall that, in the standard stochastic model, a regression estimator predicts an unknown random response, with finite second moment, from an observed $d$-dimensional random feature vector. The regression function, given by the conditional expectation of the response given the feature vector, achieves the minimum mean squared error. Typically, the statistician does not know the underlying stochastic structure, but has access to a finite sample of independent and identically distributed design-response vectors in $\mathbb{R}^d \times \mathbb{R}$, and on this basis estimates the regression function. The background is given at the beginning of Section 2, and in what follows we shall refer several times to the monograph [19]. A binary classification (pattern recognition) rule predicts an unknown random response taking values in $\{-1, 1\}$ from a feature vector. The so-called Bayes decision rule achieves the minimum error probability (the Bayes error). Given a finite sample of i.i.d. design-response vectors in $\mathbb{R}^d \times \{-1, 1\}$, the Bayes rule is approximated. We formulate this setup in Section 5; the monograph [9] contains a detailed theory of nonparametric classification.
While regression has been relatively well-studied in the non-local model of differential privacy [e.g. 6], results in the local model are scarce. [37] studies sparse linear regression, kernel ridge regression and GLMs. [28,33] study parametric empirical risk minimisation. [34] studies sparse linear regression. [12,13] study GLMs. The recent work [17] concerns a relaxed version of the locally private regression model where responses can be observed exactly, and empirically studies a Nadaraya-Watson-type estimator, but we are unaware of any other work on locally private nonparametric regression. The simpler problem of binary classification is studied in [3], but there are significant additional challenges in designing a suitable estimator for the regression problem.
In this paper we introduce and investigate a new method for nonparametric regression under $\alpha$-local differential privacy constraints, and also present a corresponding classification rule. For regression, our procedure combines a simple non-interactive privacy mechanism with a cubic partitioning regression estimate modifying the regressogram, which was originally introduced by [31] and has been well-studied since [see, e.g., 19, Chapter 4 and Section 23.1, and the references therein]. In Section 3 we describe the procedure and show that the resulting sequence of estimates is strongly universally consistent, in the sense that the $L_2$-risk converges almost surely to zero in the large-sample limit for any data-generating distribution for which the response has a finite second moment. Moreover, we give an upper bound on the rate of convergence of this estimator. Let us mention that in the degenerate case without privacy the estimator reduces to the strongly universally consistent partitioning estimator of [18]. Since the problem of classification is strictly easier than regression, our methods and analysis also give rise to a strongly universally consistent binary classification rule for locally differentially private data.
The remainder of the paper is organised as follows. In Section 2 we introduce the necessary background on regression and local differential privacy. In Section 3 we introduce our privacy mechanism and estimators, and state our main results in the regression setting, discussing their implications for local differential privacy in Section 4. In Section 5 we study the consequences of the results for binary classification. All proofs will be deferred to Section 6.

Preliminaries
Let $(X, Y)$ be a pair of random variables such that the feature vector $X$ takes values in $\mathbb{R}^d$ and its response variable $Y$ is a real-valued random variable with $\mathbb{E}[Y^2] < \infty$. We denote by $\mu$ the distribution of the feature vector $X$, that is, for all measurable sets $A \subset \mathbb{R}^d$, we have $\mu(A) = \mathbb{P}\{X \in A\}$. Then the regression function
$$m(x) = \mathbb{E}[Y \mid X = x] \qquad (1)$$
is well defined for $\mu$-almost all $x$. For each measurable function $g \colon \mathbb{R}^d \to \mathbb{R}$ we have
$$\mathbb{E}\big[(g(X) - Y)^2\big] = \mathbb{E}\big[(m(X) - Y)^2\big] + \int (g(x) - m(x))^2 \, \mu(dx), \qquad (2)$$
therefore, with the notation
$$L^* = \mathbb{E}\big[(m(X) - Y)^2\big],$$
the regression function attains the minimum mean squared error $L^*$ over all measurable predictors. We measure the performance of an estimator $m_n$ of $m$ through the loss
$$\int (m_n(x) - m(x))^2 \, \mu(dx),$$
which, by (2), may be interpreted as the excess prediction risk for a new observation $X$.
In this paper we are mainly concerned with regression estimates $m_n$ based on partitions of the sample space, which were originally studied by [31]. Let $h > 0$ and let $\mathcal{P}_h = \{A_{h,j} : j \in \mathbb{N}\}$ be the partition of $\mathbb{R}^d$ into cubes of side length $h$, and for $x \in \mathbb{R}^d$ write $Q_h(x)$ for the index $j$ of the cell $A_{h,j}$ containing $x$. The raw data will be independent and identically distributed copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the random vector $(X, Y)$, and the estimators that we consider will be (randomised) functions of the binned data, defined by
$$\{(Q_h(X_i), Y_i) : i = 1, \ldots, n\}.$$
Using this binned data, when we do not have to satisfy privacy constraints, one may create a scheme for a public data set as follows: there are $n$ individuals in the study such that individual $i$ generates the sample pair $(X_i, Y_i)$ and submits the discretised version $(Q_h(X_i), Y_i)$ to a data collector. The data collector calculates the empirical distributions
$$\mu_n(A_{h,j}) = \frac{1}{n}\sum_{i=1}^n I_{\{X_i \in A_{h,j}\}} \qquad \text{and} \qquad \nu_n(A_{h,j}) = \frac{1}{n}\sum_{i=1}^n Y_i I_{\{X_i \in A_{h,j}\}}.$$
Then, the public data set
$$D_{n,h} = \big\{\big(j, \mu_n(A_{h,j}), \nu_n(A_{h,j})\big) : \mu_n(A_{h,j}) > 0\big\}$$
is published. The data set $D_{n,h}$ has the favourable property that, with high probability, its size $\#(D_{n,h})$ is much less than $n$ [cf. 24].
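The binned public data set admits a compact sparse representation, since only occupied cells need to be published. The following sketch (our own illustrative code) stores, per occupied cell index, the pair $(n\mu_n(A_{h,j}), n\nu_n(A_{h,j}))$, i.e. a count and a response sum:

```python
from collections import Counter, defaultdict
import math

def public_dataset(data, h):
    """Sketch of the (non-private) public data set D_{n,h}: for each
    occupied cell publish the cell index together with the count
    n * mu_n(A_{h,j}) and the response sum n * nu_n(A_{h,j}).  The size
    of the output is the number of occupied cells, typically much
    smaller than n."""
    counts = Counter()
    sums = defaultdict(float)
    for x, y in data:
        j = tuple(math.floor(xi / h) for xi in x)  # cell index Q_h(x)
        counts[j] += 1
        sums[j] += y
    return {j: (counts[j], sums[j]) for j in counts}
```

With many sample points per cell, `len(public_dataset(data, h))` is far below the sample size, which is the property noted in the text.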
Using this binned data, and allowing $h = h_n$ to depend on the sample size, the partitioning regression estimate is defined by
$$m_n(x) = \frac{\sum_{i=1}^n Y_i I_{\{X_i \in A_{h_n,j}\}}}{\sum_{i=1}^n I_{\{X_i \in A_{h_n,j}\}}} = \frac{\nu_n(A_{h_n,j})}{\mu_n(A_{h_n,j})} \qquad \text{for } x \in A_{h_n,j},$$
where $0/0$ is $0$ by definition and $I$ denotes the indicator function. In order to obtain strong universal consistency, [19] modify the partitioning regression estimate as follows:
$$m_n(x) = \frac{\nu_n(A_{h_n,j})}{\mu_n(A_{h_n,j})} I_{\{\mu_n(A_{h_n,j}) > \log n / n\}} \qquad \text{for } x \in A_{h_n,j}.$$

Theorem 1 ([19]). If $h_n \to 0$ and $nh_n^d/\log n \to \infty$, then the modified partitioning estimate satisfies
$$\lim_{n \to \infty} \int (m_n(x) - m(x))^2 \, \mu(dx) = 0 \quad \text{a.s.}$$
for any distribution of $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$.

There is a huge literature on weak and strong universal consistency of regression estimates. Weak universal consistency means convergence of the expected $L_2$ error to zero for any distribution of $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$. For the weak universal consistency of local averaging regression estimates $m_n$, which include partitioning estimates, kernel estimates and nearest neighbour estimates, we refer to Chapters 4-6 of [19].
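The modified partitioning estimate can be sketched in a few lines (an illustrative implementation of ours; the `threshold` argument plays the role of $\log n/n$ in the modification above, and cells below it, including empty cells, return zero):

```python
import math
from collections import defaultdict

def cell_index(x, h):
    """Index of the cube of side length h containing x (componentwise)."""
    return tuple(math.floor(xi / h) for xi in x)

def partitioning_estimate(data, h, threshold=0.0):
    """Non-private partitioning regression estimate: within each cell,
    average the responses; cells whose empirical mass mu_n(A_{h,j}) does
    not exceed `threshold` are mapped to zero."""
    n = len(data)
    count = defaultdict(int)    # n * mu_n(A_{h,j})
    total = defaultdict(float)  # n * nu_n(A_{h,j})
    for x, y in data:
        j = cell_index(x, h)
        count[j] += 1
        total[j] += y

    def m_n(x):
        j = cell_index(x, h)
        if count[j] / n > threshold:
            return total[j] / count[j]
        return 0.0  # empty or low-mass cell

    return m_n
```

The estimate is piecewise constant on the cubic partition, which is what makes the binned (and later privatised) representation sufficient for its computation.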

Our regression estimation method and its strong universal consistency
Similarly to [3], we consider locally privatised data generated as follows: the privacy mechanism is formulated by independent double arrays $\{\epsilon_{i,j}\}$ and $\{\zeta_{i,j}\}$ such that the elements of the arrays are i.i.d. with centred, unit-variance Laplace distributions. Choose a sphere $S_n$ centred at the origin. Assume that the cells $A_{h,j}$ are numbered such that $A_{h,j} \cap S_n \neq \emptyset$ when $j \leq N_n$ for some integer $N_n > 0$, and $A_{h,j} \cap S_n = \emptyset$ otherwise. For $i = 1, \ldots, n$ and for $j = 1, \ldots, N_n$, individual $i$ generates and transmits the data
$$Z_{i,j} = [Y_i]_{-M}^{M} I_{\{X_i \in A_{h,j}\}} + \sigma_Z \zeta_{i,j} \qquad (5)$$
and
$$W_{i,j} = I_{\{X_i \in A_{h,j}\}} + \sigma_W \epsilon_{i,j}, \qquad (6)$$
where $\sigma_Z > 0$, $\sigma_W > 0$, and $[y]_{-M}^{M} = \max(-M, \min(M, y))$ denotes truncation at level $M = M_n > 0$. This means that individual $i$ generates noisy data for every cell $A_{h,j}$ with $j \leq N_n$. Proposition 1 in Section 4 shows that, for suitable choices of $\sigma_W$ and $\sigma_Z$, this mechanism satisfies the $\alpha$-LDP constraint. For such $\sigma_W, \sigma_Z$, the data set
$$\tilde{D}_{n,h} = \{(W_{i,j}, Z_{i,j}) : i = 1, \ldots, n, \; j = 1, \ldots, N_n\}$$
is $\alpha$-locally differentially private. Now that we have introduced our privacy mechanism we may define our estimator of $m$ based on $\tilde{D}_{n,h}$. For $c_n > 0$ we define
$$\tilde{\mu}_n(A_{h_n,j}) = \frac{1}{n}\sum_{i=1}^n W_{i,j}, \qquad \tilde{\nu}_n(A_{h_n,j}) = \frac{1}{n}\sum_{i=1}^n Z_{i,j},$$
and
$$\tilde{m}_n(x) = \frac{\tilde{\nu}_n(A_{h_n,j})}{\tilde{\mu}_n(A_{h_n,j})} I_{\{\tilde{\mu}_n(A_{h_n,j}) > c_n h_n^d\}} \qquad \text{for } x \in A_{h_n,j}, \; j \leq N_n, \qquad (3)$$
with $\tilde{m}_n(x) = 0$ otherwise. This is a novel estimator that extends the classical partitioning regression estimate to the LDP setting. In non-private settings such estimators may be seen as averaging the value of the response over each element of the partition, but here we are unable to retain this interpretation, as we cannot know exactly how many data points fall in each cell. This lack of knowledge is particularly problematic in low-density regions, where the estimate of $\mu$ is necessarily especially noisy, and where our estimator must be carefully defined. A crucial component of the estimate is the way in which it detects empty cells and truncates. If $X$ has a density, then $\mu(A_{h_n,j})$ is of order $h_n^d$. Furthermore, on the support of an arbitrary $\mu$, $\mu(A_{h_n,j})/h_n^d$ is bounded away from zero. More precisely, if $A_n(x)$ stands for the cube $A_{h_n,j}$ containing $x$, then
$$\liminf_{n \to \infty} \frac{\mu(A_n(x))}{h_n^d} > 0$$
for $\mu$-almost all $x$, at least when we have a nested sequence of partitions; see Lemma 24.10 in [19]. Thus, for arbitrary $\mu$, the order of $\mu(A_{h_n,j})$ is at least $h_n^d$. Therefore, $c_n \to 0$ implies that $\mu(A_{h_n,j}) > c_n h_n^d$ for all large enough $n$.
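As a concrete illustration, the mechanism (5)-(6) and the estimator (3) can be sketched as follows (the function names are ours; a centred unit-variance Laplace variable has scale parameter $1/\sqrt{2}$, so $\sigma\zeta$ is drawn as Laplace noise with scale $\sigma/\sqrt{2}$). Setting the noise levels to zero recovers the non-private binned computation:

```python
import numpy as np

def privatise(data, cells, h, M, sigma_w, sigma_z, rng):
    """Sketch of the mechanism (5)-(6): individual i releases, for every
    cell index j in `cells`, a noisy truncated response Z_{i,j} and a
    noisy cell indicator W_{i,j}."""
    n = len(data)
    W = np.empty((n, len(cells)))
    Z = np.empty((n, len(cells)))
    for i, (x, y) in enumerate(data):
        j_true = tuple(int(np.floor(xi / h)) for xi in x)
        y_trunc = max(-M, min(M, y))  # [y]_{-M}^{M}
        for k, j in enumerate(cells):
            ind = 1.0 if j == j_true else 0.0
            W[i, k] = ind + sigma_w * rng.laplace(scale=1 / np.sqrt(2))
            Z[i, k] = y_trunc * ind + sigma_z * rng.laplace(scale=1 / np.sqrt(2))
    return W, Z

def private_estimate(W, Z, cells, h, c_n, d):
    """Sketch of the estimator (3): the ratio of the privatised empirical
    measures per cell, with low-mass cells set to zero."""
    mu_t = W.mean(axis=0)  # tilde-mu_n(A_{h,j})
    nu_t = Z.mean(axis=0)  # tilde-nu_n(A_{h,j})
    return {j: (nu_t[k] / mu_t[k] if mu_t[k] > c_n * h ** d else 0.0)
            for k, j in enumerate(cells)}
```

In practice each individual would run `privatise` locally on their single data point and transmit only the noisy rows, so the raw pair $(X_i, Y_i)$ never leaves their device.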
When $\sigma_W = \sigma_Z = 0$ and $c_n = \log n/(nh_n^d)$, we recover the non-private partitioning estimator discussed above, which has access to the raw data.
Our first main new result extends Theorem 1 to the private setting, where $\sigma_W, \sigma_Z > 0$ are fixed, and establishes the strong universal consistency of $\tilde{m}_n$.

Theorem 2. Suppose that, as $n \to \infty$, we have $h_n \to 0$, $c_n \to 0$, $M_n \to \infty$, the radius of $S_n$ diverges, and
$$\frac{(\log n)^3}{nc_n^2h_n^{2d}} \to 0. \qquad (8)$$
Then
$$\lim_{n \to \infty} \int (\tilde{m}_n(x) - m(x))^2 \, \mu(dx) = 0 \quad \text{a.s.}$$
for any distribution of $(X, Y)$ with $\mathbb{E}[Y^2] < \infty$.
The proof of Theorem 2 shows that replacing (8) by $nc_n^2h_n^{2d} \to \infty$ yields the weak universal consistency of $\tilde{m}_n$.
Comparing with Theorem 1, we see that the usual condition $nh_n^d \to \infty$ has been replaced by $nh_n^{2d} \to \infty$. Heuristically, this difference can be understood by considering the properties of $\tilde{\nu}_n(A_{h_n,j})$. Writing $\nu(A) = \mathbb{E}[Y I_{\{X \in A\}}]$, we have
$$\mathbb{E}\big[\tilde{\nu}_n(A_{h_n,j})\big] = \mathbb{E}\big[[Y]_{-M_n}^{M_n} I_{\{X \in A_{h_n,j}\}}\big] \approx \nu(A_{h_n,j}),$$
which is the same as in the non-private case. However, we see a difference when we consider the variance
$$\mathrm{Var}\big(\tilde{\nu}_n(A_{h_n,j})\big) = \frac{1}{n}\Big(\mathrm{Var}\big([Y]_{-M_n}^{M_n} I_{\{X \in A_{h_n,j}\}}\big) + \sigma_Z^2\Big). \qquad (10)$$
In the non-private case, the only contribution is from the first term, which can be seen to typically be $O(h_n^d)$. However, in the private case we will usually take $\sigma_Z$ to be large, and hence the variance in (10) is dominated by the second term, which does not vanish with $h_n$. This occurs in other LDP problems [e.g. 4]; the privacy constraint introduces an unavoidable homoscedastic term into the variance of our estimator, which results in very different behaviour, including a curse of dimensionality that is often more severe than in non-private problems.
The proof techniques used for Theorem 2 can be used to derive upper bounds on the rates of convergence of our estimator for suitable data-generating mechanisms.
Theorem 3. If the regression function $m$ is Lipschitz continuous, $Y$ is bounded and $X$ has a density which is bounded away from zero on its support, then
$$\mathbb{E}\int (\tilde{m}_n(x) - m(x))^2 \, \mu(dx) = O\Big(h_n^2 + \frac{\log n}{nh_n^{2d}}\Big).$$

For suitable choices of $h_n$ and $c_n$, balancing the squared bias of order $h_n^2$ against the variance of order $(\log n)/(nh_n^{2d})$, this upper bound yields the rate $((\log n)/n)^{1/(1+d)}$. We conjecture that $n^{-1/(1+d)}$ is the minimax lower bound over all $\alpha$-LDP privacy mechanisms for Lipschitz continuous regression functions, which would imply that our estimator is minimax optimal up to a factor of $\log n$. Furthermore, the lower bound on the density appears to be crucial; we speculate that if the density is not bounded away from zero, then the rate of convergence of any estimator can be arbitrarily slow.
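To see how such bandwidth choices arise, one can balance the squared bias of a Lipschitz regression function against the private variance term; the following is a heuristic sketch with constants, the privacy level $\alpha$ and the choice of $c_n$ suppressed:

```latex
\underbrace{h_n^2}_{\text{squared bias}}
\;\asymp\;
\underbrace{\frac{\log n}{n h_n^{2d}}}_{\text{variance}}
\quad\Longleftrightarrow\quad
h_n \asymp \Big(\frac{\log n}{n}\Big)^{\frac{1}{2+2d}}
\quad\Longrightarrow\quad
h_n^2 \asymp \Big(\frac{\log n}{n}\Big)^{\frac{1}{1+d}} .
```

In the non-private case the variance term is instead of order $(\log n)/(nh_n^d)$, leading to the classical rate $((\log n)/n)^{2/(2+d)}$; the doubling of $d$ in the exponent is the more severe curse of dimensionality caused by the homoscedastic privacy noise in (10).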

Local differential privacy
As discussed above, when working under privacy constraints, no estimator may have direct access to the raw data $D_n$, or even the binned data $D_{n,h}$. Instead, it is only allowed to depend on randomised data $(Z_1, \ldots, Z_n)$, defined on some measurable space $(\mathcal{Z}^n, \mathcal{B}^n)$, that has been generated conditionally on $D_n$. Formally, a privacy mechanism is a conditional distribution $Q$ such that
$$(Z_1, \ldots, Z_n) \mid \{D_n = ((x_1, y_1), \ldots, (x_n, y_n))\} \sim Q\big(\cdot \mid (x_1, y_1), \ldots, (x_n, y_n)\big).$$
This privacy mechanism is said to be sequentially interactive [12] if it respects the graphical structure under which each $Z_i$ depends on the raw data only through $(X_i, Y_i)$ and the previous outputs $Z_1, \ldots, Z_{i-1}$. In particular, this requires that $Z_i$ is conditionally independent of $(X_j, Y_j)$ given $\{(X_i, Y_i), Z_1, \ldots, Z_{i-1}\}$ for any $j \neq i$, so that $Z_i$ is generated with only the knowledge of $(X_i, Y_i)$ and $Z_1, \ldots, Z_{i-1}$. For this reason, such privacy mechanisms are said to be locally private. Sequentially interactive privacy mechanisms may be specified by a sequence of conditional distributions $Q_1, \ldots, Q_n$, with the interpretation that
$$Z_i \mid \{(X_i, Y_i) = (x, y), Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}\} \sim Q_i(\cdot \mid x, y, z_1, \ldots, z_{i-1}).$$
Given $\alpha > 0$, a sequentially interactive mechanism specified by $(Q_1, \ldots, Q_n)$ is said to be $\alpha$-locally differentially private ($\alpha$-LDP) if
$$\sup_{A \in \mathcal{B}} \sup_{z_1, \ldots, z_{i-1}} \sup_{(x,y), (x',y')} \frac{Q_i(A \mid x, y, z_1, \ldots, z_{i-1})}{Q_i(A \mid x', y', z_1, \ldots, z_{i-1})} \leq e^{\alpha}$$
for each $i = 1, \ldots, n$. Let $\mathcal{Q}_\alpha$ denote the set of all $\alpha$-LDP privacy mechanisms. Our privacy mechanisms, given by (5) and (6), are in fact of a simpler, non-interactive form, in which $Z_i$ is conditionally independent of $Z_j$ given $(X_i, Y_i)$ for all $j \neq i$. In this case we have
$$Q\big(A_1 \times \cdots \times A_n \mid (x_1, y_1), \ldots, (x_n, y_n)\big) = \prod_{i=1}^n Q_i(A_i \mid x_i, y_i)$$
for all $(A_1, \ldots, A_n) \in \mathcal{B}^n$. Such mechanisms satisfy the $\alpha$-LDP constraint if and only if
$$\sup_{A \in \mathcal{B}} \sup_{(x,y), (x',y')} \frac{Q_i(A \mid x, y)}{Q_i(A \mid x', y')} \leq e^{\alpha}$$
for each $i = 1, \ldots, n$. Non-interactive mechanisms are computationally attractive in practice, as they require minimal communication between the statistician and the original data holders, and in large-scale applications there are many practical barriers to interactivity [20].
The following result studies the local differential privacy of the mechanism given by (5) and (6) in the case that $N_n = \infty$; it is a straightforward consequence of this that the mechanism satisfies the same bound when $N_n < \infty$.

Proposition 1. Consider the privacy mechanism defined in (5) and (6) with $N_n = \infty$. Then the mechanism is $\alpha$-LDP with $\alpha = 2^{3/2}(1/\sigma_W + M/\sigma_Z)$.

Given $\alpha > 0$, we can therefore ensure that our privacy mechanism is $\alpha$-LDP by choosing $M, \sigma_W, \sigma_Z$ such that $2^{3/2}(1/\sigma_W + M/\sigma_Z) \leq \alpha$. This is satisfied if, for example, we take $\sigma_W^2 = 32/\alpha^2$ and $\sigma_Z^2 = 32M^2/\alpha^2$. In problems of differential privacy one often wants to work in a high-privacy regime, where $\alpha \to 0$ as $n \to \infty$. With our privacy mechanism this requires that $\min(\sigma_W, \sigma_Z/M_n) \to \infty$, and so we remark that Theorem 2 can easily be extended to the setting in which the variances $\sigma_Z^2$ and $\sigma_W^2$ depend on the sample size $n$. Replacing the condition (8) with
$$\frac{(\log n)^3 \max(1, M_n^2/\alpha^2)}{nc_n^2h_n^{2d}} \to 0,$$
a straightforward extension of the proof of Theorem 2 yields strong universal consistency. Choosing $\sigma_Z \asymp M_n/\alpha$, with $M_n \to \infty$ and the displayed condition satisfied, our mechanism satisfies the $\alpha$-LDP constraint and strong universal consistency holds.
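The calibration above can be checked directly. The following sketch (helper names are ours) computes the privacy budget guaranteed by Proposition 1 and the suggested noise levels, which in fact spend the budget exactly, splitting it equally between the two noise channels:

```python
import math

def ldp_budget(sigma_w, sigma_z, M):
    """Privacy level guaranteed by Proposition 1 for the mechanism (5)-(6):
    the mechanism is alpha-LDP whenever this value is at most alpha."""
    return 2 ** 1.5 * (1.0 / sigma_w + M / sigma_z)

def calibrate(alpha, M):
    """The choices from the text: sigma_W^2 = 32 / alpha^2 and
    sigma_Z^2 = 32 M^2 / alpha^2, i.e. sigma_W = sqrt(32)/alpha and
    sigma_Z = sqrt(32) M / alpha."""
    return math.sqrt(32.0) / alpha, math.sqrt(32.0) * M / alpha

sigma_w, sigma_z = calibrate(1.0, 5.0)
# The calibrated pair spends exactly the budget alpha = 1.0:
# 2^{3/2} (alpha/(4 sqrt 2) + alpha/(4 sqrt 2)) = alpha.
assert abs(ldp_budget(sigma_w, sigma_z, 5.0) - 1.0) < 1e-12
```

Splitting the budget equally is one natural choice; any pair with `ldp_budget(...) <= alpha` satisfies the constraint.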

Consequences in classification
For the setup of binary classification, let the feature vector $X$ take values in $\mathbb{R}^d$, and let its label $Y$ take values in $\{-1, 1\}$. If $g$ is an arbitrary decision function then its error probability is denoted by $L(g) = \mathbb{P}\{g(X) \neq Y\}$. The Bayes decision rule $g^*$, given by
$$g^*(x) = \operatorname{sign} m(x),$$
where $\operatorname{sign}(z) = 1$ for $z > 0$ and $\operatorname{sign}(z) = -1$ for $z \leq 0$, minimises the error probability. Let $L^* = \mathbb{P}\{g^*(X) \neq Y\}$ denote its error probability. For privatised data, the partitioning classification rule is defined by
$$g_n(x) = \operatorname{sign} \tilde{\nu}_n(A_{h_n,j}) \qquad \text{for } x \in A_{h_n,j}.$$
Note that this rule does not use the data $\{W_{i,j}\}$. Under the conditions $\lim_{n\to\infty} h_n = 0$ and $\lim_{n\to\infty} nh_n^{2d} = \infty$, [3] showed that the partitioning classification rule $g_n$ is weakly universally consistent, i.e., $\lim_{n\to\infty} \mathbb{E}\{L(g_n)\} = L^*$ for any distribution of $(X, Y)$. Our work here allows us to strengthen this result to the following theorem on strong universal consistency.

Theorem 4. If $h_n \to 0$ and $nh_n^{2d}/\log n \to \infty$, then $\lim_{n\to\infty} L(g_n) = L^*$ almost surely, for any distribution of $(X, Y)$.

The rates of convergence of the classification rule $g_n$, over classes of data-generating mechanisms satisfying Hölder continuity and a strong density assumption, were established in [3], and were moreover shown to match a minimax lower bound. We remark that, even in the non-private case, the absence of the strong density assumption leads to slower rates of convergence [22,1,7].
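Since only the sign of the per-cell privatised response sum matters, the classification rule is even simpler to implement than the regression estimate. A minimal sketch (our own illustrative code; the default label $-1$ outside the covered cells mirrors the convention $\operatorname{sign}(0) = -1$):

```python
import numpy as np

def private_classifier(Z, cells, h):
    """Sketch of the privatised partitioning classification rule: predict
    the sign of the privatised per-cell response average tilde-nu_n(A_{h,j}).
    The noisy counts {W_{i,j}} are not needed, since dividing by a positive
    mass estimate would not change the sign."""
    nu_t = np.asarray(Z).mean(axis=0)  # tilde-nu_n per cell
    lookup = {j: (1 if nu_t[k] > 0 else -1) for k, j in enumerate(cells)}

    def g_n(x):
        j = tuple(int(np.floor(xi / h)) for xi in x)
        return lookup.get(j, -1)  # outside the covered cells, default to -1

    return g_n
```

Here `Z` is the array of privatised responses released by the mechanism of Section 3, so the rule operates entirely on $\alpha$-LDP data.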

Lemma 1.
Proof. Taking t = nε/2 and using the fact that log(1−x) ≥ −2x for x ∈ [0, 1/2], we have An analogous bound holds for the lower tail of the distribution, and the result follows.
with a universal constant c * < 5.1.
Proof. Applying Jensen's inequality, this lemma is a special case of Lemma 4.4 in [10].

Proof of Theorem 2. We use the decomposition
$$\tilde{m}_n(x) - m(x) = \big(\tilde{m}_n(x) - m_n(x)\big) + \big(m_n(x) - m(x)\big),$$
where for x ∈ A hn,j we write It suffices to show that lim n→∞ m n (x) 2 μ(dx) = 0 a.s., (12) and But (4) implies that for any distribution of (X, Y ) with E(Y 2 ) < ∞, and in order to prove (13) it therefore suffices to show that for any distribution of (X, Y ) with E(Y 2 ) < ∞.
Proof of (15). If in the definition of m n we modify ν n such that then a slight modification of the proof of Theorem 23.3 in [19] together with the condition M n → ∞ implies (4), too. We have that Note that Let A n (x) denote the cube A hn,j , which contains x. Then, Define the notation so that μ * and μ * n are bounded measures. Thus, a very similar argument to that used to prove (16) shows that where we use the fact that Then the Cauchy-Schwarz inequality, (14) and (17) imply The fact that G n → 0 a.s. follows from (14). We now turn to H n . Since we have m(x) 2 μ(dx) < ∞, it suffices to show that i.e., By the inequality . (16) implies that the first term tends to 0 a.s. Concerning the second term, we observe that, by the fact that Bernoulli random variables are subgaussian with variance proxy bounded by 1/4, there exists L > 0 such that for any q ∈ N we have Thus, the second term tends to zero a.s. by using a very similar argument to that used to prove (16). Finally, the third term is non-random. Let S be a sphere centred at the origin such that μ(S c ) ≤ ε, and set If λ denotes the Lebesgue measure, then Thus, we proved that H n → 0 a.s.

Proof of Theorem 3.
Since $Y$ is bounded, we may assume that $n$ is sufficiently large that $[Y]_{-M_n}^{M_n} = Y$ almost surely. We use the decomposition where for $x \in A_{h_n,j}$ we recall from the proof of Theorem 2 that It suffices to show that and But, recalling the definition of $\tilde{m}_n$ from (3), Theorem 4.3 of [19] implies that and in order to prove (19) it therefore suffices to show that

Proof of (18). We have that

Proof of (21). We have that where we recall that if $n$ is large enough. Set Letting $L$ denote a bound on $|Y|$, we have Note that Let $A_n(x)$ denote the cube $A_{h_n,j}$ containing $x$. Then, We have that We now turn to $G_n$. By the inequality On the other hand, if $x \in A_{h,j}$ and $x' \in A_{h,j'}$ with $j' \neq j$, then we have It therefore follows that
$$\frac{f_{W,Z|X,Y}(w, z \mid x, y)}{f_{W,Z|X,Y}(w, z \mid x', y')} \leq \exp\big(2^{3/2}/\sigma_W + 2^{3/2}M/\sigma_Z\big),$$
as required.
Proof of Theorem 4. With the notation $\bar{m}_n(x) = \tilde{\nu}_n(A_{h_n,j})/\mu(A_{h_n,j})$ for $x \in A_{h_n,j}$, the rule $g_n$ has the equivalent form $g_n(x) = \operatorname{sign} \bar{m}_n(x)$.
Theorem 2.2 in [9] implies that By Theorem 23.1 in [19], the first term tends to 0 a.s. Similarly to the previous proof, given $\varepsilon > 0$ let $S$ be a sphere centred at the origin such that $\mu(S^c) \leq \varepsilon$, and set where the last step follows from the condition $nh_n^{2d}/\log n \to \infty$. Therefore, by Markov's inequality and the Borel-Cantelli lemma, we have proved that
$$\limsup_{n \to \infty} \int \big[|\bar{m}_n(x) - m(x)|\big]_0^1 \, \mu(dx) \leq 2\lambda(S)\varepsilon + \varepsilon \quad \text{a.s.}$$
Since $\varepsilon > 0$ was arbitrary, this completes the proof.