Data Confidentiality: a Review of Methods for Statistical Disclosure Limitation and Methods for Assessing Privacy

There is an ever increasing demand from researchers for access to useful microdata files. However, there are also growing concerns regarding the privacy of the individuals contained in the microdata. Ideally, microdata could be released in such a way that a balance between usefulness of the data and privacy is struck. This paper presents a review of proposed methods of statistical disclosure control and techniques for assessing the privacy of such methods under different definitions of disclosure.


Introduction
Article 12 of the Universal Declaration of Human Rights (General Assembly of the United Nations, 1948) states: "No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks." As such, with privacy being viewed as a basic human right by the United Nations, data releasing agencies must make every effort possible to maintain high levels of privacy for the individuals who entrust their data to an agency.
What exactly is meant by privacy? Given a piece of information about an individual, one person may wish to keep that data private while another individual may not particularly care about that specific piece of information. This leads to a good definition of privacy. Fellegi (1972, page 7) used the definition of privacy provided by Professor Weston of Columbia University, which defines privacy as the right "to determine what information about ourselves we will share with others." Privacy considerations of microdata are an increasingly important issue. The amount of data being produced every day pertaining to individuals is unprecedented. Between medical, educational, and human services records, large amounts of data are produced. These types of data are invaluable to researchers in a vast array of fields, driving demand for this data. However, this raw data cannot simply be released to the public for study due to these privacy concerns.
Many agencies rely on publicly released data from the census, and numerous public policy research projects depend on publicly available medical or educational data sets. Further, agencies like the U.S. National Institutes of Health (NIH) urge their data collecting grantees to release their data for public use, but require that this be done in a way that preserves privacy. They state: "In NIH's view, all data should be considered for data sharing. Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data. To facilitate data sharing, investigators submitting a research application requesting $500,000 or more of direct costs in any single year to NIH on or after October 1, 2003 are expected to include a plan for sharing final research data for research purposes, or state why data sharing is not possible." Often, the most interesting data for research can be extremely sensitive information about an individual that must remain private for ethical or even legal reasons (e.g., the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA)). HIPAA creates a legal protection for individuals who wish to keep their medical records private, whereas FERPA provides individuals with legal protection of their educational data. Data collecting organizations have a further incentive to maintain the privacy of their respondents' data that goes beyond ethics or the law: if respondents feel that their data are at risk for disclosure, they may be less likely to be completely honest in their responses. This may cause respondents to alter responses or simply not respond at all to some surveys. Therefore, trust between a data collecting agency and its respondents is very important.
Ideally, any useful collected data set could be released to the public for research with the implicit trust that the data would not be used for inappropriate purposes. However, groups or individuals often have incentives to use data maliciously. For example, in 1995, prior to the passage of HIPAA, Woodward (1995) described a case involving a banker from Maryland who obtained a list of patients with cancer. Using the list of patients with cancer along with a list of clients with outstanding loans, the banker sought to match individuals across both lists. When a match was found, he then called in the loans of the clients who had cancer. Today, with the regulations of HIPAA, private medical information cannot simply be released to the public. As such, institutions that wish to release sensitive data must take steps to protect the identity of the individuals in the data.
The first, most basic step in maintaining privacy is to remove variables such as name, social security number, and home address. Agencies strive to do their best to de-identify the data so that the privacy of the individual remains intact, while still providing researchers with useful data that they can use to draw useful, correct conclusions. However, simply removing these obvious identifiers is not always enough to maintain the privacy of an individual. For instance, several years ago the Massachusetts Group Insurance Commission released data to the public for research that was stripped of obvious identifiers. Sweeney (2002b) used this data, along with publicly available voting records, to identify the released medical information of former Massachusetts governor William Weld.
Sweeney (2002b, page 2) went on to say "...87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. Clearly, data released containing such information about these individuals should not be considered anonymous. Yet, health and other person-specific data are often publicly available in this form." Thus, simply removing obvious identifiers from the data is not always adequate to maintain the privacy of the individual. More rigorous procedures are required to achieve privacy.
It is this type of disclosure, from what Sarathy and Muralidhar (2002a) referred to as "snoopers", that is discussed here, as opposed to, say, privacy breaches from unauthorized users of a database (hackers). Sarathy and Muralidhar (2002a, page 1) stated: "The security threat posed by snoopers generally takes the form of undesired inferences about confidential data using other data available either within or outside the database." We view all data discussed as rectangular data, with each row representing an observation and each column representing a variable; however, the rectangle need not be complete. For some methods, rectangular data is expressed in tabular format, and the discussed techniques for tabular data would be applied.
While we consider this to be a thorough review, the breadth of the topic is vast, and we do not attempt to cover all papers on the topic. Another very good review of disclosure control techniques which protect against this type of disclosure can be found in Skinner (2009).
This manuscript discusses methods for limiting statistical disclosure in Section 2. Section 3 discusses measures for assessing the privacy of statistical disclosure control techniques, while Section 4 concludes the manuscript with a summary.

Releasing microdata to the public in a private way
Microdata are data containing observations at the individual level. When this type of data is released for research purposes, the very first action taken to maintain confidentiality is the removal of obvious identifiers such as name, address, social security number, zip code, etc. However, as mentioned above, this is not always enough to protect the privacy of the individual from an inferential disclosure, which can occur, for example, when an individual in the released microdata has some outlying or unique trait (e.g., a very large income or a rare occupation).
In this section, we discuss different proposed privacy preserving techniques for releasing data for research. We start by discussing basic privacy preserving methods employed by agencies for releasing data. This is followed by several other proposals for maintaining privacy, including matrix masking, data swapping, and synthetic data. Adam and Worthmann (1989) and Duncan and Pearson (1991) both presented good reviews of some of the methods mentioned in this section.

Basic methods for limiting disclosure risk
After removing obvious identifiers, some of the most basic methods for maintaining privacy of publicly released data sets employed by data releasing agencies (e.g., the U.S. Census Bureau) include limitation of detail, top/bottom coding, cell suppression, rounding, and the addition of noise.

1. Limitation of detail: This technique includes recoding variables into intervals and collapsing together categories in which only a small number of observations appear. For example, the U.S. Census Bureau does not release geographic identifiers that would leave a sub-population with fewer than 100,000 observations (Moore, 1996).
2. Top/bottom coding: This technique can help reduce the disclosure risk of extreme values in the data by limiting the largest (or smallest) value possible for a given variable. For example, if an individual has an extremely large salary, rather than reporting the exact amount, which would make the observation vulnerable to disclosure, an agency may simply report it as "over $100,000". Likewise, negative values of income could be recoded to be "less than $0" to avoid extremely large negative values.
3. Suppression: In a contingency table, cells with too few observations cannot be released to the public, as it may be easy to infer the identity of these individuals. A simple procedure for controlling disclosure is suppression of these cells. Similarly, if the values of some combination of variables are unique or nearly unique in the data, the individuals with these rare combinations may be easily re-identified. Therefore, these observations could be suppressed as one possible method for maintaining confidentiality (Cox, 1980, 1984; Mugge, 1983; Cox et al., 1987).
4. Rounding: Rounding is another method to limit statistical disclosure of data. Random rounding involves deciding on a rounding base and then rounding each observation up or down to an adjacent multiple of the rounding base. Rounding up or down is decided randomly, based on how close the observation is to each adjacent multiple of the rounding base. For example, if the rounding base is 10 and 7 was observed, 7 would be rounded up with probability 0.7 and rounded down with probability 0.3. One could also use controlled rounding, which allows the sum of the rounded values to be the same as the rounded value of the sum of the original data (Cox, 1984, 1987; Cox et al., 1987).
5. Addition of noise: Rather than release the actual values of the data, noise is added to the data in an attempt to prevent a linkage attack from occurring. The perturbed data can be correctly analyzed by accounting for the extra variability from the added noise. For continuous data, noise addition is discussed in Fuller (1993), and for discrete data a technique called the Post Randomization Method (PRAM) (Gouweleeuw et al., 1998) can be applied.
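The random rounding and top coding rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `random_round` and `top_code` are hypothetical, values are assumed non-negative, and the cap is whatever threshold an agency might choose.

```python
import math
import random


def random_round(value, base, rng):
    """Randomly round a non-negative value to an adjacent multiple of base.

    The value is rounded up with probability remainder / base, so the
    expected rounded value equals the original value.
    """
    lower = math.floor(value / base) * base
    remainder = value - lower
    if remainder == 0:
        return lower  # already a multiple of the base
    return lower + base if rng.random() < remainder / base else lower


def top_code(value, cap):
    """Top coding: any value above the cap is reported as the cap."""
    return min(value, cap)
```

With a rounding base of 10, an observed 7 is reported as 10 about 70% of the time and as 0 otherwise, which matches the example in the text.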

Sampling
Sampling is a very powerful tool in limiting disclosure risk of released microdata files, especially against linkage attacks. For instance, a malicious user may try to match an observation in a released set of microdata to another observation in a data set which could identify the individual. However, simply matching a record in the released data file does not mean that the match is correct. Skinner et al. (1994) pointed out that "Population uniqueness will be a sufficient condition for an exact match to be verified as correct." If the released microdata are a sample, this makes it difficult to verify population uniqueness, which is one of the key benefits of sampling.
Other benefits of sampling as a method of disclosure control are that it is easy to implement and the resulting sampled data are relatively easy to analyze.

Matrix masking
Cox (1980) and Cox (1994) proposed a statistical disclosure limitation (SDL) method called matrix masking. Consider an $n$ by $p$ data matrix, $X$, consisting of $n$ observations and $p$ variables. Rather than release the data $X$, one could release the data $Y = AXB + C$, where $A$, $B$, and $C$ are appropriate conformable matrices. By properly defining the matrices $A$, $B$, and $C$, special cases of matrix masking include: noise addition (Fuller, 1993), sampling, suppressing sensitive variables, cell suppression, and addition of simulated data.
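To make the form Y = AXB + C concrete, the following sketch uses plain Python lists. The helper names `matmul` and `matrix_mask` are hypothetical and the matrices are toy examples chosen for illustration, not taken from Cox (1980, 1994).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matrix_mask(x, a, b, c):
    """Return Y = A X B + C for conformable matrices A, B, and C."""
    axb = matmul(matmul(a, x), b)
    return [
        [axb[i][j] + c[i][j] for j in range(len(c[0]))]
        for i in range(len(c))
    ]


# Toy data: 3 records, 2 variables.
X = [[1, 10], [2, 20], [3, 30]]

# Sampling rows 1 and 3 (A) and suppressing the second variable (B).
A_sample = [[1, 0, 0], [0, 0, 1]]
B_keep_first = [[1], [0]]
C_zero = [[0], [0]]
Y_sampled = matrix_mask(X, A_sample, B_keep_first, C_zero)

# Noise addition: A and B are identities, C holds the noise.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
I2 = [[1, 0], [0, 1]]
C_noise = [[0.5, -1], [0, 0], [1, 2]]
Y_noisy = matrix_mask(X, I3, I2, C_noise)
```

Here A acts on records (rows), B acts on variables (columns), and C displaces individual cells, which is why sampling, variable suppression, and noise addition all fit the same template.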
A drawback to matrix masking is that in order to analyze the data, the analyst must have knowledge of the masking procedure used, and often, even if the consumer knows the masking procedure, the analysis of the data can be complex and special software may be needed. Analysis of masked data is discussed in Little (1993).
Kim (1986) proposed to protect microdata via the addition of noise and transformation. Using the notation of Kim (1986), consider a data set, $x$, consisting of $n$ observations and $p$ variables. Kim (1986) suggested masking the $j$-th variable, $x_j$, by adding noise, $e_j$, drawn from a normal distribution or from the distribution of $x_j$ itself. Thus the masked, released data for the $i$-th observation of the $j$-th variable, $y_{ij}$, will be $y_{ij} = x_{ij} + e_{ij}$, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Kim (1986) further suggested a transformation after the addition of noise of the form $z_{ij} = a y_{ij} + b_j$, where $a$ and $b_j$ are chosen subject to constraints on the first and second moments of $z_j$ and $y_j$. Specifically, $b_j$ is chosen such that $E[x_j] = E[z_j]$, and $a$ can either be chosen so that $Var[x_j] = Var[z_j]$ or based on the specific confidentiality requirements of the application.
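A minimal sketch of noise addition followed by a moment-restoring linear transformation, for a single variable, might look as follows. It assumes (this is not stated in the text) that the noise is normal, independent of the data, with variance equal to a fraction `c` of the variable's variance; under that assumption Var[y] = (1 + c) Var[x], so a = 1 / sqrt(1 + c) and b = (1 - a) E[x] restore the first two moments in expectation. The name `kim_mask` is hypothetical.

```python
import math
import random
import statistics


def kim_mask(x, c, rng):
    """Mask one variable: y = x + e, then z = a * y + b.

    Assumes e ~ Normal(0, c * Var[x]) independent of x, so that
    a = 1 / sqrt(1 + c) and b = (1 - a) * mean(x) give
    E[z] = E[x] and Var[z] = Var[x] in expectation.
    """
    mean_x = statistics.fmean(x)
    sd_x = statistics.pstdev(x)
    y = [xi + rng.gauss(0.0, math.sqrt(c) * sd_x) for xi in x]
    a = 1.0 / math.sqrt(1.0 + c)
    b = (1.0 - a) * mean_x
    return [a * yi + b for yi in y]
```

Larger values of `c` perturb individual records more while the released mean and variance stay close to the originals; the choice of `c` would depend on the confidentiality requirements of the application.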
While Kim (1986) discussed many of the properties of this masking procedure, the paper does not explore the degree to which disclosure is limited, leaving this as a topic for future work. It notes, however, that by properly controlling the value of $a$ in the transformation, the probability of re-identification can be raised or lowered as appropriate. Further, when using this method the bivariate relationships remain intact, and common analyses, such as regression, will perform well using the transformed data.
Bowden and Sim (1992) introduced what they refer to as the privacy bootstrap. Rather than adding random noise from a known distribution, the added noise is based on the empirical distribution of the data via a bootstrapping procedure. Consider the actual data to be $x_i$, $i = 1, \ldots, N$, with mean $\bar{x}$, and let $x_i^\star$ be a randomly sampled observation from the collection of $x_i$, $i = 1, \ldots, N$, where each observation has probability $1/N$ of being selected. Then the released data would be [equation not recovered]. One could also choose to release data as [equation not recovered], with $\alpha$ and $\beta$ chosen based on the specific situation.

Randomized response and Post Randomization Method (PRAM)
Randomized response (Warner, 1965; Greenberg et al., 1969) is a technique used in surveys when the questions being posed are of a sensitive nature (for example, questions about illegal activity, which may make the respondent more likely to lie or simply refuse to respond). The basic idea is that a respondent answers a question truthfully with some probability $p$ or answers the question untruthfully with probability $1 - p$. In this way, the survey taker does not know for sure whether the respondent is telling the truth or not, and a level of confidentiality is maintained. Surveys with randomized response were originally proposed to remove the effect of response bias in surveys that ask sensitive questions.
By using this technique, respondents' privacy is protected, since even if an individual is identified by a data snooper, the snooper cannot be sure whether the response is correct or not. For example, when administering a survey a researcher may ask a question which would easily identify the respondent, such as asking about a rare condition or disease. After the question is asked, the respondent flips a coin and, for example, tells the truth when heads is observed and lies when tails is observed. In this way, even the raw microdata maintains a level of confidentiality. This method could also be applied after raw microdata were collected. For each observation, the real value of a sensitive field would be released with some probability and its opposite would be released with some other probability. Either way, in order to analyze this data, the researcher must have information about the randomization mechanism.
Gouweleeuw et al. (1998) introduced the Post Randomization Method (PRAM), which is used to protect categorical data from disclosure. PRAM perturbs each record in a data file using some probability distribution. This essentially amounts to the addition of noise for categorical variables. One important distinction between PRAM and randomized response is that in randomized response the random mechanism is independent of the true score and applied at the time of collection. With PRAM, however, the true value is known, and one can therefore condition on this value when defining the probability mechanism used to perturb the data.
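The need to know the randomization mechanism can be illustrated with a small simulation of the truth-with-probability-p design for a yes/no question. If a fraction π of respondents truly would answer "yes", the expected fraction of reported "yes" answers is λ = pπ + (1 - p)(1 - π), which can be solved for π whenever p ≠ 0.5. The function names below are hypothetical, and the sketch covers only this simple binary case.

```python
import random


def randomized_response(truth, p, rng):
    """Report the true yes/no answer with probability p, otherwise its opposite."""
    return truth if rng.random() < p else (not truth)


def estimate_prevalence(responses, p):
    """Estimate the true proportion of 'yes' from randomized responses.

    Solves lambda = p * pi + (1 - p) * (1 - pi) for pi, where lambda is
    the observed proportion of reported 'yes' answers.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the responses uninformative")
    lam = sum(responses) / len(responses)
    return (lam - (1 - p)) / (2 * p - 1)
```

Note that with p exactly 0.5 the reported answers carry no information about the truth, so an analyst who wants to estimate prevalence would need a design with p different from 0.5.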

Data swapping and data shuffling
Data swapping was first proposed by Dalenius and Reiss (1982) as a method of disclosure limitation. The proposed procedure was intended to be used for contingency tables within a database. As the name implies, the data are swapped in such a way as to maintain the marginal counts of the table. The swapping procedure adds a layer of protection, while the marginal counts remain intact. Dalenius and Denning (1982) also suggest the possibility of releasing the moments of continuous data rather than the data itself. Moore (1996) identified several desirable properties of data swapping. First, the procedure allows information about each respondent to be masked. Also, swapping only needs to be performed on sensitive variables in order to remove the relationship between the record and the respondent. This leaves nonsensitive variables undisturbed. Finally, as a practical consideration, Moore (1996) noted that the procedure is easy to implement, requiring only a microdata file and a random number generator.
If one is simply interested in univariate statistics, this procedure works very well. One drawback, however, is that it may not maintain multivariate relationships. Also, it is likely that analysis of sub-populations may be affected by the swapping procedure. It is also possible that the swapping may result in nonsensical combinations. For example, if the data contain gender and type of cancer, after a swap the resultant data may contain a record indicating a female with prostate cancer.
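A deliberately simplified sketch of swapping a single sensitive variable is given below. It permutes the sensitive values across records at random, which is only one crude way to swap and is not the specific procedure of Dalenius and Reiss (1982); the function name `swap_sensitive` and the dictionary record layout are assumptions made for illustration.

```python
import random


def swap_sensitive(records, sensitive, rng):
    """Return new records with the sensitive variable permuted across records.

    The univariate distribution of the sensitive variable is unchanged
    and all other variables are left untouched, but its relationship
    with the other variables is not preserved.
    """
    values = [r[sensitive] for r in records]
    rng.shuffle(values)
    return [{**r, sensitive: v} for r, v in zip(records, values)]
```

Because the set of released values is identical to the original set, any univariate summary of the swapped variable is exact, while cross-tabulations with the untouched variables can change, which is the drawback noted above.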
Tables 1A and 1B offer an example of the implementation of data swapping from Fienberg and McIntyre (2004). Table 1A contains the original unperturbed microdata, while Table 1B displays the data after data swapping has occurred. Here, X is the sensitive variable, so data swapping is only performed on X, while Y and Z remain the same in the original and swapped data.
While the original intention of data swapping was to release contingency tables of the swapped data, the procedure can be extended to microdata. However, if one wishes to release microdata, many more swaps must be made to preserve the level of privacy. Identifying the correct number of swaps, Fienberg and McIntyre (2004) noted, is "computationally impractical." As such, it is suggested that the counts be preserved only approximately. This idea is discussed in detail in Reiss (1984). Also, data swapping makes it very difficult to maintain weighted counts when the weights are unequal, which occurs often in surveys.
Table 1. An example of data swapping. (A) contains the unswapped, original values of the data. (B) presents the data after data swapping.
Liew et al. (1985) proposed a swapping method where the released data are random draws and not the original variables. This method requires the identification of the univariate distribution of each variable which is considered to be sensitive for release to the public. Carlson and Salabasis (2002) proposed a procedure that they refer to as the C&S method, which offers an improvement to the proposal put forth in Liew et al. (1985). While they show that their swapping method maintains a large amount of utility, they make no claims or observations as to the confidentiality of their method. Later, Sarathy and Muralidhar (2002a) showed that the C&S method has almost no desirable properties as a method for limiting statistical disclosure.
Moore (1996) outlined a method called rank based proximity swapping, which was proposed in an unpublished article by Greenberg (1987). This procedure can be used for masking data as long as the variables of interest are continuous in nature. The main difference between the Greenberg (1987) swapping procedure and the Dalenius and Reiss (1982) proposal is that the range over which the data can be swapped is restricted. The advantage here is that by limiting what values can be swapped with other values, many of the multivariate relationships can be more appropriately maintained, whereas with Dalenius and Reiss (1982) swapping, these relationships may be lost.
Sarathy and Muralidhar (2002a) went on to propose a further method called data shuffling, based on the conditional distribution approach. In proposing this, they seek a method that performs as well as data swapping, but without the inherent disclosure risks. Under their method, as in data swapping, all of the marginal distributions remain intact. They also show that pairwise monotonic relationships in the original data are maintained in the released shuffled data. They also note that data shuffling does not increase the possibility of disclosure even when the shuffled microdata are released.

Synthetic data
Synthetic data, first proposed by Rubin (1993), is a method of statistical disclosure limitation based on the missing data technique of multiple imputation (Rubin, 1987; Little and Rubin, 1987; Schafer and Graham, 2002; Harel and Zhou, 2007). The idea is to view sensitive data as missing values and replace them using multiple imputation techniques. Thus sensitive attributes would be replaced by random draws from an appropriate posterior predictive distribution.
One can think of the observed microdata as a random sample of size $n$ from a population $P$ of size $N$. The population is made up of background variables of interest, $X = (X_i, i = 1, 2, \ldots, N)$, which might include name, birthdate, address, etc., and survey variables of interest, $Y = (Y_i, i = 1, 2, \ldots, N)$. We randomly take a sample of size $n$ from the population, which yields $Y_{inc} = (Y_{inc,i}, i = 1, 2, \ldots, n)$. Therefore, the observed microdata $D$ consist of the background variables, $X$, and the observed survey variables, $Y_{inc}$. The remaining $N - n$ unsurveyed individuals make up $Y_{exc} = (Y_{exc,j}, j = n + 1, n + 2, \ldots, N)$. Next, multiple imputation is used to replace $Y_{exc}$ with plausible values. These imputations are drawn from the posterior predictive distribution $Pr(Y_{exc} | Y_{inc}, X)$. This process is repeated $M$ times, each time creating a synthetic population $P^{(l)}$ of size $N$ with $l = 1, 2, \ldots, M$. A random sample of size $k$ is then drawn from each synthetic population, $P^{(l)}$, yielding $D^{(l)}$ with $l = 1, 2, \ldots, M$. Thus the released fully synthetic data are $D_{syn} = (D^{(l)}, l = 1, 2, \ldots, M)$.
The data releasing agency may want to take privacy a step further and release neither $X$ nor $Y_{inc}$. In this case, Raghunathan et al. (2003) recommended creating a "future" population by randomly drawing from a posterior predictive distribution $Pr(X_f, Y_f | X, Y_{inc})$, where $X_f$ and $Y_f$ are random variables representing a "future" population. In this case, no actual data are released, which makes linkage attacks very difficult. Fully synthetic data sets are discussed in Raghunathan et al. (2003), Rubin (1993), Reiter (2002, 2004a,b, 2005b), and Matthews et al. (2010b).
Alternatively, an agency could employ partially synthetic data techniques as proposed in Little (1993). Rather than replacing all of the data with imputations, as is the case in the fully synthetic framework, only sensitive attributes are replaced with imputations. This can be done in many ways, including determining that an entire variable or variables must remain private or selecting individual attributes that are at high risk of disclosure. Once an agency decides what values must remain private, it considers those values to be missing and replaces them using multiple imputation techniques. This creates $M$ partially synthetic data sets which will be released. Each partially synthetic data set consists of the non-sensitive data, which will be the same across all $M$ synthetic data sets, and the imputed values of the sensitive data. Partially synthetic data methods are discussed in Kennickell (1997), Abowd and Woodcock (2001), Liu and Little (2002), and Reiter (2003, 2005c).
One big advantage of synthetic data is the ease with which the data can be analyzed. In classic multiple imputation, each imputed data set is analyzed using a complete data technique and the inferences are combined using the appropriate combining rules (Rubin, 1987). Analysis of synthetic data sets is performed in a similar fashion. Each synthetic data set is analyzed using a complete data technique and inferences are combined using the appropriate combining rules, which are slightly different from those of classic multiple imputation. The combining rules for fully and partially synthetic data are set forth in Raghunathan et al. (2003) and Reiter (2003), respectively.
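As a rough illustration of what such combining rules look like for a scalar estimand, the sketch below averages the M point estimates and forms a variance estimate from the between-data-set variance b and the average within-data-set variance ū. The variance formulas used, (1 + 1/M) b - ū for fully synthetic data and ū + b/M for partially synthetic data, are stated here from memory of the cited papers and should be checked against Raghunathan et al. (2003) and Reiter (2003) before use; the function name `combine_synthetic` is hypothetical.

```python
import statistics


def combine_synthetic(estimates, variances, fully_synthetic):
    """Combine point estimates and variances from M synthetic data sets.

    estimates: the M point estimates, one per synthetic data set.
    variances: the M estimated variances of those point estimates.
    Returns (combined estimate, estimated variance of that estimate).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # combined point estimate
    b = statistics.variance(estimates)       # between-data-set variance (divisor m - 1)
    u_bar = statistics.fmean(variances)      # average within-data-set variance
    if fully_synthetic:
        t = (1 + 1 / m) * b - u_bar          # can be negative for small M
    else:
        t = u_bar + b / m
    return q_bar, t
```

The fully synthetic variance estimate can turn out negative when M is small, in which case some adjustment would be needed in practice.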
A hurdle that must be overcome in dealing with synthetic data is convincing researchers that analyzing data that is not "real" has merit. Evidence demonstrating the usefulness of synthetic data is presented in Raghunathan et al. (2003) and Reiter (2005b), where they show that if the imputation model specified is accurate, many resulting analyses based on the synthetic data set will be virtually identical to those based on the actual data. However, if the model for imputation is incorrect or inaccurate, the resulting analysis from the synthetic data will yield parameter estimates that are much different from those estimated from the actual data (Reiter, 2005b; Matthews et al., 2010b). As such, synthetic data sets are only as good as the models used for imputation.
Other selected privacy preserving methods

Slicing, micro-aggregates, and recombination
Paass (1988) suggested slicing, micro-aggregates, and recombination as methods of controlling statistical disclosures. The first method involves taking a set of complete records and slicing them into groups, each of which would have a smaller number of variables. Then each slice is released separately. Micro-aggregation involves creating new records by averaging at least three original records. The third proposal is what Paass (1988) refers to as recombination. This involves dividing each record into sub-records consisting of several variables each. Then the sub-records are recombined across different individuals to create synthetic records. In order to retain the original relationships between the variables, recombinations are done in such a way that "...only those subrecords whose underlying complete records were of similar structure were recombined" (Paass, 1988, page 493).
Slicing separates the variables into smaller groups, rather than releasing all of them at once. This method maintains suitable levels of confidentiality but lacks utility for more complicated analyses. Micro-aggregation was found to be unsuitable, as the resulting utility of the data was substantially diminished by this disclosure avoidance technique.
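The averaging step behind micro-aggregation can be sketched for a single numeric variable as follows. Grouping records by sorted value is an assumption made here for illustration (the text does not say how records are grouped), and the function name `microaggregate` is hypothetical.

```python
def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k records.

    Records are sorted by value and cut into consecutive groups of size k;
    a final group smaller than k is merged into the preceding group.
    """
    if len(values) < k:
        raise ValueError("need at least k records")
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    out = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out
```

The overall total is preserved, but an extreme record pulls its whole group's average toward it, which gives some intuition for the loss of utility reported for this technique.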
The use of recombination is shown to be the best of the three methods evaluated. This method consists of decomposing each observation into 8 sub-records of between 5 and 15 variables; the sub-records are then combined with other sub-records using statistical matching until all of the sub-records have been recombined. Paass (1988) showed that this method is safe against usual disclosure attempts using additional information and that many common analyses provide suitable results.

Location data
Krumm (2007) looked at inference attacks on location data collected from commuters' global positioning system (GPS) devices. The study showed that reasonable inferences can be made as to where a commuter lived, based solely on the data, creating a possible breach of privacy. Several possible methods for increasing privacy are offered, including spatial cloaking, addition of noise, and rounding. Spatial cloaking involves suppressing all of the points within a circle surrounding the house of a commuter. However, the center of the circle is randomly chosen within some bounds near the house, because simply centering the circle exactly at the location of the house would make it very easy for an intruder to elicit the exact location of interest. The addition of noise involves adding 2-dimensional noise to each point, obscuring the exact location of the commuter. The third method involves rounding each point to the nearest point of a grid. The coarseness of the grid can be adjusted to add more or less privacy. Armstrong et al. (1999) also discussed protecting geographic data released to the public.
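Two of the location-masking ideas attributed to Krumm (2007), rounding points to a grid and adding 2-dimensional noise, can be sketched as below. Coordinates are treated as plain numbers in arbitrary units (for example, degrees); the function names, the Gaussian noise model, and the parameter names `cell` and `sigma` are assumptions for illustration rather than details taken from that paper.

```python
import random


def snap_to_grid(lat, lon, cell):
    """Round a point to the nearest node of a square grid with spacing cell."""
    return (round(lat / cell) * cell, round(lon / cell) * cell)


def add_noise(lat, lon, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each coordinate."""
    return (lat + rng.gauss(0.0, sigma), lon + rng.gauss(0.0, sigma))
```

A coarser grid (larger `cell`) or a larger `sigma` hides the true location more thoroughly at the cost of spatial precision.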

Scrub system, Datafly, Argus, and SUDA2
Sweeney (1996) proposed the Scrub system. This algorithm scans through personal medical records to locate information which could be used to identify the owner of the records. Words or phrases which would put the owner of the record at risk are identified and replaced with a "pseudo-value". However, even after locating and replacing these identifying words, anonymity still cannot be guaranteed. Sweeney (1996, page 5) noted "Even then however, we still cannot scrub implicit information where an overall sequence of events whose preponderance of details identify a particular individual. This is often the case in mental health data and discharge notes".
Argus (De Waal et al., 1995; Hundepool et al., 2005) is a software package for limiting the risk of statistical disclosure. The goal of Argus is to limit the occurrence of rare combinations of identifying variables, thus lowering the risk of disclosure. This is achieved using global recoding and local suppression. Global recoding involves combining several categories of a variable into one. For instance, rather than releasing the city or town that someone lives in, individuals within the data could be grouped into county or even state. This makes it more difficult to identify an individual in the data. Following global recoding, suppression is used to remove combinations of identifying variables that still appear in rare combinations.
Sweeney (1997) proposed the Datafly system. This system processes specific queries made to a database. The query results are then returned, subject to a specified level of privacy between 0 and 1. A specified level of 0 would return the raw data from the query, and at level 1, the data would be generalized as much as possible. Privacy is achieved in two ways: (1) data are returned in bins rather than in raw form, and (2) data which fit into a bin with too few observations are simply not returned in the query results. These two steps are essentially global recoding and local suppression as specified by Argus.
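The two-step idea of global recoding followed by local suppression can be sketched as follows. This is not the Argus or Datafly implementation: the function name `recode_and_suppress`, the dictionary record layout, and the choice to blank out every identifying variable of a rare record (rather than as few values as possible) are simplifying assumptions.

```python
from collections import Counter


def recode_and_suppress(records, key_vars, recode, min_count):
    """Coarsen identifying variables, then suppress rare combinations.

    recode maps a variable name to a function that coarsens its value.
    After recoding, any record whose combination of key_vars occurs
    fewer than min_count times has those variables set to None.
    """
    recoded = [{**r, **{v: recode[v](r[v]) for v in recode}} for r in records]
    counts = Counter(tuple(r[v] for v in key_vars) for r in recoded)
    released = []
    for r in recoded:
        if counts[tuple(r[v] for v in key_vars)] < min_count:
            r = {**r, **{v: None for v in key_vars}}
        released.append(r)
    return released
```

For example, towns could be recoded to counties and ages to ten-year bands, after which any remaining combination shared by fewer than `min_count` records is suppressed.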
The algorithm SUDA2 (Special Unique Detection Algorithm) (Manning et al., 2008) is another useful piece of software for statistical disclosure control. This algorithm searches a data set for unique observations. One benefit the authors note is that SUDA2 allows "significantly more columns to be addressed." This is important, as the number of potential variables in data sets can be quite large.

Micro-agglomeration, Substitution, Subsampling, and Calibration (MASSC)
Micro-agglomeration, Substitution, Subsampling, and Calibration (MASSC) (Singh et al., 2003) is a combination of several individual statistical disclosure techniques. The procedure proceeds in four basic steps: micro-agglomeration, substitution, subsampling, and calibration. Micro-agglomeration refers to placing records into groups depending on the level of assessed risk. Identifying variables are broken into two categories, core and non-core variables. Core identifying variables are variables that will be easily available to an intruder, while non-core identifying variables will be less readily available. Records at the highest risk level are records that are unique in terms of core variables, while the lowest level of risk includes records which are not unique in terms of both types of variables, core and non-core. Once records have been grouped into risk categories, disclosure control techniques are applied. First, substitution techniques are used to perturb the data. Substitution refers to many disclosure control techniques including recoding, random rounding, addition of random noise, data swapping, and imputation (synthetic data). Following this step, a subsampling step is applied to add further protection to the data. Finally, the released data are calibrated such that specific estimates based on the released data match the estimates based on the original data. The authors note that this "helps reduce the bias caused by substitution" (Singh et al., 2003, Page 9). One very desirable property of MASSC is that both disclosure risk and information loss can be controlled simultaneously.

Assessing privacy
In order to assess privacy, disclosure must be defined, since different definitions of disclosure will lead to different definitions of privacy. Willenborg and de Waal (2001) categorized disclosure risk into two main categories, namely, the risk of re-identification and the risk of predictive disclosures. A re-identification occurs when one is able to accurately identify an individual in the released data, whereas a predictive disclosure occurs when the value of some unknown sensitive attribute can be estimated with reasonable accuracy. Duncan and Lambert (1989) defined four types of disclosure: identity disclosure, attribute disclosure, inferential disclosure, and population disclosure. Identity disclosure is, as before, being able to accurately identify an individual in the released microdata. Attribute disclosure occurs when an intruder is able to obtain "reliable information about an individual as the result of linking". Inferential disclosure occurs when a consumer of the released data is able to infer new information about an individual even without linking to a specific observation. Finally, population, or model, disclosure occurs if confidential information about a population can be inferred through the construction of a model based on the released microdata.
Regardless of how privacy is defined, any organization which plans on implementing disclosure control techniques in order to privately release data to the public needs to be aware of the trade-off between privacy and data utility. It is always possible to increase the privacy of any specific data release, but this almost assuredly comes with a loss of data utility. Therefore, privacy cannot be assessed by itself; it must always be measured in conjunction with the utility of the data after privacy preserving techniques have been applied.
In this section, we will first discuss measures of privacy based on the threat of re-identification and attribute disclosure. This is followed by a review of procedures for assessing privacy based on inferential disclosures and population disclosure.

Re-identification measures
Spruill (1982) discussed the confidentiality of several methods of protecting public release microdata files. First, a subset of the data was chosen and then a masking procedure was applied. The masking procedures tested included addition of normal random error, grouping, random rounding, and data swapping. The proposed measure of confidentiality is based on how many records in the released data can be linked to their respective records in the unmasked data. The measure is calculated as follows. An element of the data is selected and masked. The masked element is then compared to all elements in the unmasked data, and the unmasked element which minimizes the sum of absolute deviations or squared errors is selected. If the selected element from the unmasked data is the same as the element that generated the chosen masked element, then a link is said to have been made. The measure of confidentiality is simply the percentage of the elements for which a link cannot be made. Spruill (1983) presents a demonstration of the privacy measure using real data.
Paass (1988) assessed privacy based on the number of matches that can be made between some additional information and the released data. A match occurs, for their purposes, when a record in the additional information matches or nearly matches a record in the released data. Along with this, Paass (1988) additionally required that, with some large probability, the matched record from the additional information does not belong to another element of the released data. As such, they suggest framing the problem as a discriminant analysis for linking records, and the proposed measure of privacy is the percentage of records which were threatened by identification. They review slicing, micro-aggregates, and recombinations while noting that the addition of random noise does little to protect confidentiality in the framework.
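The linkage-based measure of Spruill (1982) described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: records are numeric tuples, `masked[i]` is assumed to have been generated from `original[i]`, ties are broken in favor of the first nearest record, and the function names and toy data are hypothetical.

```python
def link_distance(a, b, squared=False):
    """Sum of absolute deviations (or squared errors) between two records."""
    if squared:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b))


def percent_unlinked(original, masked, squared=False):
    """Percentage of masked records that do NOT link back to their source.

    Each masked record is linked to the nearest unmasked record; a link is
    made when that nearest record is the one the masked record came from.
    """
    unlinked = 0
    for i, m in enumerate(masked):
        nearest = min(
            range(len(original)),
            key=lambda j: link_distance(m, original[j], squared),
        )
        if nearest != i:
            unlinked += 1
    return 100.0 * unlinked / len(masked)


# Toy data: three records with two numeric variables each.
original = [(0, 0), (10, 10), (20, 20)]
lightly_masked = [(1, -1), (9, 11), (21, 19)]   # small additive noise
swapped = [(10, 10), (0, 0), (20, 20)]          # first two records swapped

print(percent_unlinked(original, lightly_masked))  # every record links back
print(percent_unlinked(original, swapped))         # two of three fail to link
```

Higher values indicate better protection under this measure: light noise leaves every record linkable, while swapping breaks the link for the swapped records.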
Paass (1988) concluded that microdata should be released with only a few variables, making it difficult or impossible for an intruder to link records. However, for data sets with a large number of variables, it becomes very easy to create a privacy breach. As such, they suggest that the only way to protect data with a large number of variables is through "massive modifications of the data" (Paass, 1988), which leads to reduced utility in analysis of the perturbed data, especially for more complicated statistical techniques.
Duncan and Lambert (1989) proposed a measure of privacy based on decision theory. They approached the problem from the point of view that an intruder is searching for a specific target record, which they refer to as $t_0$. After viewing the data, the possible values of the target are described by some predictive distribution, $p_y(s)$. Further, following their decision theoretic approach, they define a loss function, $L(t, s)$, where the action taken is choosing the target to be $t$ but, in fact, $s$ is the actual target. Therefore, the expected loss can be found by integrating over all possible values of $s$, namely, $\int L(t, s)\, p_y(s)\, ds$. The $t$ which minimizes expected loss is the best choice for the intruder. Along with this, they define the uncertainty of the intruder as $U(y) = \inf_t \int L(t, s)\, p_y(s)\, ds$, which is the minimum expected loss. The data are well protected when this quantity is large, indicating more uncertainty on the part of the intruder. Reiter (2005a) used the Duncan-Lambert framework to assess privacy of several disclosure control techniques under different assumptions of the knowledge of the intruder. Bethlehem et al.
(1990) proposed a measure of privacy that they refer to as the resolution of the "key", a set of variables used for identification. The measure is based on the uniqueness of elements in the population. Here, the term "key" is similar to the term "quasi-identifier" used in Dalenius (1986) and, later, in Sweeney (2002b). They defined the resolution as $R = \left( \sum_{i=1}^{K} \pi_i^2 \right)^{-1}$, where $\pi_i = F_i / N$, with $N$ being the size of the population and $F_i$ the number of elements in the population with key value $i$ for $i = 1, \ldots, K$.
Marsh et al. (1991) proposed a measure for quantifying privacy which uses uniqueness as a component along with other quantities. They are mainly interested in assessing privacy in a sample of anonymized records (SAR). They suggested that "One way to think of the real risk is as the total probability of an individual being identified from the SAR." As such, they set forth four conditions which must be met for a privacy breach to occur:
a. Key variables recorded identically in both data sets
b. Presence in the SAR
c. Population uniqueness
d. Verification of population uniqueness
Using these, they define a measure of privacy based on the conditional probability of an identification occurring given that an attempt at a privacy breach has taken place.
Skinner et al. (1994) proposed a measure of privacy based on not only population uniqueness (PU), but also sample uniqueness (SU). Namely, they proposed assessing privacy as Pr(PU | SU). Previously, the probability of a PU was used to assess privacy. However, that method misses the fact that the intruder will only have access to the released sample of the data. Therefore, the intruder can only possibly create a privacy breach for observations that are SU. So, a privacy breach occurs when a SU is verified to be PU. They outline three possible ways that a record could be verified as PU:
1. Population lists - If an individual had access to a population list of identifying features, individuals could easily be verified as unique.
2. Statistical inference - A statistical argument can be made that certain combinations of characteristics are extremely rare and are, with some high probability, population unique.
3. Figures in the public eye - Certain combinations of characteristics will exist for which the public will know all people with those characteristics. They offer, among others, the example of a police chief with 9 children and a Ph.D. This record is easily identifiable in the population. They also mention that easily recognized groups will be easy to identify, for example if occupation is listed as "US Senator". That population is easily verified.
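When the population frequencies of each key value are known, both the resolution of Bethlehem et al. (1990) and the proportion Pr(PU | SU) of Skinner et al. (1994) reduce to simple counting. The sketch below assumes key frequencies are stored in dictionaries; the function names and the toy counts are hypothetical and chosen only for illustration.

```python
def resolution(pop_counts):
    """Resolution R = 1 / sum_i (F_i / N)^2 of a key.

    `pop_counts` maps each key value to its population frequency F_i.
    """
    n = sum(pop_counts.values())
    return 1.0 / sum((f / n) ** 2 for f in pop_counts.values())


def pr_pu_given_su(pop_counts, sample_counts):
    """Share of sample-unique key values that are also population unique."""
    sample_uniques = [key for key, f in sample_counts.items() if f == 1]
    if not sample_uniques:
        return 0.0
    also_pop_unique = [key for key in sample_uniques if pop_counts[key] == 1]
    return len(also_pop_unique) / len(sample_uniques)


# Toy population of N = 10 with three key values, and a sample drawn from it.
pop_counts = {"A": 1, "B": 4, "C": 5}
sample_counts = {"A": 1, "B": 1, "C": 2}

print(resolution(pop_counts))                     # about 2.38
print(pr_pu_given_su(pop_counts, sample_counts))  # 0.5
```

As a sanity check on the definition, a key whose $K$ values are all equally common has resolution exactly $K$; uneven frequencies, as in the toy data, pull the resolution below the number of key values.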
They note that statistical inference is the most likely way to verify population uniqueness, as the census has a good deal of control in preventing the other two. The authors go on to demonstrate how the Poisson-Gamma model of Bethlehem et al. (1990) would be used in this situation for estimating the probability of population uniqueness.
Skinner and Elliot (2002) discussed and reviewed two previous measures of privacy, including the Skinner et al. (1994) proposal. The paper then goes on to propose another measure based on Elliot (2000). The first measure of privacy is Pr(PU), and the second is the ratio of the number of records which are both PU and SU to the number of records which are SU. The new measure of privacy, which Skinner and Elliot (2002) refer to as $\Theta$, is $\Theta = \sum_j I(f_j = 1) / \sum_j F_j I(f_j = 1)$, where $I(\cdot)$ is the indicator function and $f_j$ and $F_j$ are, respectively, the frequency of the $j$-th combination of identifying features in the sample and the frequency of the $j$-th combination of identifying features in the population. This measure can be thought of as the probability of a correct match given a unique match. Skinner and Elliot (2002) go on to review these three measures. They argue that the first measure, Pr(PU), which is the probability that an observation is PU, is overly optimistic. They then consider the second measure, Pr(PU | SU), which measures the probability that an observation is PU given that it is SU. They argue that their proposed measure, $\Theta$, is an improvement over these measures.
Skinner and Shlomo (2008) investigated privacy measures which are based on the risk of re-identification. The measures they were interested in involve $f_j$ and $F_j$ which are, as before, respectively, the frequency of the $j$-th combination of identifying features in the sample and the frequency of the $j$-th combination of identifying features in the population. In a practical setting, since the $f_j$ are observed but the $F_j$ are not, the $F_j$ must be estimated based on the observed $f_j$. The $F_j$ can be estimated using log-linear models. Thus, they break their general approach down into three steps:
1. Specify the 'key' variables.
2. Select one or more log-linear models which fit well according to the diagnostic criteria developed in Skinner and Shlomo (2008).
3. Use the well-fitting models to obtain risk estimates.
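For intuition about the measure of Skinner and Elliot (2002), the sketch below computes it in the idealized case where the population frequencies are known rather than estimated from a log-linear model. It assumes the measure is the number of sample-unique key combinations divided by the total population frequency of those same combinations, which is our reading of "the probability of a correct match given a unique match"; the function name and toy counts are hypothetical.

```python
def theta(pop_counts, sample_counts):
    """Probability of a correct match given a unique match.

    Computed as sum_j I(f_j = 1) / sum_j F_j * I(f_j = 1), where f_j and F_j
    are the sample and population frequencies of key combination j.
    """
    sample_uniques = [key for key, f in sample_counts.items() if f == 1]
    if not sample_uniques:
        return 0.0
    return len(sample_uniques) / sum(pop_counts[key] for key in sample_uniques)


# Toy frequencies: key "A" is population unique, "B" is shared by four people.
pop_counts = {"A": 1, "B": 4, "C": 5}
sample_counts = {"A": 1, "B": 1, "C": 2}

print(theta(pop_counts, sample_counts))  # 2 sample uniques / (1 + 4) = 0.4
```

In this toy example half of the sample uniques are population unique, yet a randomly chosen unique match is correct only 40% of the time, because a match on key "B" could be any of four people.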
The 'key' variables mentioned in step 1 are the same 'key' variables described in Bethlehem et al. (1990), that is, variables that could be used for identification.
Sweeney (2002b) described a measure of privacy called k-anonymity, similar to methods described in Dalenius (1986). This measure is based on what Dalenius (1986) terms a quasi-identifier, which is a set of attributes in a data set that could be used for matching with an external database. Dalenius (1986) mentioned two situations: the first when an individual has a unique set of identifiers and the second when only a small number of individuals have a specific set of identifiers. In the first case, it is easy to identify the individual. In the second, it would be possible to identify an individual through collusion. For example, if k individuals share a specific set of identifiers, k − 1 of them could get together and accurately identify the remaining individual.
Sweeney (2002b) offered a real world example of a privacy breach using publicly available voting records and de-identified insurance data. Based on the unique combination of attributes from the voting records, she accurately identified the released insurance data of former Massachusetts governor William Weld. This matching was easy, as the ex-governor's combination of attributes was unique in the population. Therefore, to improve privacy, a table with multiple observations for all observed combinations of quasi-identifiers is desirable.
Thus, k-anonymity is achieved in a table if each combination of values of the quasi-identifier for that table appears at least k times in the table. Examples are shown in Tables 2A and 2B, which achieve 3-anonymity with quasi-identifier gender and race. Sweeney (2002a) described a procedure for achieving this level of security via generalization and suppression. Generalization is achieved by grouping a possibly identifying value into a broader category. For example, rather than report the town or city someone lives in, a more general grouping can be achieved by reporting only the state of residence. Suppression is simply not releasing a sensitive value. By using both generalization and suppression, k-anonymity can be achieved for any data set. Two algorithms, Datafly (Sweeney, 1997) and µ-Argus (Hundepool and Willenborg, 1996, Hundepool et al., Feb. 2005), can be used to anonymize data using this approach; however, Sweeney (2002a, Page 12) notes that "Datafly can over distort data and µ-Argus can additionally fail to provide adequate protection."
Machanavajjhala et al. (2007) described the shortcomings of k-anonymity by explaining how privacy can still be breached even when k-anonymity is achieved.
Two types of attacks on privacy are mentioned. The first is an attack when the sensitive attributes lack diversity. Table 2B achieves 3-anonymity, but here all white males suffer from prostate cancer. Thus a disclosure has taken place because one can now infer that each specific white male in the database has prostate cancer. Note that an identity disclosure has not taken place here. Our target may be a white male and our goal as an intruder is to find out what type of cancer they have. While we are unable to match to a particular record in the database, we can still discover that our target must have prostate cancer (provided we know our target is in the database).
Another type of attack discussed in Machanavajjhala et al. (2007) is one in which the intruder has background information. This could occur if, for example, he or she knew the identity of the white female in this database with lung cancer. Using the data in Table 2B together with this background information allows one to conclude that each of the other white females in this database must have breast cancer. Again, we cannot match the records of the white females with breast cancer to specific individuals, but the sensitive attribute is still disclosed.
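Both the k-anonymity condition and the two attacks just described are easy to check mechanically. The sketch below uses a small made-up table that merely mimics the situation described in the text (it is not a copy of Table 2B), with gender and race as the quasi-identifier and diagnosis as the sensitive attribute; all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical released table, constructed only to mirror the text's example.
ROWS = [
    {"gender": "male", "race": "white", "diagnosis": "prostate cancer"},
    {"gender": "male", "race": "white", "diagnosis": "prostate cancer"},
    {"gender": "male", "race": "white", "diagnosis": "prostate cancer"},
    {"gender": "female", "race": "white", "diagnosis": "lung cancer"},
    {"gender": "female", "race": "white", "diagnosis": "breast cancer"},
    {"gender": "female", "race": "white", "diagnosis": "breast cancer"},
]
QI = ("gender", "race")


def equivalence_classes(rows, quasi_identifier):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifier)].append(row)
    return dict(classes)


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = equivalence_classes(rows, quasi_identifier).values()
    return all(len(group) >= k for group in groups)


def homogeneous_classes(rows, quasi_identifier, sensitive):
    """Classes whose members all share one sensitive value (no diversity)."""
    return {
        key: group[0][sensitive]
        for key, group in equivalence_classes(rows, quasi_identifier).items()
        if len({r[sensitive] for r in group}) == 1
    }


def background_attack(rows, quasi_identifier, sensitive, known_row):
    """Remove a record the intruder can already identify, then look again."""
    remaining = list(rows)
    remaining.remove(known_row)
    return homogeneous_classes(remaining, quasi_identifier, sensitive)


print(is_k_anonymous(ROWS, QI, 3))                         # table is 3-anonymous
print(homogeneous_classes(ROWS, QI, "diagnosis"))          # white males exposed
print(background_attack(ROWS, QI, "diagnosis", ROWS[3]))   # white females too
```

The table passes the 3-anonymity check, yet the second and third calls show that the sensitive attribute is still disclosed, first for the class with no diversity and then, once one record is known, for the remaining white females.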
In response to the problems presented by k-anonymity, Machanavajjhala et al. (2007) proposed l-diversity. l-diversity ensures that, within each equivalence class, the values of the sensitive attributes are all "well represented". Tables 3A and 3B are both 3-diverse tables.
Li et al. (2007) discussed the weaknesses of l-diversity and proposed t-closeness as an alternative. Li et al. (2007) mention two types of attacks that are possible even when l-diversity is achieved. The first, a skewness attack, occurs when the distribution of the sensitive attributes within an equivalence class differs significantly from that of the overall population. For example, some disease may be rare in the population while its prevalence within an equivalence class is much higher. In this scenario, an intruder has gained some sensitive information about a group of people. The second type of attack is a similarity attack. This occurs when sensitive attributes are technically different, but similar in nature. Li et al. (2007) offered an example where an intruder is able to infer that an individual has that there is a database which gives someone information about the average heights of women of different nationalities. If an intruder has access to this database and the information that "Terry Gross is two inches shorter than the average Lithuanian woman", they now know the exact height of Terry Gross.
Note that in the last two examples, the target does not need to be part of the released information for a privacy breach to occur. This is quite a startling result. Not only does one need to protect the privacy of the individuals in the released microdata, one may also need to make privacy considerations for individuals who are not in the released database!

3.2.1. Knowledge, Knowledge Gain, and Relative Knowledge Gain
Duncan and Lambert (1986, Page 13) proposed three types of measures of disclosure risk, all based on the following two principles:
1. The complete state of a user's uncertainty about a target before and after data release is specified by the user's prior and posterior predictive distributions, respectively.
2. The user's uncertainty about a target can be summarized by applying a nonnegative concave function.
Let $U(\cdot)$ be an uncertainty function, with larger values of $U$ indicating more uncertainty (DeGroot, 1962, 1970). Privacy can then be measured based on the difference between the uncertainty in the prior predictive distribution, $U(\text{prior})$, and the uncertainty in the updated posterior predictive distribution, $U(\text{posterior})$. Thus, Duncan and Lambert (1986) defined knowledge in terms of $U(\text{posterior})$, knowledge gain as the reduction in uncertainty $U(\text{prior}) - U(\text{posterior})$, and relative knowledge gain as $\frac{U(\text{prior}) - U(\text{posterior})}{U(\text{prior})}$. Using one of these measures, data would not be released if the measure indicated that a user's knowledge would exceed a pre-specified limit.
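As a rough numerical illustration of these three quantities, the sketch below uses Shannon entropy as the uncertainty function, which is one nonnegative concave choice and not necessarily the one used by Duncan and Lambert (1986). The gain is computed as the reduction in uncertainty, U(prior) − U(posterior), so that positive values mean the user has learned something; this sign convention, the function names, and the example distributions are all assumptions of the sketch.

```python
import math


def entropy(dist):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def knowledge_measures(prior, posterior, uncertainty=entropy):
    """Posterior uncertainty, knowledge gain, and relative knowledge gain.

    Assumes the prior uncertainty is strictly positive, so that the
    relative gain is well defined.
    """
    u_prior = uncertainty(prior)
    u_posterior = uncertainty(posterior)
    gain = u_prior - u_posterior  # reduction in uncertainty
    return {
        "knowledge": u_posterior,
        "gain": gain,
        "relative_gain": gain / u_prior,
    }


# The user initially considers four target values equally likely; after the
# release, one value becomes much more plausible than the others.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.7, 0.1, 0.1, 0.1]

for name, value in knowledge_measures(prior, posterior).items():
    print(name, round(value, 3))
```

An agency following this approach would compare one of the printed values to its chosen limit; a release that leaves the posterior equal to the prior yields a gain of zero, while a release that pins the target down exactly yields a relative gain of one.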
One drawback to this method is that a data releasing agency must specify the prior beliefs of a potential consumer of the data. It is often impossible to know exactly what data a potential user has, complicating the specification of a user's prior. An agency will be able to more accurately define the prior beliefs of a data user if it has some idea of what data sets are already available to a data consumer. Therefore, it may be useful for releasing agencies to keep records of what other data are readily available to a potential consumer of the data.
Finally, Duncan and Lambert (1986, Page 17) noted that "Although specification of uncertainty functions and disclosure limits may appear arbitrary and difficult to justify, it is also difficult to justify rigorously ad hoc rules for releasing data." This statement emphasizes the need for an easily interpretable, reasoned metric for assessing privacy, with which a set of guidelines could be produced to direct data releasing agencies in the proper manner in which to release data to the public while maintaining sufficient levels of privacy.
privacy. This occurs when the randomized release function based on $D$ does not change very much when we base the same function on $D'$. Thus, we can define risk as the maximum value of the AUC over all possible neighboring databases $D$ and $D'$: $\text{Risk} = \max_{D, D'} \text{AUC}$. Similarly, we can define $\text{Privacy} = 1 - \text{Risk}$, which ranges from 0, low privacy, to 0.5, high privacy.

Model Disclosure
Palley and Simonoff (1987) demonstrated how privacy in a database can be breached by using regression to infer the values of confidential attributes, relying on data from simple queries to the database. They used $R^2$ as a measure of their predictive accuracy. Thus $1 - R^2$ can be viewed as the level of privacy. However, simply because one confidential attribute is private does not mean that all confidential attributes are private. This is discussed in Tendick (1991).
Sarathy and Muralidhar (2002b) proposed the use of canonical correlation analysis as a technique for assessing privacy. Canonical correlation analysis is used, in general, to find and quantify relationships between two groups of variables. In a privacy setting, one is interested in the relationship between the variables with no privacy restrictions and the variables that are confidential, so canonical correlation analysis lends itself naturally to the assessment of privacy.
Rajasekaran et al. (2009) proposed a measure of privacy based on the idea that observations far from the mean will be easily identified. However, rather than measuring an observation's distance from the overall mean, the data are first clustered, and observations which are far away from the center of their respective cluster are considered to be at high risk for disclosure. Further, Rajasekaran et al. (2009) also suggest that observations which fall into clusters with too few observations should be suppressed, as they are at high risk of disclosure.
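The regression-based view of Palley and Simonoff (1987) can be illustrated with a deliberately simple case: a single non-confidential predictor and a single confidential attribute, fitted by ordinary least squares. This is only a sketch of the idea that $1 - R^2$ acts as a privacy level; the variable names and numbers are hypothetical, and the original work builds its regression from query results rather than from raw records.

```python
def r_squared(x, y):
    """R^2 of the simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy


def privacy_level(x, y):
    """1 - R^2: near 0 when the confidential value is well predicted."""
    return 1.0 - r_squared(x, y)


# Hypothetical data: years employed (public) and salary in thousands
# (confidential). The strong linear relationship leaves little privacy.
years_employed = [1, 2, 3, 4, 5, 6]
salary = [41.0, 44.5, 47.0, 51.5, 54.0, 58.5]

print(round(privacy_level(years_employed, salary), 3))
```

A value close to 0 means the confidential attribute is almost fully recoverable from the public one, while a value close to 1 means this particular regression reveals very little; as noted above, that says nothing about other confidential attributes.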

Summary
Statistical disclosure limitation is a very broad topic. However, many areas of research depend on data that can only be used if privacy is maintained, thus highlighting the importance of disclosure limitation. Perfect privacy, never releasing any confidential data, and perfect utility, releasing all confidential data, both present significant problems. Therefore, some balance must be struck between these competing goals.
This paper offers a summary of methods which have been proposed to maintain the privacy of the individual in public release data sets, as well as a review of proposed techniques for assessing the privacy of some of the privacy preserving methods.
Clearly, there will be some trade-off between the amount of privacy ensured and the utility of the released data. Many privacy preserving techniques work by perturbing the data to be released, resulting in potentially less useful data. Karr et al. (2006) formally discussed the risk-utility trade-off for several statistical disclosure limitation techniques including addition of noise, rank swapping, microaggregation, and resampling. Other references which discuss this trade-off include Domingo-Ferrer and Torra (2001b,a) and Kaufman et al. (2005).
While many methods of preserving privacy have been proposed, there are not, as of yet, any formal guidelines for many data releasing institutions to follow when releasing data to the public (although attempts have been made (United Nations Economic Commission for Europe (UNECE), 2007, Hundepool et al., 2006, Federal Committee on Statistical Methodology (FCSM), 2005)). Most data releases are deemed to be "private enough" with no formal assessment of the privacy ensured. Eventually, legal guidelines will need to be set as to what constitutes adequate amounts of privacy. This accentuates the need for an easily interpretable measure of privacy that can be used by policy makers in drafting legislation pertaining to adequate privacy levels for public release data sets.
While this paper is by no means an exhaustive reference on statistical disclosure limitation, the hope is that this manuscript will provide an introduction to disclosure limitation techniques, as well as methods for measuring privacy.
The goal of statistical disclosure limitation efforts is to make sure that data are used for research rather than for malicious purposes, including the disclosure of individuals' private information. More work is needed in both areas, the development of statistical disclosure control techniques and the assessment of privacy, as this field becomes increasingly important in an age ruled by data.