Distribution reconstruction: The aggregate distribution of the original data are recon-structed, by “subtracting out” the noise. Thus, at the end of this step, we will have a histogram representing the approximate probability distribution of the data values.
Data mining: Data mining methods are applied to the reconstructed distributions.
It is important to note that the last step of the process requires the design of data mining algorithms that work with probability distributions of sets of data records, rather than individual records. Thus, one disadvantage of this approach is that it requires the redesign of data mining algorithms. Nevertheless, the approach can be made to work because many data mining problems such as clustering and classification require only the probability distribution modeling of either the whole data set, or segments (e.g., different classes) of the data.
20.2.1 Reconstructing Aggregate Distributions
The reconstruction of the aggregate distribution of the original data is the key step in the randomization method. Consider the case where the original data values x1 . . . xn are drawn from the probability distribution X. For each original data value xi, a perturbation yi is added by the software data collection tool, to yield the perturbed value zi. The perturbation yi is drawn from the probability distribution Y, and is independent of X. It is assumed that this distribution is known publicly. Furthermore, the probability distribution of the final set of perturbed values is assumed to be Z . Therefore, the original distribution X, the added perturbation Y and the final aggregate distribution Z are related as follows:
Z=X+Y
X=Z−Y
Thus, the probability distribution of X can be reconstructed, if the distributions of Y and Z are known explicitly. The probability distribution of Y is assumed to be publicly available, while discrete samples of Z are available in terms of z1 . . . zn. These discrete samples are sufficient to reconstruct Z using a variety of methods, such as kernel density estimation. Then, the distribution for X can be reconstructed using the relationship shown above. The main problem with this approach emerges when the probability distribution of the perturbation Y has a large variance and the number n of discrete samples of Z is small. In such a case, the distribution of Z also has a large variance, and it cannot be accurately estimated with a small number of samples. Therefore, a second approach is to directly estimate the distribution of X from the discrete samples of Z and the known distribution of Y.
Let fX and FX be the probability density and cumulative distributions functions of X.
ˆ ˆ
These functions need to be estimated with the observed values z1 . . . zn. Let fX and FX be
the corresponding estimated probability density and cumulative distribution functions for X. The key here is to use the Bayes formula with the use of observed values for Z. Consider a simplified scenario in which only a single observed value z1 is available. This can be used
666 CHAPTER 20. PRIVACY-PRESERVING DATA MINING
to estimate the cumulative distribution function
X = a. The Bayes theorem yields the following:
FX a
Dostları ilə paylaş: |