Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	390/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 386 387 388 389 390 391 392 393 ... 423

1-Data Mining tarjima

Distribution reconstruction: The aggregate distribution of the original data are recon-structed, by “subtracting out” the noise. Thus, at the end of this step, we will have a histogram representing the approximate probability distribution of the data values.

Data mining: Data mining methods are applied to the reconstructed distributions.

It is important to note that the last step of the process requires the design of data mining algorithms that work with probability distributions of sets of data records, rather than individual records. Thus, one disadvantage of this approach is that it requires the redesign of data mining algorithms. Nevertheless, the approach can be made to work because many data mining problems such as clustering and classification require only the probability distribution modeling of either the whole data set, or segments (e.g., diﬀerent classes) of the data.

20.2.1 Reconstructing Aggregate Distributions

The reconstruction of the aggregate distribution of the original data is the key step in the randomization method. Consider the case where the original data values x₁ . . . x_n are drawn from the probability distribution X. For each original data value x_i, a perturbation y_i is added by the software data collection tool, to yield the perturbed value z_i. The perturbation y_i is drawn from the probability distribution Y, and is independent of X. It is assumed that this distribution is known publicly. Furthermore, the probability distribution of the final set of perturbed values is assumed to be Z . Therefore, the original distribution X, the added perturbation Y and the final aggregate distribution Z are related as follows:

Z=X+Y

X=Z−Y

Thus, the probability distribution of X can be reconstructed, if the distributions of Y and Z are known explicitly. The probability distribution of Y is assumed to be publicly available, while discrete samples of Z are available in terms of z₁ . . . z_n. These discrete samples are suﬃcient to reconstruct Z using a variety of methods, such as kernel density estimation. Then, the distribution for X can be reconstructed using the relationship shown above. The main problem with this approach emerges when the probability distribution of the perturbation Y has a large variance and the number n of discrete samples of Z is small. In such a case, the distribution of Z also has a large variance, and it cannot be accurately estimated with a small number of samples. Therefore, a second approach is to directly estimate the distribution of X from the discrete samples of Z and the known distribution of Y.

Let fX and FX be the probability density and cumulative distributions functions of X.

ˆ ˆ
These functions need to be estimated with the observed values z₁ . . . z_n. Let f_X and F_X be
the corresponding estimated probability density and cumulative distribution functions for X. The key here is to use the Bayes formula with the use of observed values for Z. Consider a simplified scenario in which only a single observed value z₁ is available. This can be used

666 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

to estimate the cumulative distribution function

X = a. The Bayes theorem yields the following:

( ) at any value of the random variable

F_X a

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 386 387 388 389 390 391 392 393 ... 423