Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	395/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 391 392 393 394 395 396 397 398 ... 423

1-Data Mining tarjima

.

AGE . .

.
.
. .^._{. .}^.
_.^. .
._.^{. .}_..
_.^.. .
. _. .
. ^.

SALARY

Figure 20.5: A sample 5-anonymous Mondrian multidimensional partitioning

of the entire data set to a single multidimensional region, and therefore trivially satisfies k-anonymity. The algorithm therefore starts by initializing B = {B}. The algorithm repeat-edly uses the following steps:

Select a rectangular region R ∈ B containing at least 2 · k data points, such that a valid split into a pair of k-anonymous subsets exists.

Split the rectangular region R along any of the dimensions with an axis-parallel split, so that each of R₁ and R₂ contains at least k data points.

Update B ⇐ B ∪ {R₁, R₂} − R

This iterative process is repeated, until the rectangular regions cannot be split any further without violating k-anonymity. There is some flexibility in the choice of the dimension for performing the split. A natural heuristic is to split the longest dimension of the selected rectangular region. After the dimension has been selected, the split should be performed so that the data points are partitioned as evenly as possible. In the absence of ties on attribute values, the data points can be divided almost equally into the two regions.

The rectangular regions in B define the equivalence classes that are utilized for k-anonymization. If each numeric attribute value is unique, it can be shown that every region will contain at most 2 · k − 1 data points. However, if there are ties among attribute val-ues, and tied values need to be assigned to be the same partition, then an upper bound of m + 2d · (k − 1) can be shown on the number of data points in each partition. Here m is the number of identical copies of any data record. On the other hand, if ties on an attribute value can be flexibly assigned to any partition, then the maximum number of points in any rectangular partition, at the end of the process will be 2 · k − 1. The reader is referred to the bibliographic notes for the pointer to the proof of this bound. After the data has been divided into rectangular regions, the following approaches can be used for reporting the anonymized data points:

The averages along each dimension may be reported for each anonymized equivalence set.

The multidimensional bounding box of the data points may be reported.

The Mondrian algorithm has been shown to be more eﬀective than the Incognito algorithm, because of the greater flexibility provided by the multidimensional approach to partitioning.

680 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

The Mondrian approach is naturally designed for numeric attributes with an ordering on the values. However, the approach can also be generalized to categorical attributes by designing appropriate split rules for the attributes.

20.3.1.4 Synthetic Data Generation: Condensation-Based Approach

The condensation-based approach generates synthetic data that matches the original data distribution, while maintaining k-anonymity. This means that k synthetic records are gen-erated for each group of k records, by using the statistics of that group. The overall con-densation approach may be described as follows:

Use any clustering approach to partition the data into groups of data records, such that each group contains at least k data records. Denote the number of created groups by m.

Compute the mean and covariance matrix for each group of data records. For a d-dimensional data set, the covariance matrix of a group represents the d×d covariances between pairs of attributes.

Compute the eigenvectors and eigenvalues of each covariance matrix. It is evident from the discussion of Principal Component Analysis (PCA) in Chap. 2, that the eigenvec-tors define a group-specific axis system, along which the data records are uncorrelated. The variance of the data along each eigenvector is equal to the corresponding eigen-value. The synthetic data set to be generated, is modeled as mixture of m clusters, where the mean of each cluster is the mean of the corresponding group of original data records.

Generate synthetic data records for each of the m clusters. For each cluster, the number and mean of the synthetic records matches its base group. Data records are gener-ated independently along the eigenvectors, with variance equal to the corresponding eigenvalues. The uniform distribution is typically used for synthetic data generation, because it is assumed that the data distribution does not change significantly within the small locality defined by a group. While the uniform distribution is a local approx-imation, the global distribution of the generated records generally matches the original data quite well.

The approach can also be generalized to data streams, by maintaining group statistics incrementally. The idea here is that group sizes are allowed to vary between k and 2 · k − 1. Whenever a group reaches the size of 2 · k, they are split into two groups. The details of group splitting will be discussed later.

To maintain the covariance statistics incrementally in the streaming scenario, an approach similar to the cluster-feature vector of CluStream (see Chap. 7) is used. The only diﬀerence is that the product-wise sum statistics are also maintained incrementally. For any pair of attributes i and j, the value of Sum(i, j) is equal to sum of the product of attribute values i and j over the diﬀerent data points. This can be easily maintained incrementally in a data stream. Then, for a set of r ∈ (k, 2 · k − 1) data points in a group, the covariance between attributes i and j may be estimated as follows:

Covariance(i, j) = Sum(i, j)/r − Mean(i) · Mean(j)	(20.6)
Covariance(i, j) = Sum(i, j)/r − Sum(i) · Sum(j)/r²	(20.7)

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 391 392 393 394 395 396 397 398 ... 423