Data Mining: The Textbook




20.3. PRIVACY-PRESERVING DATA PUBLISHING


All the statistics in the aforementioned equation are additive, and can easily be maintained incrementally in the stream setting.
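As a concrete illustration, the following sketch (not taken from the textbook; the class and attribute names are illustrative) maintains the additive statistics of a single condensed group, namely the record count, the first-order attribute sums, and the second-order sums of pairwise attribute products, and updates them incrementally as records arrive on the stream.

import numpy as np

class GroupStatistics:
    """Additive statistics of one condensed group in a data stream (illustrative sketch)."""

    def __init__(self, d):
        self.n = 0                      # number of records condensed into the group
        self.first = np.zeros(d)        # first-order sums of attribute values
        self.second = np.zeros((d, d))  # second-order sums of pairwise attribute products

    def add(self, x):
        # Absorbing a new d-dimensional record is an O(d^2) additive update,
        # so the original records never need to be revisited.
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.first += x
        self.second += np.outer(x, x)

    def mean(self):
        return self.first / self.n

    def covariance(self):
        # Covariance entries follow from the additive statistics (as in Eq. 20.6).
        mu = self.mean()
        return self.second / self.n - np.outer(mu, mu)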


It remains to be explained how the groups are split, once the group sizes reach 2 · k. It is assumed that each group of size 2 · k is split into two groups of size k along the longest eigenvector. The reason for choosing the longest eigenvector is to ensure the compactness of the newly created groups. The splitting of groups can be a challenge, because the original data records are not available in the streaming scenario to recalculate the statistics of each of the split groups. Therefore, an approximation (i.e., modeling assumption) is needed. The condensation approach works with the modeling assumption that the data records of a group are independently distributed along each eigenvector according to a uniform distribution. For group sizes that are much smaller than the number of points in the data set, this is not an unreasonable assumption. This is because density distributions do not change drastically over small regions of the data.


This modeling assumption of a uniform distribution is used to recalculate the new means of each of the child groups of equal size k. This is because, under the modeling assumption, the range of the uniform distribution along the longest eigenvector can be approximated from its variance (eigenvalue). Note that the variance of a uniform distribution is one twelfth of the square of its range. Therefore, if λmax is the largest eigenvalue, then the range R of the uniform distribution is computed as follows:





R = √(12 · λmax)      (20.8)

This range R is then split into two equal parts to create the two new group means. Thus, the two new group means are at a distance of R/4 from the old group mean in opposite directions along the longest eigenvector.
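A minimal sketch of this split computation follows; it is not the book's code, and it assumes that the group mean and covariance matrix have already been recovered from the additive statistics. The two child means are displaced by R/4 from the parent mean in opposite directions along the longest eigenvector.

import numpy as np

def split_means(mean, covariance):
    # np.linalg.eigh returns eigenvalues in ascending order, so the last
    # column of the eigenvector matrix is the longest eigenvector.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    lam_max = eigenvalues[-1]
    v_max = eigenvectors[:, -1]
    # Eq. 20.8: a uniform distribution with variance lam_max has range sqrt(12 * lam_max).
    R = np.sqrt(12.0 * lam_max)
    return mean + (R / 4.0) * v_max, mean - (R / 4.0) * v_max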


The newly created groups are assumed to have the same eigenvectors as the parent group, because the splitting is performed along an uncorrelated direction. Therefore, the directions of correlation are not assumed to change after splitting. The largest eigenvalue of the original (parent) group is replaced in each of the child groups by an eigenvalue that is one fourth of the original value, because splitting a uniform distribution into two equal halves reduces its variance by a factor of 4. Thus, if P is the d × d matrix with orthonormal columns containing the eigenvectors, and Σ is the diagonal matrix of eigenvalues (after adjustment of the largest eigenvalue), then the covariance matrix of the newly created split groups can be computed as follows:



C = P Σ P^T      (20.9)

This relationship is based on the standard PCA diagonalization discussed in Chap. 2. Note that the covariance matrices of both the split groups are the same. The covariance matrices and the newly generated group means can be used to back-calculate the sum of pairwise attribute products of each group according to Eq. 20.6. Thus, as more data points arrive, these product values can continue to be updated incrementally.
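The following sketch (again an illustration, not the textbook's code) shows how the statistics of one child group of size k could be rebuilt after the split: the largest eigenvalue is divided by four, the covariance matrix is reconstructed as in Eq. 20.9, and the first- and second-order sums are back-calculated so that incremental updates can resume.

import numpy as np

def child_group_statistics(child_mean, parent_covariance, k):
    eigenvalues, P = np.linalg.eigh(parent_covariance)   # columns of P are the eigenvectors
    eigenvalues = eigenvalues.copy()
    eigenvalues[-1] /= 4.0                                # splitting quarters the largest eigenvalue
    C = P @ np.diag(eigenvalues) @ P.T                    # Eq. 20.9: C = P Sigma P^T
    n = k                                                 # each child group holds k records
    first = n * child_mean                                # first-order sums
    second = n * (C + np.outer(child_mean, child_mean))   # second-order sums, inverting Eq. 20.6
    return n, first, second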


The condensation-based approach is one of the few methods that can be applied to data streams with a relatively low risk of disclosure, because of its use of synthetic data. It is often difficult for an adversary to know which group of k synthetic records was generated from a particular base group of original records. In the case of generalization-based anonymization, it is relatively easy to identify groups of related data records, representing equivalence classes. Thus, synthetic data sets provide some additional privacy protection. Note that it is possible to generate larger data sets using this approach if needed. For example, for each group of k records, one might generate α · k synthetic data records using the statistics of that group. This scales up the size of the data by a factor of α, and further obscures the mapping between the generated data and the original data. Furthermore, additional noise can be incorporated during synthetic data generation to ensure greater protection.
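One possible way to realize this generation step is sketched below. It is not prescribed by the text: the sketch samples each synthetic record independently along the group's eigenvectors from uniform distributions whose variances match the eigenvalues, which is consistent with the modeling assumption used for splitting, and the noise_scale parameter is a hypothetical knob for the additional noise mentioned above.

import numpy as np

def generate_synthetic(mean, covariance, k, alpha=1, noise_scale=0.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eigenvalues, P = np.linalg.eigh(covariance)
    n_records = alpha * k
    # Uniform coordinates along each eigenvector, with variance equal to the eigenvalue.
    half_ranges = np.sqrt(12.0 * np.clip(eigenvalues, 0.0, None)) / 2.0
    coords = rng.uniform(-half_ranges, half_ranges, size=(n_records, len(eigenvalues)))
    records = mean + coords @ P.T   # rotate back into the original attribute space
    if noise_scale > 0.0:
        records += rng.normal(scale=noise_scale, size=records.shape)  # optional extra perturbation
    return records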


These additional options do come at a price. The truthfulness of the published data is lost. The published data records are synthetic and therefore do not map onto any particular individual. In many aggregation- or modeling-based applications, this is not necessarily an issue, because the aggregate properties of the data are retained. In some medical data handling scenarios, legal restrictions may prohibit release of downgraded data, when there is a direct mapping between individuals and data records, even at a group level. The condensation approach provides a solution in some of these scenarios, because the released data records are synthetic, and are generally difficult to map onto specific groups.


The condensation approach shares a number of conceptual similarities with the Mondrian approach, except that it allows the use of any constrained clustering algorithm, rather than rectangular partitions constructed with single-dimensional cuts. The utility of the resulting anonymization depends on the effectiveness of the clustering. Single-dimensional cuts will not be able to construct high-quality clusters with increasing dimensionality. Furthermore, unlike Mondrian, synthetic data is generated to achieve greater anonymity.


The condensation approach does not distinguish between publicly available attributes (used in combination to construct quasi-identifiers) and sensitive attributes, and applies the approach to all the attributes. As will be evident from the subsequent discussion on the dimensionality curse in Sect. 20.3.4, the distinction between quasi-identifier and sensitive attributes is more fluid than is often assumed in the literature on data privacy. Because it is not possible to know the level of background knowledge available to adversaries about the sensitive attributes, all attributes should be perturbed. When the sensitive attributes are released without any perturbation, they become immediately available for identification attacks, as long as background knowledge is available. For example, a number of privacy attacks on data sets such as the Netflix data set [402] have been performed using attributes that would normally not have been considered publicly available. This work [402] also makes the argument that such strong distinctions between publicly available and sensitive attributes are dangerous to make in real-world settings, where the data and background knowledge available to the public continue to increase over time.


20.3.2 The ℓ-Diversity Model

While the k-anonymity model provides the basic framework for privacy-preserving data publishing, there are scenarios in which it can lead to inadvertent sensitive attribute disclosure. Consider the 3-anonymized table illustrated in Table 20.3. In this case, the row indices 1, 3, and 6 are in the same anonymized group, and cannot be distinguished from one another. However, all three individuals have the value of “HIV” on the sensitive attribute. Therefore, even though the identity of the specific individual from this group cannot be inferred, it can be inferred that any individual in this group has HIV. Therefore, if a voter registration roll is used to join this group to three unique individuals, then it can be inferred that all three of them have HIV. This represents a breach of sensitive attribute information about each of these three individuals. In other words, while the k-anonymity model prevents identity disclosure, it does not prevent the disclosure of sensitive attribute values.
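The failure illustrated above can be detected mechanically. The sketch below (illustrative, not from the textbook) checks the simplest form of the requirement motivated here, namely distinct ℓ-diversity: every anonymized group must contain at least ℓ distinct values of the sensitive attribute, so that a join with a voter registration roll cannot reveal the sensitive value of any individual. The data layout, a list of (group_id, sensitive_value) pairs, is an assumption made for the example.

from collections import defaultdict

def satisfies_distinct_l_diversity(rows, l):
    # rows: iterable of (group_id, sensitive_value) pairs from the anonymized table
    sensitive_by_group = defaultdict(set)
    for group_id, sensitive_value in rows:
        sensitive_by_group[group_id].add(sensitive_value)
    return all(len(values) >= l for values in sensitive_by_group.values())

# Hypothetical example: a group that is homogeneous in "HIV" (like rows 1, 3, and 6
# of Table 20.3) fails even 2-diversity.
rows = [("g1", "HIV"), ("g1", "HIV"), ("g1", "HIV"),
        ("g2", "HIV"), ("g2", "Hepatitis"), ("g2", "Flu")]
print(satisfies_distinct_l_diversity(rows, 2))  # False: group g1 has only one sensitive value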




