Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	402/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 398 399 400 401 402 403 404 405 ... 423

1-Data Mining tarjima

20.7. BIBLIOGRAPHIC NOTES

691

of sensitive information. Two such well known techniques are association rule hiding, and query auditing.

In distributed privacy, the goal is to allow adversaries or semi-adversaries to collaborate in the sharing of data, for global insights. The data may be vertically partitioned across columns, or horizontally partitioned across rows. Cryptographic protocols are typically used in order to achieve this goal. The most well-known among these is the oblivious transfer protocol. Typically, these protocols are used to implement primitive data mining operations, such as the dot product. These primitive operations are then leveraged in data mining algorithms.

20.7 Bibliographic Notes

The problem of privacy-preserving data mining has been studied extensively in the sta-tistical disclosure control and security community [1, 512]. Numerous methods, such as swapping [181], micro-aggregation [186], and suppression [179], have been proposed in the conventional statistical disclosure control literature.

The problem of privacy-preserving data mining was formally introduced in [60] to the broader data mining community. The work in [28] established models for quantification of privacy-preserving data mining algorithms. Surveys on privacy -preserving data mining may be found in [29]. The randomization method was generalized to other problems, such as association rule mining [200]. Multiplicative perturbations have also been shown to be very eﬀective in the context of privacy-preserving data mining [140]. Nevertheless, numerous attack methods have been designed for inferring the values of the perturbed data records [11, 367].

The k-anonymity model was proposed by Samarati [442]. The binary search algorithm is also discussed in this work. This paper also set up the basic framework for group-based anonymization, which was subsequently used by all the diﬀerent privacy methods. The NP-hardness of the k-anonymity problem was formally proved in [ 385]. A survey of k-anonymous data mining may be found in [153]. The connections between the k-anonymity problem and the frequent pattern mining problem were shown in [83]. A set enumeration method was proposed in [83] that is similar to the set enumeration methods popularly used in frequent pattern mining. The Incognito and Mondrian algorithms, discussed in this chapter, were proposed in [335 ] and [336]. The condensation approach to privacy-preserving data mining was proposed in [8]. Some recent methods perform a probabilistic version of k-anonymity on the data, so that the output of the anonymization is a probability distribution [9]. Thus, such an approach allows the use of probabilistic database methods on the transformed data. Many metrics have also been proposed for utility-based evaluation of private tables, rather than simply using the minimal generalization height [29, 315].

The -diversity and t-closeness models were proposed in [348] and [372], respectively with a focus on sensitive attribute disclosure. A diﬀerent approach for addressing sensitive attributes is proposed in [91]. A detailed survey of many of the privacy-preserving data pub-lishing techniques may be found in [218]. A closely related model to group-based anonymiza-tion is diﬀerential privacy, where the diﬀerential impact of a data record on the privacy of other data records in the database is used to perform the privacy operations [190, 191]. While diﬀerential privacy provides theoretical more robust results than many group-based models, its practical utility is yet to be realized. The curse of dimensionality in the context of anonymization problems was first observed in [10]. Subsequently, it was shown that the curse extends to other privacy models such as perturbation and -diversity [11 , 12, 372].

692 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

A practical example [402] of how high-dimensional data could be used to make privacy attacks is based on the Netflix data set [559 ]. Interestingly, this attack uses the sensitive ratings attributes and background knowledge to make identification attacks. Recently, a few methods [514, 533] have been proposed to address the curse of dimensionality in a limited way.

The problem of output privacy is closely related to the problem of inference control and auditing in statistical databases [150]. The most common problems addressed in this domain are those of association rule hiding [497], and query auditing [399]. Distributed methods transform data mining problems into secure multi-party computation primitives [ 188]. Typ-ically, these methods are dependent on the use of the oblivious transfer protocol [199, 401]. Most of these methods perform distributed privacy-preservation on either horizontally par-titioned data [297] or vertically partitioned data [495]. An overview of the various privacy tools for distributed information sharing may be found in [154].

20.8 Exercises

Suppose that you have a 1-dimensional dataset uniformly distributed in (0, 1). Uniform noise from the range (0, 1) is added to the data. Derive the final shape of the perturbed distribution.

Suppose that your perturbed data was uniformly distributed in (0, 1), and your per-turbing distribution was also uniformly distributed in (0, 1). Derive the original data distribution. Will this distribution be accurately reconstructed, in practice, for a finite data set?

Implement the Bayes distribution reconstruction algorithm for the randomization method.

Implement the (a) Incognito, and (b) Mondrian algorithms for k-anonymity.

Implement the condensation approach to k-anonymity.

In dynamic condensation, one of the steps is to split a group into two equal groups along the longest eigenvector. Let λ be the largest eigenvalue of the original group, μ be the original d-dimensional mean, and V be the longest eigenvector, which is normalized to unit norm. Compute algebraic expressions for the means of the two split groups, under the uniform distribution assumption.

Show that the sensitive attribute in both the entropy- and recursive- -diversity models

must have at least distinct values.

Show that the global entropy of the sensitive attribute distribution is at least equal to the minimum entropy of an equivalence class in it. [Hint: Use convexity of entropy]

Many k-anonymization algorithms such as Incognito depend upon the monotonicity property. Show that the monotonicity property is satisfied by (a) entropy -diversity, and (b) recursive -diversity.

Implement the (a) Incognito, and (b) Mondrian algorithms for entropy- and recursive -diversity, by making changes to your code in Exercise 4.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 398 399 400 401 402 403 404 405 ... 423