Data Mining: The Textbook




  1. Kullback-Leibler (KL) distance: This is an information-theoretic measure that computes the difference between the cross-entropy of (P, Q) and the entropy of P (a small numerical sketch follows this item).

$$\text{Dist}(P, Q) = \sum_{i=1}^{r} \left( p_i \cdot \log(p_i) - p_i \cdot \log(q_i) \right) \qquad (20.14)$$

Note that the entropy of the first distribution is $-\sum_{i=1}^{r} p_i \cdot \log(p_i)$, whereas the cross-entropy is $-\sum_{i=1}^{r} p_i \cdot \log(q_i)$.
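For concreteness, the following is a minimal sketch of the computation in Equation 20.14; the two distributions are made-up values, not from the book. Terms with $p_i = 0$ are skipped because they contribute nothing, and the distance is undefined if some $q_i = 0$ while the corresponding $p_i > 0$.

```python
# A minimal sketch of the KL distance of Eq. 20.14 between an equivalence-class
# distribution P and the global distribution Q. The numbers are illustrative.
import math

def kl_distance(p, q):
    """Cross-entropy of (P, Q) minus the entropy of P: sum_i p_i*(log p_i - log q_i)."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # sensitive-attribute distribution within an equivalence class
q = [0.3, 0.4, 0.3]   # global distribution of the full data set
print(kl_distance(p, q))   # larger values indicate a greater privacy-revealing skew
```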













While these are the two most common distance measures, other distance measures can be defined to serve different application-specific goals.

For example, one may wish to prevent scenarios in which a particular equivalence class contains semantically related sensitive attribute values. Consider a scenario where a particular equivalence class contains diseases such as gastric ulcer, gastritis, and stomach cancer. If a group contains only these diseases, then it provides significant information about the sensitive attribute of that group. The t-closeness method prevents this scenario by changing the distance measure so that the distance between different values of the sensitive attribute is taken into account in the distance-computation process. In particular, the Earth Mover's Distance can be used effectively for this scenario.


The earth mover's distance (EMD) is defined in terms of the "work" (or cost) required to transform one distribution into the other, if sensitive attribute values in the original data are allowed to be flipped. Obviously, it requires less "work" to flip a sensitive value to a semantically similar value. Formally, let $d_{ij}$ be the amount of "work" required to transform the $i$th sensitive value into the $j$th sensitive value, and let $f_{ij}$ be the fraction of data records that are flipped from attribute value $i$ to attribute value $j$. The values of $d_{ij}$ are provided by a domain expert. Note that there are many different ways to flip the distribution $(p_1 \ldots p_r)$ into the distribution $(q_1 \ldots q_r)$, and it is desired to use the least-cost sequence of flips to compute the distance between $P$ and $Q$. For example, one would rather flip "gastric ulcer" to "gastritis" than flip "HIV" to "gastritis," because the former is likely to have lower cost. Therefore, $f_{ij}$ is a variable in a linear programming optimization problem, which is constructed to minimize the overall cost of flips. For a table with $r$ distinct sensitive attribute values, the cost of flips is given by $\sum_{i=1}^{r} \sum_{j=1}^{r} f_{ij} \cdot d_{ij}$. The earth mover's distance may be posed as an optimization problem that minimizes this objective function subject to constraints on the aggregate flips involving each sensitive attribute value. The constraints ensure that the aggregate flips do transform the distribution $P$ into $Q$.















$$\text{Dist}(P, Q) = \text{Minimize} \;\; \sum_{i=1}^{r} \sum_{j=1}^{r} f_{ij} \cdot d_{ij}$$

subject to:

$$p_i - \sum_{j=1}^{r} f_{ij} + \sum_{j=1}^{r} f_{ji} = q_i \qquad i \in \{1 \ldots r\}$$

$$f_{ij} \geq 0 \qquad i, j \in \{1, \ldots r\}$$
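As a rough illustration (not the book's implementation), this linear program can be handed to a generic LP solver. The sketch below uses scipy.optimize.linprog; the flip costs and the two distributions are hypothetical values chosen only to exercise the formulation.

```python
# A minimal sketch of the t-closeness EMD linear program using SciPy.
# The ground distances d[i][j] and the two distributions are illustrative.
import numpy as np
from scipy.optimize import linprog

def earth_movers_distance(p, q, d):
    """Solve the EMD linear program.

    p, q : length-r sensitive-attribute distributions (each sums to 1)
    d    : r x r matrix of flip costs d_ij supplied by a domain expert
    """
    r = len(p)
    # Decision variables f_ij, flattened row-major into a vector of length r*r.
    c = np.asarray(d, dtype=float).ravel()

    # Constraint for each i: sum_j f_ij - sum_j f_ji = p_i - q_i
    A_eq = np.zeros((r, r * r))
    for i in range(r):
        A_eq[i, i * r:(i + 1) * r] += 1.0   # outgoing flips from value i
        A_eq[i, i::r] -= 1.0                # incoming flips into value i
    b_eq = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)

    # f_ij >= 0 is enforced through the variable bounds.
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Hypothetical example: three semantically ordered diseases with unit-spaced
# flip costs (e.g., gastric ulcer, gastritis, stomach cancer).
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
p = [0.5, 0.3, 0.2]   # distribution within an equivalence class
q = [0.3, 0.4, 0.3]   # global distribution of the full data set
print(earth_movers_distance(p, q, d))
```

When the sensitive values lie on a one-dimensional semantic ordering, the same distance can be computed more cheaply from cumulative differences, but the LP form above works for arbitrary cost matrices supplied by a domain expert.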




The earth mover's distance has certain properties that simplify the computation of generalizations satisfying t-closeness.


Lemma 20.3.3 Let E1 and E2 be two equivalence classes, and let P1, P2 be their sensitive attribute distributions. Let P be the distribution of E1 ∪ E2, and Q be the global distribution of the full data set. Then, it can be shown that:





$$\text{Dist}(P, Q) \leq \frac{|E_1|}{|E_1| + |E_2|} \cdot \text{Dist}(P_1, Q) + \frac{|E_2|}{|E_1| + |E_2|} \cdot \text{Dist}(P_2, Q) \qquad (20.15)$$

This lemma is a result of the fact that the optimal objective function of a linear programming formulation is convex, and $P$ can be expressed as a convex linear combination of $P_1$ and $P_2$ with coefficients $\frac{|E_1|}{|E_1|+|E_2|}$ and $\frac{|E_2|}{|E_1|+|E_2|}$, respectively. This convexity result also implies the following:

$$\text{Dist}(P, Q) \leq \max\{\text{Dist}(P_1, Q), \text{Dist}(P_2, Q)\}$$
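The bound of Equation 20.15 can be checked numerically with the earth_movers_distance sketch given earlier; the class sizes and distributions below are arbitrary illustrative values.

```python
# Numerical check of Lemma 20.3.3 (Eq. 20.15) using the earth_movers_distance
# sketch from above; class sizes and distributions are made-up values.
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # hypothetical flip costs
q = [0.3, 0.4, 0.3]                      # global distribution
p1, n1 = [0.6, 0.2, 0.2], 40             # distribution and size of E1
p2, n2 = [0.2, 0.5, 0.3], 60             # distribution and size of E2

# Distribution of the merged class E1 ∪ E2 is the size-weighted average.
p = [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(p1, p2)]

lhs = earth_movers_distance(p, q, d)
rhs = (n1 * earth_movers_distance(p1, q, d) +
       n2 * earth_movers_distance(p2, q, d)) / (n1 + n2)
print(lhs <= rhs + 1e-9)                 # True: the convexity bound holds
```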

Therefore, when two equivalence classes satisfying t-closeness are merged, the merged equivalence class will also satisfy t-closeness. This implies the monotonicity property for t-closeness.


Lemma 20.3.4 (t-closeness monotonicity) If a table satisfies t-closeness, then any generalization of the table satisfies t-closeness as well.


The proof of this lemma follows from the fact that the generalization A of any table B contains equivalence classes that are the union of equivalence classes in B. If each equivalence class in B already satisfies t-closeness, then the corresponding union of these equivalence classes must satisfy t-closeness. Therefore, the generalized table must also satisfy t-closeness. This monotonicity property implies that all existing algorithms for k-anonymity can be directly used for t-closeness. The k-anonymity test is simply replaced with a test for t-closeness.
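As a hedged sketch of this substitution, the test below checks every equivalence class of a candidate generalization against the global distribution, reusing the earth_movers_distance helper from the earlier sketch; the record layout (a list of dicts) and the attribute names are hypothetical.

```python
# A minimal sketch of a t-closeness test that could replace the k-anonymity
# test inside an existing anonymization algorithm. Relies on the
# earth_movers_distance() LP sketch shown earlier.
from collections import Counter

def distribution(records, sensitive_attr, domain):
    """Fractional distribution of the sensitive attribute over a fixed domain."""
    counts = Counter(rec[sensitive_attr] for rec in records)
    n = len(records)
    return [counts[v] / n for v in domain]

def satisfies_t_closeness(equivalence_classes, all_records, sensitive_attr,
                          domain, d, t):
    """True if every equivalence class is within EMD t of the global distribution."""
    q = distribution(all_records, sensitive_attr, domain)
    for ec in equivalence_classes:
        p = distribution(ec, sensitive_attr, domain)
        if earth_movers_distance(p, q, d) > t:
            return False
    return True
```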




20.3.4 The Curse of Dimensionality


As discussed at various places in this book, the curse of dimensionality causes challenges for many data mining problems. Privacy preservation is also one of the problems affected by the curse of dimensionality. There are two primary ways in which the curse of dimensionality impacts the effectiveness of anonymization algorithms:





  1. Computational challenges: It has been shown [385] that optimal k-anonymization is NP-hard. This implies that with increasing dimensionality, it becomes more difficult to perform privacy preservation. The NP-hardness result also applies to the ℓ-diversity and t-closeness models, using a very similar argument.




  2. Qualitative challenges: The qualitative challenges to privacy preservation are even more fundamental. Recently, it has been shown that it may be difficult to perform effective privacy preservation without losing the utility of the anonymized data records. This challenge is particularly significant because it makes the privacy-preservation process less practical. The discussion of this section will be centered on this issue.

In the following, a discussion of the qualitative impact of the dimensionality curse on group-based anonymization methods will be provided. While a formal mathematical proof [10] is beyond the scope of this book, an intuitive version of the argument is presented. To understand why the curse of dimensionality increases the likelihood of breaches, one only needs to understand the well-known notion of high-dimensional data sparsity. For ease in understanding, consider the case of numeric attributes. A generalized representation of a table can be considered a rectangular region in d-dimensional space, where d is the number of quasi-identifiers. Let $F_i \in (0, 1)$ be the fraction of the range of dimension $i$ covered by a particular generalization. For the anonymized data set to be useful, the value of $F_i$ should be as small as possible. However, the fractional volume of the space covered by a generalization with fractional domain ranges of $F_1 \ldots F_d$ is given by $\prod_{i=1}^{d} F_i$. This fraction converges to 0 exponentially fast with increasing dimensionality $d$. As a result, the fraction of data points within the volume also reduces rapidly, especially if the correlations among the different dimensions are weak. For large enough values of $d$, it will be difficult to create d-dimensional regions containing at least k data points, unless the values of $F_i$ are chosen to be close to 1. In such cases, any value of an attribute is generalized to almost the entire range of values. Such a highly generalized data set therefore loses its utility for data mining purposes. This general principle has also been shown to be true for other privacy models, such as perturbation and ℓ-diversity. The bibliographic notes contain pointers to some of these theoretical results.
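A quick numerical illustration of this decay, with an arbitrary choice of $F_i = 0.5$ for every dimension (not a value from the book):

```python
# Illustrative only: fraction of the data space covered by a generalization
# that spans half of each quasi-identifier's range (F_i = 0.5 for all i).
F = 0.5
for d in (1, 2, 5, 10, 20, 50):
    volume_fraction = F ** d   # product of F_i over d dimensions
    print(f"d = {d:2d}: covered volume fraction = {volume_fraction:.2e}")
```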

A real-world example of this scenario is the Netflix Prize data set, in which Netflix released the movie ratings of individuals [559] to facilitate the study of collaborative filtering algorithms. Many other sources of data could be found, such as the Internet Movie Database (IMDb), from which the ratings information could be matched with the Netflix Prize data set. It was shown that the identity of users could be breached with a very high level of accuracy as the number of ratings (the specified dimensionality) increased [402]. Eventually, Netflix retracted the data set.




20.4 Output Privacy


The privacy-preservation process can be applied at any point in the data mining pipeline, starting with data collection and publishing, and ending with the actual application of the data mining process. The output of data mining algorithms can be very informative to an adversary. In particular, data mining algorithms with a large output and exhaustive data descriptions are particularly risky in the context of disclosure. For example, consider an association rule mining algorithm, in which the following rule is generated with high confidence:


(Age = 26, ZIP Code = 10562) ⇒ HIV


This association rule is detrimental to the privacy of an individual satisfying the condition on the left-hand side of the aforementioned rule. Therefore, the discovery of this rule may result in the unforeseen disclosure of private information about an individual. In general, many databases may have revealing relationships among subsets of attributes because of constraints and strong statistical relationships between attribute values.


The problem of association rule hiding may be considered a variation of the problem of statistical disclosure control, or database inference control. In these problems, the goal is to prevent inference of sensitive values in the database from other related values. However, a crucial difference does exist between database inference control and association rule hiding. In database inference control, the focus is on hiding some of the entries, so that the privacy of other entries is preserved. In association rule hiding, the focus is on hiding the rules themselves, rather than the entries. Therefore, the privacy-preservation process is applied to the output of the data mining algorithm, rather than to the base data.


In association rule mining, a set of sensitive rules is specified by the system administrator. The task is to mine all association rules such that none of the sensitive rules are discovered, but all nonsensitive rules are discovered. Association rule hiding methods are either heuristic methods, border-based methods, or exact methods. In the first class of methods, a subset of transactions is removed from the data, and the association rules are discovered on the set of sanitized transactions. In general, if too many transactions are removed, then the remaining nonsensitive rules that are discovered will not reflect the true set of rules. This may lead to the discovery of rules that do not reflect the true patterns in the underlying data. In the case of border-based methods, the border of the frequent pattern mining algorithm is adjusted so as to discover only nonsensitive rules. Note that when the borders of the frequent itemsets are adjusted, it will lead to the exclusion of some nonsensitive rules along with the sensitive rules. The last class of methods formulates the hiding process as a constraint satisfaction problem, which can be solved using integer programming. While these methods provide exact solutions, they are much slower, and their use is limited to problems of smaller size.
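The following is a simplified sketch of the first (heuristic) class of methods: it greedily removes transactions that support a sensitive itemset until its support falls below the mining threshold. The removal policy and the transaction format are illustrative assumptions, not any specific published algorithm.

```python
# Illustrative heuristic sanitization sketch (not a specific published
# algorithm): drop supporting transactions until each sensitive itemset's
# support falls below the minimum support threshold used for mining.

def supports(transaction, itemset):
    """True if the transaction (a set of items) contains every item of the itemset."""
    return itemset <= transaction

def sanitize(transactions, sensitive_itemsets, min_support):
    """Return a sanitized copy of the transaction database."""
    sanitized = [set(t) for t in transactions]
    for itemset in sensitive_itemsets:
        threshold = min_support * len(sanitized)
        while sum(supports(t, itemset) for t in sanitized) >= threshold:
            # Greedily remove one supporting transaction; removing too many
            # transactions distorts the nonsensitive rules, which is the
            # main drawback of heuristic methods noted above.
            victim = next(t for t in sanitized if supports(t, itemset))
            sanitized.remove(victim)
    return sanitized

# Hypothetical usage: hide rules derived from the itemset {"age=26", "zip=10562", "HIV"}.
db = [{"age=26", "zip=10562", "HIV"},
      {"age=26", "zip=10562", "flu"},
      {"age=31", "zip=10562", "HIV"}]
print(sanitize(db, [{"age=26", "zip=10562", "HIV"}], min_support=0.3))
```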


A related problem in output privacy is that of query auditing. In query auditing, the assumption is that users are allowed to issue a sequence of queries to the database. However, the response to one or more queries may sometimes lead to the compromising of sensitive information about smaller sets of individuals. Therefore, the responses to some of the queries are withheld (or audited) to prevent undesirable disclosure. The bibliographic notes contain specific pointers to a variety of query auditing and association rule hiding algorithms.
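A toy sketch of the query-auditing idea is shown below; the denial policy (refusing a sum query whose record set differs from an already answered query by a single record) is an illustrative assumption, not a complete auditing algorithm.

```python
# A toy sketch of query auditing: refuse a SUM query whose queried record set,
# alone or combined with a previously answered query, isolates one individual.

class QueryAuditor:
    def __init__(self, data):
        self.data = data              # record_id -> sensitive numeric value
        self.answered = []            # record-id sets of previously answered queries

    def sum_query(self, record_ids):
        current = frozenset(record_ids)
        # Deny if the query itself, or its symmetric difference with any
        # answered query, pins down a single record's value.
        risky = len(current) <= 1 or any(
            len(current ^ previous) <= 1 for previous in self.answered)
        if risky:
            return None               # query is audited (withheld)
        self.answered.append(current)
        return sum(self.data[r] for r in record_ids)

auditor = QueryAuditor({1: 50_000, 2: 62_000, 3: 58_000})
print(auditor.sum_query([1, 2, 3]))   # answered
print(auditor.sum_query([1, 2]))      # denied: differs from the first query by one record
```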


