Data Mining: The Textbook


Date: 07.01.2024



identity disclosure, it does not prevent attribute disclosure.

20.3. PRIVACY-PRESERVING DATA PUBLISHING

The main reason for this breach is that the sensitive information is not diverse enough within the anonymized groups. Since the goal of privacy-preserving data publishing is to prevent the revelation of sensitive information, a model that does not use the sensitive attribute values within the group formation process cannot achieve this goal. The ℓ-diversity model is designed to ensure that the sensitive attributes within an equivalence class are sufficiently diverse.

Definition 20.3.2 (ℓ-diversity Principle) An equivalence class is said to be ℓ-diverse if it contains ℓ “well-represented” values for the sensitive attribute. An anonymized table is said to be ℓ-diverse if each equivalence class in it is ℓ-diverse.

It is important to note that the notion of “well-represented” can be instantiated in several different ways. Therefore, the aforementioned definition provides the basic principle behind this approach, but cannot be considered a hard definition. Two instantiations are the notions of entropy ℓ-diversity and recursive ℓ-diversity, which are described below.

Definition 20.3.3 (Entropy ℓ-diversity) Let p₁ . . . p_r be the fractions of the data records belonging to the different values of the sensitive attribute in an equivalence class. The equivalence class is said to be entropy ℓ-diverse if the entropy of its sensitive attribute value distribution is at least log(ℓ):

$$-\sum_{i=1}^{r} p_i \cdot \log(p_i) \geq \log(\ell) \tag{20.10}$$

An anonymized table is said to satisfy entropy ℓ-diversity if each equivalence class in it satisfies entropy ℓ-diversity.
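The entropy condition can be checked directly from the sensitive values in each equivalence class. The following is a minimal sketch, not a reference implementation: the function names (`entropy`, `is_entropy_l_diverse`, `table_is_entropy_l_diverse`), the representation of a table as a list of lists of sensitive values, and the small floating-point tolerance are all assumptions made for illustration. Natural logarithms are used on both sides of the inequality, so the choice of base does not affect the result.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def is_entropy_l_diverse(sensitive_values, l, tol=1e-12):
    """Check the entropy condition for one equivalence class.

    sensitive_values: sensitive attribute values of the records in the class.
    tol: hypothetical tolerance so that an exactly uniform distribution over
         l values is not rejected because of floating-point rounding.
    """
    return entropy(sensitive_values) >= math.log(l) - tol

def table_is_entropy_l_diverse(classes, l):
    """A table qualifies only if every equivalence class does."""
    return all(is_entropy_l_diverse(values, l) for values in classes)

# Illustrative data (made up): fractions 0.5, 0.25, 0.25 in the first class.
class_a = ["flu", "flu", "hiv", "cancer"]
class_b = ["flu", "hiv", "cancer"]

print(is_entropy_l_diverse(class_a, 2))   # entropy is about 1.04, log(2) is about 0.69
print(is_entropy_l_diverse(class_a, 3))   # log(3) is about 1.10, so the check fails
print(table_is_entropy_l_diverse([class_a, class_b], 2))
```

Note that `class_a` contains three distinct sensitive values but is not entropy 3-diverse, because its distribution is not uniform. Having ℓ distinct values is necessary for the condition, not sufficient.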

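It can be shown that the sensitive attribute in an equivalence class must have at least ℓ distinct values for the table to be ℓ-diverse (see Exercise 7). The reason is that the entropy of a distribution over m distinct values is at most log(m), so an entropy of at least log(ℓ) requires m ≥ ℓ. Therefore, any ℓ-diverse group has at least ℓ elements, and is ℓ-anonymous as well.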

One problem with this definition of ℓ-diversity is that it may be too restrictive in many settings, especially when the distributions of the sensitive attribute values are uneven. The entropy of a table can be shown to be at least equal to the minimum entropy of the constituent equivalence classes into which it is partitioned (see Exercise 8). Therefore, to ensure ℓ-diversity of each equivalence class, the sensitive attribute distribution in the entire table must also be ℓ-diverse. This is a restrictive assumption in many settings, because most real distributions of sensitive attributes are very skewed. For example, in a medical application, the sensitive (disease) attribute is likely to have uneven frequencies between normal individuals and various diseases. Greater attribute skew reduces the (global) entropy of the sensitive-attribute distribution across the entire table. When this global entropy is less than log(ℓ), it is no longer possible to create an entropy ℓ-diverse partition without suppressing many data records.
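The effect of skew can be illustrated numerically. The sketch below uses made-up disease frequencies and hypothetical helper names (`entropy`, `max_entropy_l`); it is only meant to show how the global entropy caps the achievable ℓ. Since the entropy condition is H ≥ log(ℓ), the largest real-valued ℓ that a distribution can support is exp(H).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def max_entropy_l(values):
    """Largest real-valued l satisfying entropy(values) >= log(l)."""
    return math.exp(entropy(values))

# Made-up skewed table: 90% of the records share one sensitive value.
table = ["healthy"] * 90 + ["flu"] * 5 + ["diabetes"] * 3 + ["hiv"] * 2

# One possible partition of the same records into two equivalence classes.
class_1 = ["healthy"] * 45 + ["flu"] * 5
class_2 = ["healthy"] * 45 + ["diabetes"] * 3 + ["hiv"] * 2

print("global entropy:", entropy(table))
print("largest supported l:", max_entropy_l(table))  # below 2 for this table
print("class entropies:", entropy(class_1), entropy(class_2))
```

Although this table contains four distinct sensitive values, its global entropy supports an ℓ of only about 1.5, so no partition of the full table can be entropy 2-diverse. In the sample partition, the lower of the two class entropies stays below the global entropy, which is consistent with the bound stated in the text.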

Therefore, a more relaxed notion of recursive (c, ℓ)-diversity has been proposed. The basic goal of the definition is to ensure that the most frequent attribute value in an equivalence class does not dominate the less frequent sensitive values in it. An additional parameter c is used to control the relative frequency of the different values of the sensitive attribute within an equivalence class.

Definition 20.3.4 (Recursive (c, ℓ)-diversity) Let p₁ . . . p_r be the fraction of the data
