records belonging to the r different values of the sensitive attribute in an equivalence class, such that p1 ≥ p2 ≥ . . . ≥ pr. The equivalence class satisfies recursive (c, ℓ)-diversity, if the following is true:

p1 < c · (pℓ + pℓ+1 + . . . + pr)
684 CHAPTER 20. PRIVACY-PRESERVING DATA MINING
An anonymized table is said to satisfy recursive (c, ℓ)-diversity, if each equivalence class in it satisfies recursive (c, ℓ)-diversity.
The idea is that the least frequent tail of the sensitive attribute values must contain sufficient cumulative frequency compared to the most frequent sensitive attribute value. The value of r has to be at least ℓ, for the right-hand side of the aforementioned relationship to be non-zero.
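As a concrete sketch, the recursive (c, ℓ)-diversity condition can be checked for a single equivalence class as follows. The function name and input format are illustrative; since the condition compares relative frequencies, raw counts can be used directly.

```python
from collections import Counter

def satisfies_recursive_cl_diversity(sensitive_values, c, ell):
    """Check recursive (c, l)-diversity for one equivalence class.

    Counts are sorted so that p1 >= p2 >= ... >= pr; the class
    satisfies the condition when p1 < c * (p_ell + ... + p_r).
    """
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    r = len(counts)
    if r < ell:
        return False  # the tail sum starting at position ell is empty
    # counts[ell - 1:] is the least frequent tail p_ell, ..., p_r
    return counts[0] < c * sum(counts[ell - 1:])
```

For example, a class with counts (3, 2, 2) satisfies the condition for c = 2 and ℓ = 2 because 3 < 2 · (2 + 2), whereas a class with counts (10, 1) fails it for c = 1 and ℓ = 2.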
A key property of ℓ-diversity is that any generalization of an ℓ-diverse table is also ℓ-diverse. This is true for both definitions of ℓ-diversity.
Lemma 20.3.1 (Entropy ℓ-diversity monotonicity) If a table is entropy ℓ-diverse, then any generalization of the table is entropy ℓ-diverse as well.
Lemma 20.3.2 (Recursive (c, ℓ)-diversity monotonicity) If a table is recursive (c, ℓ)-diverse, then any generalization of the table is recursive (c, ℓ)-diverse as well.
The reader is advised to work out Exercises 9(a) and (b), which are related to these results. Thus, ℓ-diversity exhibits the same monotonicity property exhibited by k-anonymity algorithms. This implies that the algorithms for k-anonymity can be easily generalized to ℓ-diversity by making minor modifications. For example, both Samarati's algorithm and the Incognito algorithm can be adapted to the ℓ-diversity definition. The only change to any k-anonymity algorithm is as follows. Every time a table is tested for k-anonymity, it is now tested for ℓ-diversity instead. Therefore, algorithmic development of ℓ-diverse anonymization methods is typically executed by simply adapting existing k-anonymization algorithms.
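The drop-in test described above can be sketched as follows, here using the entropy ℓ-diversity definition (a class is entropy ℓ-diverse when the entropy of its sensitive-value distribution is at least log ℓ); the function names are illustrative:

```python
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, ell):
    """Entropy l-diversity test for one equivalence class:
    -sum(p_i * log(p_i)) >= log(ell)."""
    counts = Counter(sensitive_values).values()
    total = sum(counts)
    entropy = -sum((n / total) * math.log(n / total) for n in counts)
    return entropy >= math.log(ell)

def is_table_l_diverse(equivalence_classes, ell):
    # The only change to a k-anonymity algorithm: every time the
    # candidate table was tested for k-anonymity, test every
    # equivalence class for l-diversity instead.
    return all(is_entropy_l_diverse(ec, ell) for ec in equivalence_classes)
```

Because of the monotonicity lemmas above, search strategies that prune generalization lattices for k-anonymity remain correct when this test is substituted.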
20.3.3 The t-closeness Model
While the ℓ-diversity model is effective in preventing direct inference of sensitive attributes, it does not fully prevent the gain of some knowledge by an adversary. The primary reason for this is that ℓ-diversity does not account for the distribution of the sensitive attribute values in the original table. For example, the entropy of a set of sensitive attribute values with relative frequencies p1 . . . pr will take on the maximum value when p1 = p2 = . . . = pr = 1/r. Unfortunately, this can often represent a serious breach of privacy, when there is a significant skew in the original distribution of sensitive attribute values. Consider the example of a medical database of HIV tests, where the sensitive value takes on the two values of "HIV" or "normal," with relative proportions of 1 : 99. In this case, a group with an equal distribution of HIV and normal patients will have the highest entropy, based on the ℓ-diversity definition.
Unfortunately, such a distribution is highly revealing when the distribution of the sensitive values in the original data is taken into account. Sensitive values are usually distributed in a skewed way, across most real data sets. In the medical example discussed above, it is already known that only 1% of the patients in the entire data set have HIV. Thus, the equal distribution of HIV-infected and normal patients within a group provides a significant information gain to the adversary. The adversary now knows that this small group of
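The gap between the two distributions can be quantified numerically. The following sketch computes the entropy of the 1 : 99 background distribution and of an equal-split group, and uses the Kullback-Leibler divergence as one illustrative (not the only possible) measure of how far the group deviates from the background:

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Background (table-wide) distribution: 1% HIV, 99% normal.
background = [0.01, 0.99]
# An equivalence class with an equal split maximizes entropy,
# and therefore looks ideal under the entropy l-diversity criterion ...
group = [0.5, 0.5]

# ... yet it deviates sharply from the background distribution,
# which is exactly the information gained by the adversary.
kl_divergence = sum(p * math.log(p / q) for p, q in zip(group, background))
```

Here entropy(group) = ln 2 ≈ 0.693 is the maximum possible for two values, while entropy(background) ≈ 0.056; the large divergence between the group and the background is what the ℓ-diversity criterion fails to penalize.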