ZIP CODE
|
|
|
|
|
|
A3
|
|
|
(0, 100]
|
|
|
|
|
Z3
|
|
ALL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A2
|
(0, 20]
|
|
(20, 40]
|
|
(40, 60]
|
|
Z2
|
N.E. US MID W. US
|
W. US
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A1
|
(0, 10]
|
(10, 20]
|
(20, 30]
|
(30, 40]
|
|
Z1
|
NY
|
|
MA
|
CA
|
|
NV
|
|
A0
|
|
|
|
|
|
|
|
Z0
|
|
|
|
|
|
|
|
|
24
|
25
|
26
|
36 37
|
38
|
10547
|
10598
|
01239
|
90210
|
90345
|
89119
|
|
Figure 20.1: A value- and corresponding domain-generalization hierarchy for the age and ZIP code attributes
Table 20.3: Example of a 3-anonymized version of Table 20.1
|
Row Index
|
Age
|
ZIP Code
|
Disease
|
|
|
|
|
|
|
|
|
|
|
|
1
|
[20, 30]
|
Northeastern US
|
HIV
|
|
2
|
[30, 40]
|
Western US
|
Hepatitis C
|
|
3
|
[20, 30]
|
Northeastern US
|
HIV
|
|
4
|
[30, 40]
|
Western US
|
Hepatitis C
|
|
5
|
[30, 40]
|
Western US
|
Diabetes
|
|
6
|
[20, 30]
|
Northeastern US
|
HIV
|
|
|
|
|
|
aggregate distribution approach of randomization because the probability distribution is data-record specific, and is designed to ensure k-anonymity. While this approach has not been studied intensively, it has the potential to allow the use of recent advances in the field of probabilistic databases for anonymization.
Among the aforementioned methods, the generalization and suppression methods are most commonly used for anonymization. Therefore, most of the discussion in this section will be focused on these methods. First, the notion of k-anonymity will be defined.
Definition 20.3.1 (k-anonymity) A data set is said to be k-anonymized, if the attributes of each record in the anonymized data set cannot be distinguished from at least (k − 1) other data records.
This group of indistinguishable data records is also referred to as an equivalence class. To understand how generalization and suppression can be used for anonymization, consider the data set in Table 20.1. An example of a 3-anonymized version of this table is illustrated in Table 20.3. The SSN has been fully suppressed with column -wise suppression and replaced with an anonymized row index. Such explicit identifiers are almost always fully suppressed in anonymization. The two publicly available attributes corresponding to the age and ZIP code are now generalized and specified approximately. The subjects of the row indices 1, 3, and 6 can no longer be distinguished by using linkage attacks because their publicly available attributes are identical. Similarly, the publicly available attributes of row indices 2, 4, and 5 are identical. Thus, this table contains two equivalence classes containing three records each, and the data records cannot be distinguished from one another within these
672
|
CHAPTER 20. PRIVACY-PRESERVING DATA MINING
|
|
|
|
|
ZIP CODE
|
|
|
|
Z4
|
ALL
|
|
|
Z3
|
1****
|
|
9****
|
|
Z2
|
10***
|
|
|
90***
|
|
Z1
|
105**
|
|
902**
|
903**
|
Z0
|
10547 10598
|
90210
|
|
90345
|
Figure 20.2: An alternate value- and corresponding domain-generalization hierarchy for the ZIP code attribute
equivalence classes. In other words, an adversary can no longer match the identification of individual data records with voter registration rolls exactly. If any matching is found, then it is guaranteed that at least k = 3 records in the data set will match any particular individual in the voter registration roll.
The ZIP code is generalized with the use of the prespecified value generalization hierarchy of Fig. 20.1. The generation of a domain generalization hierarchy for a categorical attribute can be done in several ways, and depends on the skill of the analyst responsible for the privacy modifications. An alternate example of a domain generalization hierarchy for the ZIP code attribute is illustrated in Fig. 20.2. A value generalization hierarchy on the continuous attributes does not require any special domain knowledge because it can be directly created by the analyst, using the actual distribution of the continuous values in the underlying data. This requires a simple hierarchical discretization of the continuous attributes.
The goal of the privacy -preservation algorithms is to replace the original values in the data (numeric or discrete), with one of the discrete values illustrated in the taxonomy trees of Fig. 20.1. Thus, the data is recoded in terms of a new set of discrete values. In most cases, the numeric attributes do retain their ordering, because the corresponding ranges are ordered. Different algorithms use different rules in the recoding process. These different ways of recoding attributes may be distinguished as follows:
Global versus local recoding: In global recoding, a given attribute value is always replaced with the same discrete counterpart from the domain generalization hierarchy over all data records. Consider the aforementioned example of Fig. 20.1, in which ZIP code can be generalized either to state or region. In global recoding, the particular ZIP code value of 10547 needs to be consistently replaced by either Northeastern US, or New York over all the data records. However, for a different ZIP code such as 90210, a different level of hierarchy may be selected than for the 10547 value, as long as it is done consistently for a particular data value (e.g., 10547 or 90210) across all data records. In local recoding, different data records may use different generalizations for the same data value. For example, one data record might use Northeastern US, whereas another data record might use New York for 10547. While local recoding might seem to be better optimized, because of its greater flexibility, it does lose a different kind of information. In particular, because the same ZIP code might map to different values, such as New York and Northeastern US, the similarity computation
20.3. PRIVACY-PRESERVING DATA PUBLISHING
|
673
|
between the resulting data records may be less accurate. Most of the current privacy schemes use global recoding.
Full-domain generalization: Full-domain generalization is a special case of global recoding. In this approach, all values of a particular attribute are generalized to the same level of the taxonomy. For example, a ZIP code might be generalized to its state for all instances of the attribute. In other words, if the ZIP code 10547 is generalized to New York, then the ZIP code 90210 must be generalized to California. The various hierarchical alternatives for full-domain generalization of the age attribute are denoted by A0, A1, A2, and A3 in Fig. 20.1. The possible full-domain generalization levels of the ZIP code are denoted by Z0, Z1, Z2, and Z3. In this case, Z3 represents the high-est level of generalization (column suppression), and Z0 represents the original values of the ZIP code attribute. Thus, once it is decided that the anonymization algorithm should use Z2 for the ZIP code attribute, then every instance of the ZIP code attribute (Z0) in the data set is replaced with its generalized value in Z2. This is the reason that the approach is referred to as full-domain generalization, as the entire domain of data values for a particular attribute is generalized to the same level of the hierarchy. Full-domain generalization is the most common approach used in privacy-preserving data publishing.
Full-domain generalization is intuitively appealing because it ensures that the different values of an attribute have the same level of granularity throughout the data set. The earliest methods, such as Samarati’s original algorithm, and Incognito, were all full-domain generalization algorithms.
20.3.1.1 Samarati’s Algorithm
Samarati’s algorithm was first proposed in the context of the definition of k-anonymity. Samarati’s original AG-TS (Attribute Generalization and Tuple Suppression) algorithm for k-anonymity provides the basic domain generalization framework, which is the basis for group-based anonymization. It has already been discussed, how the domain generalization of a single attribute can be represented as a path. For example, the path from Z 0 to Z3 in Fig. 20.1 represents the generalization of the ZIP code attribute. The notion of domain generalization can also be defined for combinations of attributes. However, in the case of attribute combinations, the relationships are no longer expressed as a path, but as a special kind of directed acyclic graph, known as a lattice. In this case, each node specifies a (full-domain) generalization level for the different attributes For example, < A1, Z2 > denotes the domain generalization level of age to A1 and ZIP code to Z2. In other words, every data record is generalized to the level < A1, Z2 >. Note that < A1 , Z2 > also represents the generalization level of the (anonymized) Table 20.3 based on the domain-generalization hierarchies specified in Fig. 20.1.
Thus, each node in the lattice specifies a possible level of full-domain generalization, in terms of which the original data is represented. The edges in this graph represent the direct generalization relationships among these tuples of domains. A directed path in the lattice, from lower to higher levels, represents a sequence of generalizations. Conversely, a lower-level node is a specialization of a higher-level node. For example, the node < A1, Z1 > is a direct specialization of either < A1, Z2 > , or < A2, Z1 > because a single attribute in either can be specialized once to immediately yield < A1, Z1 >. An example of the domain generalization hierarchy for the age and ZIP code combination is illustrated in Fig. 20.3a.
The goal of the full-domain anonymization algorithm is to discover the node < Ai, Zj > in
674
|
|
|
|
CHAPTER 20. PRIVACY-PRESERVING DATA MINING
|
|
|
|
|
Dostları ilə paylaş: |