Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	392/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 388 389 390 391 392 393 394 395 ... 423

1-Data Mining tarjima

ZIP CODE

(0, 100]

ALL

(0, 20]

(20, 40]

(40, 60]

N.E. US MID W. US

W. US

(0, 10]

(10, 20]

(20, 30]

(30, 40]

36 37

10547

10598

01239

90210

90345

89119

Figure 20.1: A value- and corresponding domain-generalization hierarchy for the age and ZIP code attributes

Table 20.3: Example of a 3-anonymized version of Table 20.1

Row Index	Age	ZIP Code	Disease


1	[20, 30]	Northeastern US	HIV
2	[30, 40]	Western US	Hepatitis C
3	[20, 30]	Northeastern US	HIV
4	[30, 40]	Western US	Hepatitis C
5	[30, 40]	Western US	Diabetes
6	[20, 30]	Northeastern US	HIV

aggregate distribution approach of randomization because the probability distribution is data-record specific, and is designed to ensure k-anonymity. While this approach has not been studied intensively, it has the potential to allow the use of recent advances in the field of probabilistic databases for anonymization.

Among the aforementioned methods, the generalization and suppression methods are most commonly used for anonymization. Therefore, most of the discussion in this section will be focused on these methods. First, the notion of k-anonymity will be defined.

Definition 20.3.1 (k-anonymity) A data set is said to be k-anonymized, if the attributes of each record in the anonymized data set cannot be distinguished from at least (k − 1) other data records.

This group of indistinguishable data records is also referred to as an equivalence class. To understand how generalization and suppression can be used for anonymization, consider the data set in Table 20.1. An example of a 3-anonymized version of this table is illustrated in Table 20.3. The SSN has been fully suppressed with column -wise suppression and replaced with an anonymized row index. Such explicit identifiers are almost always fully suppressed in anonymization. The two publicly available attributes corresponding to the age and ZIP code are now generalized and specified approximately. The subjects of the row indices 1, 3, and 6 can no longer be distinguished by using linkage attacks because their publicly available attributes are identical. Similarly, the publicly available attributes of row indices 2, 4, and 5 are identical. Thus, this table contains two equivalence classes containing three records each, and the data records cannot be distinguished from one another within these

672		CHAPTER 20. PRIVACY-PRESERVING DATA MINING
							ZIP CODE
					Z4		ALL
			Z3	1****				9****
	Z2		10***					90***
	Z1		105**			902**		903**
Z0	10547 10598			90210				90345

Figure 20.2: An alternate value- and corresponding domain-generalization hierarchy for the ZIP code attribute

equivalence classes. In other words, an adversary can no longer match the identification of individual data records with voter registration rolls exactly. If any matching is found, then it is guaranteed that at least k = 3 records in the data set will match any particular individual in the voter registration roll.

The ZIP code is generalized with the use of the prespecified value generalization hierarchy of Fig. 20.1. The generation of a domain generalization hierarchy for a categorical attribute can be done in several ways, and depends on the skill of the analyst responsible for the privacy modifications. An alternate example of a domain generalization hierarchy for the ZIP code attribute is illustrated in Fig. 20.2. A value generalization hierarchy on the continuous attributes does not require any special domain knowledge because it can be directly created by the analyst, using the actual distribution of the continuous values in the underlying data. This requires a simple hierarchical discretization of the continuous attributes.

The goal of the privacy -preservation algorithms is to replace the original values in the data (numeric or discrete), with one of the discrete values illustrated in the taxonomy trees of Fig. 20.1. Thus, the data is recoded in terms of a new set of discrete values. In most cases, the numeric attributes do retain their ordering, because the corresponding ranges are ordered. Diﬀerent algorithms use diﬀerent rules in the recoding process. These diﬀerent ways of recoding attributes may be distinguished as follows:

Global versus local recoding: In global recoding, a given attribute value is always replaced with the same discrete counterpart from the domain generalization hierarchy over all data records. Consider the aforementioned example of Fig. 20.1, in which ZIP code can be generalized either to state or region. In global recoding, the particular ZIP code value of 10547 needs to be consistently replaced by either Northeastern US, or New York over all the data records. However, for a diﬀerent ZIP code such as 90210, a diﬀerent level of hierarchy may be selected than for the 10547 value, as long as it is done consistently for a particular data value (e.g., 10547 or 90210) across all data records. In local recoding, diﬀerent data records may use diﬀerent generalizations for the same data value. For example, one data record might use Northeastern US, whereas another data record might use New York for 10547. While local recoding might seem to be better optimized, because of its greater flexibility, it does lose a diﬀerent kind of information. In particular, because the same ZIP code might map to diﬀerent values, such as New York and Northeastern US, the similarity computation

20.3. PRIVACY-PRESERVING DATA PUBLISHING

673

between the resulting data records may be less accurate. Most of the current privacy schemes use global recoding.

Full-domain generalization: Full-domain generalization is a special case of global recoding. In this approach, all values of a particular attribute are generalized to the same level of the taxonomy. For example, a ZIP code might be generalized to its state for all instances of the attribute. In other words, if the ZIP code 10547 is generalized to New York, then the ZIP code 90210 must be generalized to California. The various hierarchical alternatives for full-domain generalization of the age attribute are denoted by A₀, A₁, A₂, and A₃ in Fig. 20.1. The possible full-domain generalization levels of the ZIP code are denoted by Z₀, Z₁, Z₂, and Z₃. In this case, Z₃ represents the high-est level of generalization (column suppression), and Z₀ represents the original values of the ZIP code attribute. Thus, once it is decided that the anonymization algorithm should use Z₂ for the ZIP code attribute, then every instance of the ZIP code attribute (Z₀) in the data set is replaced with its generalized value in Z₂. This is the reason that the approach is referred to as full-domain generalization, as the entire domain of data values for a particular attribute is generalized to the same level of the hierarchy. Full-domain generalization is the most common approach used in privacy-preserving data publishing.

Full-domain generalization is intuitively appealing because it ensures that the diﬀerent values of an attribute have the same level of granularity throughout the data set. The earliest methods, such as Samarati’s original algorithm, and Incognito, were all full-domain generalization algorithms.

20.3.1.1 Samarati’s Algorithm

Samarati’s algorithm was first proposed in the context of the definition of k-anonymity. Samarati’s original AG-TS (Attribute Generalization and Tuple Suppression) algorithm for k-anonymity provides the basic domain generalization framework, which is the basis for group-based anonymization. It has already been discussed, how the domain generalization of a single attribute can be represented as a path. For example, the path from Z ₀ to Z₃ in Fig. 20.1 represents the generalization of the ZIP code attribute. The notion of domain generalization can also be defined for combinations of attributes. However, in the case of attribute combinations, the relationships are no longer expressed as a path, but as a special kind of directed acyclic graph, known as a lattice. In this case, each node specifies a (full-domain) generalization level for the diﬀerent attributes For example, < A₁, Z₂ > denotes the domain generalization level of age to A₁ and ZIP code to Z₂. In other words, every data record is generalized to the level < A₁, Z₂ >. Note that < A₁ , Z₂ > also represents the generalization level of the (anonymized) Table 20.3 based on the domain-generalization hierarchies specified in Fig. 20.1.

Thus, each node in the lattice specifies a possible level of full-domain generalization, in terms of which the original data is represented. The edges in this graph represent the direct generalization relationships among these tuples of domains. A directed path in the lattice, from lower to higher levels, represents a sequence of generalizations. Conversely, a lower-level node is a specialization of a higher-level node. For example, the node < A₁, Z₁ > is a direct specialization of either < A₁, Z₂ > , or < A₂, Z₁ > because a single attribute in either can be specialized once to immediately yield < A₁, Z₁ >. An example of the domain generalization hierarchy for the age and ZIP code combination is illustrated in Fig. 20.3a.

The goal of the full-domain anonymization algorithm is to discover the node < A_i, Z_j > in

674

CHAPTER 20. PRIVACY-PRESERVING DATA MINING

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 388 389 390 391 392 393 394 395 ... 423