Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	393/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 389 390 391 392 393 394 395 396 ... 423

1-Data Mining tarjima

SATISFYING Z 3 A 3
SATISFYING k ANONYMITY Z 2 A

Z₃A₃

k ANONYMOUS SUB LATTICE

SATISFYING

Z₃A₃

ON TWO ATTRIBUTES

Z₃A₃

^Z3^A2

^Z2^A3

k ANONYMITY

Z₃A₂

Z₂A₃

^Z3^A2

^Z2^A3

^Z3^A1

^Z2^A2

^Z1^A3

Z₃A₁

Z₂A₂

Z₁A₃

Z₂A₂

Z₁A₃

^Z3^A0

^Z2^A1

^Z1^A2

^Z0^A3

Z₂A₁

Z₁A₂

Z₀A₃

Z₃A₀

MINIMAL GENERALIZATIONS

Z₂A₀

Z₁A₁

Z₀A₂

SATISFYING k ANONYMITY

^Z2^A0

^Z1^A1

^Z0^A2

Z₁A₀

Z₀A₁

Z₁A₀

Z₀A₁

NOT SATISFYING

^Z0^A0

Z₀A₀

k ANONYMITY

(a)

2-attribute lattice

(b) k-anonymous portion

Figure 20.3: Domain generalization hierarchies over combinations of attributes

this tuple-based domain generalization hierarchy that preserves k-anonymity with the least amount of generalization. After such a node < A_i, Z_j > has been discovered, the privacy algorithm generalizes all ages to the level A_i and all ZIP codes to the level Z_j .

In practice, some of the tuples may need to be suppressed in order to prevent undesirably high levels of generalization. This is because these may represent outlier tuples that cannot be incorporated in any group without significantly increasing the generalization level. For example, an individual with an age of 125 may need to be suppressed because of the outlier value of this attribute. Therefore, one of the parameters to the algorithm is a threshold M axSup, which specifies the maximum number of tuples that can be suppressed. The goal is therefore to discover a node that is as low as possible in the lattice of Fig. 20.3a, such that k-anonymity is satisfied after suppressing at most M axSup tuples. The height of a node in the lattice is defined as its path distance in the lattice from the most specific level of representation. In the example of Fig. 20.3, the height of node < Z _i, A_j > is (i + j). A minimally generalized node may be defined as a node, for which the height is as small as possible. Therefore, in this example, one way of determining minimal generalizations, is to discover a k-anonymizable node < Z_i, A_j >, such that the height (i + j) is as small as possible.

When there are d attributes < Q_i₁ . . . Q_i_d >, the sum	d
	_k₌₁i_k over all attributes rep-

resents the height of that particular combination of generalizations. It is easy to see that any specialization of a node < Q_i₁ . . . Q_i_d > that does not satisfy k-anonymity will also not satisfy k-anonymity. Similarly, any generalization of a node satisfying k-anonymity will also satisfy k- anonymity. Therefore, the subgraph of the lattice satisfying k-anonymity and the subgraph violating k-anonymity are both connected subgraphs, and a border can be constructed between them. An example of such a border1 is illustrated in Fig. 20.3b, and the corresponding minimal generalizations are illustrated in the same figure. Note that the minimal generalization is not unique, and that two possible minimal generalizations

Z₂, A₂ > and < Z₁, A₃ > are possible in this example. The reason for using minimally generalized nodes is to maximize the utility of the data for analytical algorithms. Other

This border is for illustration purposes only, and does not correspond to any data set in this chapter.

20.3. PRIVACY-PRESERVING DATA PUBLISHING

675

more refined definitions can be used for quantifying utility that use the distribution of the attribute values more explicitly. The bibliographic notes contain pointers to some of these definitions.

Samarati’s algorithm uses a simple binary search over the lattice of domain generaliza-tion tuples. Let [0, h_max] represent the range of heights of the lattice. It is then checked whether any of the generalizations at level h_max/2 satisfies the k-anonymity constraint. If this is indeed the case, then the height h_max/4 is checked. Otherwise, the height 3 · h_max/4 is checked. This approach is repeated, until the lowest height at which a k-anonymous solu-tion exists, is found. All the corresponding domain generalizations are reported, and any of these can be used for transforming the data. An important step in Samarati’s algorithm is the process of using the original database to check whether a particular node in the lattice satisfies k-anonymity. However, a discussion of this step is omitted here, because similar steps are discussed below in the context of the Incognito algorithm.

20.3.1.2 Incognito

The lattice of Fig. 20.3 shares a number of conceptual similarities with the lattice of frequent itemset mining algorithms, as discussed in Chap. 4. Therefore, some of the anonymization algorithms for discovering full-domain generalization also have similar characteristics to those of frequent itemset mining algorithms. The Incognito algorithm leverages a number of principles from frequent pattern mining to eﬃciently discover the k-anonymous portion of the lattice.

An important observation is that the size of the lattice is exponentially related to the number of quasi-identifiers. This can lead to increasing computational complexity in many practical scenarios. While it has been shown by Meyerson and Williams [385] that optimal k-anonymization is NP-hard, it is possible to reduce the computational burden by careful exploration of the lattice. The Incognito algorithm is based on the observation that the k-anonymity of a subset of generalized attributes is a necessary (but not suﬃcient) con-dition for the k -anonymity of a superset of attributes with matching generalization levels of the common elements. Henceforth, this property will be referred to as attribute subset closure. This property is a specific case of the generalization property which states that any generalization of a k-anonymous node in the lattice will always be k-anonymous.

These properties can be used to both generate candidates and prune the search process in a manner that is similar to the Apriori algorithm for frequent itemset mining. Therefore, nodes that are not k-anonymous with respect to a set of attributes, can be discarded, together with their specializations in the lattice hierarchy. Furthermore, generalizations of subsets of attributes that do satisfy the k-anonymity constraint, do not need to be checked because they are guaranteed to be k-anonymous.

The Incognito approach uses a levelwise approach, in which the following steps are repeated iteratively, until the k-anonymous sublattice containing all d attributes has been constructed. The set F_i denotes the set of all sublattices on i attributes that satisfies k-anonymity. The algorithm starts by initializing F₁ to the portions of the single-attribute domain generalization hierarchies satisfying k-anonymity. This is quite simple, because sin-gle attribute hierarchies are paths. Thus, F₁ is simply the top portion of the path, such that each generalized attribute value contains at least k tuples. Subsequently, as in frequent pattern mining, the algorithm repeatedly generates candidate sublattices in C_i₊₁ by joining sublattices in F_i that have exactly (i − 1) attributes in common. The process of joining two sublattices will be described later. Note that C_i₊₁ is a set of candidate sublattices on (i + 1) attributes. Each of these sublattices is then pruned of some of its nodes, using an

676 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

Apriori-style approach. Specifically, nodes of sublattices in C _i₊₁ whose generalizations are not k-anonymous in F_i can be pruned. This step will also be described in detail later.

After candidate generation and pruning, the portion of each sublattice that satisfies k-anonymity is retained by checking the constituent nodes against the base data records. Thus, each sublattice in C_i₊₁ reduces further in size. At this point, the set C_i₊₁ has been transformed to the set F_i₊₁. Thus, the following steps are repeated for increasing values of the index i:

Generate C_i₊₁, the set of candidate sublattices on (i + 1) attributes. This is achieved by joining all pairs of k-anonymous sublattices in F_i that share (i − 1) attributes. The details of a join between a pair of sublattices will be described later.

Prune the nodes from each sublattice in C_i₊₁ that cannot possibly satisfy k-anonymity by using the attribute subset closure property with respect to the set of k-anonymous combinations in F_i. The details of how the nodes may be pruned from a sublattice, will be described later.

Check each node in each (already pruned) sublattice of C_i₊₁ against the base data, and remove those that do not satisfy k-anonymity. A node does not need to be checked, if one of its specializations already satisfies k-anonymity. This step transforms the set of candidate sublattices C_i₊₁ to the set of k-anonymous sublattices F_i₊₁ by removing the anonymity-violating sublattices.

If there are a total of d attributes, then the set F_d will contain a single sublattice of nodes satisfying k-anonymity. The nodes with the smallest height in this sublattice are reported. Note that the detailed implementation of the Incognito algorithm uses a slightly diﬀerent approach for actually tracking the sublattices, by tracking the lattice nodes and edges in separate tables. The i- dimensional tables containing the generalization levels of lattice nodes of F_i are joined on their (i − 1) common attributes to create the (i + 1)-dimensional tables containing the nodes of C _i₊₁. Subsequently, the lattice edges are added between the generated nodes based on the hierarchy relationships. Nevertheless, the simpler logical description provided here matches the Incognito algorithm.

Next, the details of the join and pruning operations will be discussed with the use of an example. In this case, three attributes will be used for greater clarity. As discussed earlier, let A_r and Z_r represent diﬀerent generalization levels of the age and ZIP code attributes, for varying values of the index r. Let P_r represent the generalization levels of an additional attribute corresponding to the profession. Higher values of the index r indicate a greater level of generalization. Consider the scenario where all three k-anonymous two-attribute sublattices on these three attributes are already available in F₂. It is possible to use any pair of sublattices from these three possibilities, in order to perform the join. This will result in a candidate sublattice on all three attributes.

Consider the case, where the sublattices on (ZIP code, Age) and (ZIP code, Profession) are joined. The nodes in the new candidate sublattice will now have three attributes (ZIP code, Profession, Age) instead of two. The nodes for the new candidate sublattice are constructed by joining the nodes of the two k-anonymous sublattices. A pair of nodes

Z_r, A_j > and < Z_s, P_l > will be joined, if and only if r = s. In other words, the gen-eralization level of the ZIP code attribute needs to be the same in both cases. This will result in the new node < Z_r, P_l, A_j >. In general, for pairs of nodes with k attributes, a join will be successfully executed, if and only if (a) they share (k − 1) attributes, and (b) the generalization levels of the (k − 1) common attributes are the same. An example of a join with two k-anonymous sublattices is illustrated in Fig. 20.4a.

20.3. PRIVACY-PRESERVING DATA PUBLISHING							677

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 389 390 391 392 393 394 395 396 ... 423