Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə393/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   389   390   391   392   393   394   395   396   ...   423
1-Data Mining tarjima

Z3A3
















k ANONYMOUS SUB LATTICE
















SATISFYING

Z3A3







ON TWO ATTRIBUTES




























Z3A3










Z3A2

Z2A3

k ANONYMITY


























































Z3A2

Z2A3




Z3A2

Z2A3




Z3A1

Z2A2

Z1A3



















Z3A1




Z2A2

Z1A3































Z2A2

Z1A3


































Z3A0

Z2A1

Z1A2




Z0A3

Z2A1

Z1A2




Z0A3






















Z3A0










MINIMAL GENERALIZATIONS


































Z2A0

Z1A1

Z0A2
















SATISFYING k ANONYMITY




Z2A0




Z1A1

Z0A2











































Z1A0

Z0A1




Z1A0

Z0A1


































NOT SATISFYING













Z0A0










Z0A0

k ANONYMITY





































(a)

2-attribute lattice

(b) k-anonymous portion




Figure 20.3: Domain generalization hierarchies over combinations of attributes


this tuple-based domain generalization hierarchy that preserves k-anonymity with the least amount of generalization. After such a node < Ai, Zj > has been discovered, the privacy algorithm generalizes all ages to the level Ai and all ZIP codes to the level Zj .

In practice, some of the tuples may need to be suppressed in order to prevent undesirably high levels of generalization. This is because these may represent outlier tuples that cannot be incorporated in any group without significantly increasing the generalization level. For example, an individual with an age of 125 may need to be suppressed because of the outlier value of this attribute. Therefore, one of the parameters to the algorithm is a threshold M axSup, which specifies the maximum number of tuples that can be suppressed. The goal is therefore to discover a node that is as low as possible in the lattice of Fig. 20.3a, such that k-anonymity is satisfied after suppressing at most M axSup tuples. The height of a node in the lattice is defined as its path distance in the lattice from the most specific level of representation. In the example of Fig. 20.3, the height of node < Z i, Aj > is (i + j). A minimally generalized node may be defined as a node, for which the height is as small as possible. Therefore, in this example, one way of determining minimal generalizations, is to discover a k-anonymizable node < Zi, Aj >, such that the height (i + j) is as small as possible.





When there are d attributes < Qi1 . . . Qid >, the sum

d




k=1 ik over all attributes rep-




resents the height of that particular combination of generalizations. It is easy to see that any specialization of a node < Qi1 . . . Qid > that does not satisfy k-anonymity will also not satisfy k-anonymity. Similarly, any generalization of a node satisfying k-anonymity will also satisfy k- anonymity. Therefore, the subgraph of the lattice satisfying k-anonymity and the subgraph violating k-anonymity are both connected subgraphs, and a border can be constructed between them. An example of such a border1 is illustrated in Fig. 20.3b, and the corresponding minimal generalizations are illustrated in the same figure. Note that the minimal generalization is not unique, and that two possible minimal generalizations



  • Z2, A2 > and < Z1, A3 > are possible in this example. The reason for using minimally generalized nodes is to maximize the utility of the data for analytical algorithms. Other







  • This border is for illustration purposes only, and does not correspond to any data set in this chapter.

20.3. PRIVACY-PRESERVING DATA PUBLISHING

675

more refined definitions can be used for quantifying utility that use the distribution of the attribute values more explicitly. The bibliographic notes contain pointers to some of these definitions.


Samarati’s algorithm uses a simple binary search over the lattice of domain generaliza-tion tuples. Let [0, hmax] represent the range of heights of the lattice. It is then checked whether any of the generalizations at level hmax/2 satisfies the k-anonymity constraint. If this is indeed the case, then the height hmax/4 is checked. Otherwise, the height 3 · hmax/4 is checked. This approach is repeated, until the lowest height at which a k-anonymous solu-tion exists, is found. All the corresponding domain generalizations are reported, and any of these can be used for transforming the data. An important step in Samarati’s algorithm is the process of using the original database to check whether a particular node in the lattice satisfies k-anonymity. However, a discussion of this step is omitted here, because similar steps are discussed below in the context of the Incognito algorithm.


20.3.1.2 Incognito


The lattice of Fig. 20.3 shares a number of conceptual similarities with the lattice of frequent itemset mining algorithms, as discussed in Chap. 4. Therefore, some of the anonymization algorithms for discovering full-domain generalization also have similar characteristics to those of frequent itemset mining algorithms. The Incognito algorithm leverages a number of principles from frequent pattern mining to efficiently discover the k-anonymous portion of the lattice.


An important observation is that the size of the lattice is exponentially related to the number of quasi-identifiers. This can lead to increasing computational complexity in many practical scenarios. While it has been shown by Meyerson and Williams [385] that optimal k-anonymization is NP-hard, it is possible to reduce the computational burden by careful exploration of the lattice. The Incognito algorithm is based on the observation that the k-anonymity of a subset of generalized attributes is a necessary (but not sufficient) con-dition for the k -anonymity of a superset of attributes with matching generalization levels of the common elements. Henceforth, this property will be referred to as attribute subset closure. This property is a specific case of the generalization property which states that any generalization of a k-anonymous node in the lattice will always be k-anonymous.


These properties can be used to both generate candidates and prune the search process in a manner that is similar to the Apriori algorithm for frequent itemset mining. Therefore, nodes that are not k-anonymous with respect to a set of attributes, can be discarded, together with their specializations in the lattice hierarchy. Furthermore, generalizations of subsets of attributes that do satisfy the k-anonymity constraint, do not need to be checked because they are guaranteed to be k-anonymous.


The Incognito approach uses a levelwise approach, in which the following steps are repeated iteratively, until the k-anonymous sublattice containing all d attributes has been constructed. The set Fi denotes the set of all sublattices on i attributes that satisfies k-anonymity. The algorithm starts by initializing F1 to the portions of the single-attribute domain generalization hierarchies satisfying k-anonymity. This is quite simple, because sin-gle attribute hierarchies are paths. Thus, F1 is simply the top portion of the path, such that each generalized attribute value contains at least k tuples. Subsequently, as in frequent pattern mining, the algorithm repeatedly generates candidate sublattices in Ci+1 by joining sublattices in Fi that have exactly (i − 1) attributes in common. The process of joining two sublattices will be described later. Note that Ci+1 is a set of candidate sublattices on (i + 1) attributes. Each of these sublattices is then pruned of some of its nodes, using an


676 CHAPTER 20. PRIVACY-PRESERVING DATA MINING




Apriori-style approach. Specifically, nodes of sublattices in C i+1 whose generalizations are not k-anonymous in Fi can be pruned. This step will also be described in detail later.

After candidate generation and pruning, the portion of each sublattice that satisfies k-anonymity is retained by checking the constituent nodes against the base data records. Thus, each sublattice in Ci+1 reduces further in size. At this point, the set Ci+1 has been transformed to the set Fi+1. Thus, the following steps are repeated for increasing values of the index i:





  1. Generate Ci+1, the set of candidate sublattices on (i + 1) attributes. This is achieved by joining all pairs of k-anonymous sublattices in Fi that share (i − 1) attributes. The details of a join between a pair of sublattices will be described later.




  1. Prune the nodes from each sublattice in Ci+1 that cannot possibly satisfy k-anonymity by using the attribute subset closure property with respect to the set of k-anonymous combinations in Fi. The details of how the nodes may be pruned from a sublattice, will be described later.




  1. Check each node in each (already pruned) sublattice of Ci+1 against the base data, and remove those that do not satisfy k-anonymity. A node does not need to be checked, if one of its specializations already satisfies k-anonymity. This step transforms the set of candidate sublattices Ci+1 to the set of k-anonymous sublattices Fi+1 by removing the anonymity-violating sublattices.

If there are a total of d attributes, then the set Fd will contain a single sublattice of nodes satisfying k-anonymity. The nodes with the smallest height in this sublattice are reported. Note that the detailed implementation of the Incognito algorithm uses a slightly different approach for actually tracking the sublattices, by tracking the lattice nodes and edges in separate tables. The i- dimensional tables containing the generalization levels of lattice nodes of Fi are joined on their (i − 1) common attributes to create the (i + 1)-dimensional tables containing the nodes of C i+1. Subsequently, the lattice edges are added between the generated nodes based on the hierarchy relationships. Nevertheless, the simpler logical description provided here matches the Incognito algorithm.


Next, the details of the join and pruning operations will be discussed with the use of an example. In this case, three attributes will be used for greater clarity. As discussed earlier, let Ar and Zr represent different generalization levels of the age and ZIP code attributes, for varying values of the index r. Let Pr represent the generalization levels of an additional attribute corresponding to the profession. Higher values of the index r indicate a greater level of generalization. Consider the scenario where all three k-anonymous two-attribute sublattices on these three attributes are already available in F2. It is possible to use any pair of sublattices from these three possibilities, in order to perform the join. This will result in a candidate sublattice on all three attributes.


Consider the case, where the sublattices on (ZIP code, Age) and (ZIP code, Profession) are joined. The nodes in the new candidate sublattice will now have three attributes (ZIP code, Profession, Age) instead of two. The nodes for the new candidate sublattice are constructed by joining the nodes of the two k-anonymous sublattices. A pair of nodes





  • Zr, Aj > and < Zs, Pl > will be joined, if and only if r = s. In other words, the gen-eralization level of the ZIP code attribute needs to be the same in both cases. This will result in the new node < Zr, Pl, Aj >. In general, for pairs of nodes with k attributes, a join will be successfully executed, if and only if (a) they share (k − 1) attributes, and (b) the generalization levels of the (k − 1) common attributes are the same. An example of a join with two k-anonymous sublattices is illustrated in Fig. 20.4a.

20.3. PRIVACY-PRESERVING DATA PUBLISHING

677








Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   389   390   391   392   393   394   395   396   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin