Data Mining: The Textbook




[Lattice diagrams omitted; only the labels recoverable from the figure are retained below.]

(a) Incognito join between two sub-lattices: k-anonymous sub-lattices on two attributes (nodes such as Z2A2, Z2A3, Z3A2, Z3A3 and Z2P2, Z2P3, Z3P1, Z3P2, Z3P3) are joined into a candidate sub-lattice on three attributes (nodes such as Z2P2A2 through Z3P3A3).

(b) Incognito pruning: the candidate sub-lattice on three attributes is pruned with the k-anonymous sub-lattice on two attributes (nodes P2A3, P3A2, P3A3), yielding the pruned candidate sub-lattice on three attributes.

Figure 20.4: Incognito joins and pruning


In the previous example, the sublattice for the profession–age combination was not used for the join. However, it is still useful for pruning. This is because, if a node < Pi, Aj > is not present in this sublattice, then any node of the form < Zm, Pi, Aj > will also not be k-anonymous. Therefore, such nodes can be removed from the constructed candidate sublattice together with their specializations. An example of a pruning step on the candidate sublattice is illustrated in Fig. 20.4b. This pruning is based on the attribute-subset closure property, and it is reminiscent of Apriori pruning in frequent itemset mining. As in the case of frequent itemset mining, all k-attribute subsets of each candidate (k + 1)-sublattice in Ck+1 need to be checked. If a node violates the closure property in any of these checks, then it is pruned.
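The subset-closure check at the heart of this pruning step is easy to sketch. The following fragment is a minimal illustration rather than the published Incognito code: candidate nodes and already-verified nodes are represented as hypothetical tuples of generalization labels (e.g., ("Z3", "P1", "A3")), and a candidate survives only if every projection onto one fewer attribute is known to be k-anonymous.

```python
from itertools import combinations

def prune_candidates(candidates, k_anonymous_nodes):
    """Apriori-style closure check (illustrative representation).

    candidates: set of (k+1)-attribute nodes, e.g. ("Z3", "P1", "A3"),
        where each entry names one attribute at some generalization level.
    k_anonymous_nodes: set of k-attribute nodes already known to be
        k-anonymous, e.g. ("Z3", "P1").
    """
    surviving = set()
    for node in candidates:
        # Every k-attribute projection of the candidate must be k-anonymous.
        if all(subset in k_anonymous_nodes
               for subset in combinations(node, len(node) - 1)):
            surviving.add(node)
    return surviving

# <Z3, P1, A3> is dropped because <P1, A3> is not known to be k-anonymous.
candidates = {("Z3", "P1", "A3"), ("Z3", "P2", "A3")}
anonymous = {("Z3", "P1"), ("Z3", "P2"), ("Z3", "A3"), ("P2", "A3")}
print(prune_candidates(candidates, anonymous))  # {('Z3', 'P2', 'A3')}
```

In the full algorithm, the specializations of a pruned node within the candidate sublattice are removed along with it, exactly as stated above.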

Finally, the generated nodes in Ck+1 need to be checked against the original database to determine whether they satisfy k-anonymity. For example, in order to determine whether < Z1, A1 > satisfies k-anonymity based on the value generalization in Fig. 20.1, one needs to determine the number of individuals satisfying each of the pairs of conditions such as (ZIP code = NY, 0 < Age ≤ 10), (ZIP code = NY, 10 < Age ≤ 20), (ZIP code = MA, 0 < Age ≤ 10), and so on. Therefore, for each node in the lattice, a vector of frequency values needs to be computed. This vector is also referred to as a frequency vector or frequency set. The process of frequency vector computation can be expensive because the original database may need to be scanned to determine the number of tuples satisfying these conditions. However, several strategies can be used to reduce the burden of computation. For example, if the frequency vector of < Z1, A1 > has already been computed, one can use roll-up to directly compute the frequency vector of the generalization < Z2, A1 > without actually scanning the database. This is because the frequency of the set (ZIP code = Northeastern US, 0 < Age ≤ 10) is the sum of the frequencies of (ZIP code = NY, 0 < Age ≤ 10), (ZIP code = NJ, 0 < Age ≤ 10), (ZIP code = MA, 0 < Age ≤ 10), and so on. The simplest approach is to use a breadth-first strategy on the lattice of each set of (k + 1) attributes, by determining the frequency vectors of specific (lower-level) nodes in the lattice before determining the frequency vectors of more general (higher-level) nodes. The frequency vectors of higher-level nodes can be computed efficiently from those of lower-level nodes by using the roll-up property.
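The roll-up shortcut can be sketched as follows. This is an illustrative fragment, not code from the book; the record layout, the age-range strings, and the mapping from Z1 ZIP values to their Z2 generalization are assumptions made for the example.

```python
from collections import Counter

# Assumed generalization map for the ZIP attribute: Z1 value -> Z2 value.
Z1_TO_Z2 = {"NY": "Northeastern US", "NJ": "Northeastern US", "MA": "Northeastern US"}

def frequency_vector_z1_a1(records):
    """One scan of the database: count tuples per (Z1 ZIP value, A1 age range)."""
    counts = Counter()
    for zip_code, age_range in records:
        counts[(zip_code, age_range)] += 1
    return counts

def roll_up_to_z2_a1(freq_z1_a1):
    """Derive the <Z2, A1> frequency vector from <Z1, A1> without rescanning."""
    counts = Counter()
    for (zip_code, age_range), freq in freq_z1_a1.items():
        counts[(Z1_TO_Z2[zip_code], age_range)] += freq
    return counts

records = [("NY", "0-10"), ("NY", "10-20"), ("NJ", "0-10"), ("MA", "0-10")]
fv_z1_a1 = frequency_vector_z1_a1(records)
fv_z2_a1 = roll_up_to_z2_a1(fv_z1_a1)
# fv_z2_a1[("Northeastern US", "0-10")] == 3, the sum of the NY, NJ, and MA counts.
```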


Note that a separate breadth-first search needs to be performed for each subset of (k + 1) attributes in Ck+1 to compute its frequency vectors. Furthermore, once a node has been identified by the breadth-first search to be k-anonymous, its generalizations in the lattice are guaranteed to be k-anonymous. Therefore, they are automatically marked as k-anonymous and are not explicitly checked. The original algorithm also supports a number of other optimizations, referred to as Incognito super-roots and Bottom-up precomputation. The bibliographic notes contain pointers to these methods.
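A bottom-up breadth-first sweep over one sub-lattice, with this monotonicity shortcut, might look as follows. This is a schematic sketch: the node representation and the two callbacks (generalizations_of and is_k_anonymous, which would internally use the frequency vectors described above) are placeholders rather than part of the original algorithm's interface.

```python
from collections import deque

def find_k_anonymous_nodes(bottom_nodes, generalizations_of, is_k_anonymous):
    """Breadth-first, bottom-up sweep of one generalization lattice (sketch).

    bottom_nodes: the most specific nodes of the lattice for one attribute subset.
    generalizations_of: function mapping a node to its direct generalizations.
    is_k_anonymous: predicate that checks a node's frequency vector against k.
    Once a node passes the check, all of its generalizations are marked
    k-anonymous without being checked explicitly (monotonicity).
    """
    anonymous, queue, seen = set(), deque(bottom_nodes), set(bottom_nodes)
    while queue:
        node = queue.popleft()
        if node in anonymous:
            continue  # already implied by a more specific node; nothing to check
        if is_k_anonymous(node):
            # Mark the node and everything more general than it.
            stack = [node]
            while stack:
                current = stack.pop()
                if current not in anonymous:
                    anonymous.add(current)
                    stack.extend(generalizations_of(current))
        else:
            # The node fails, but its generalizations may still be k-anonymous.
            for parent in generalizations_of(node):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return anonymous
```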


20.3.1.3 Mondrian Multidimensional k-Anonymity


One of the disadvantages of the methods discussed so far is that the domain generalization hierarchies for the various attributes are constructed independently as a preprocessing step. Thus, after the hierarchical discretization (domain generalization) for a numeric attribute has been fixed by the preprocessing step, it is utilized by the anonymization algorithm. This rigidity in the anonymization process creates inefficiencies in data representation when the various data attributes are correlated in multidimensional space. For example, the salary distribution for older individuals may be different from that for younger individuals. A preprocessed domain generalization hierarchy is unable to adjust to such attribute correlations in the data set. In general, the best trade-offs between privacy and utility are achieved when the multidimensional relationships among data points are leveraged in the anonymization process. In other words, the attribute ranges for each attribute of a data point X should be generated dynamically, depending on the specific multidimensional locality of X.


The Mondrian method generates multidimensional rectangular regions, each containing at least k data points. This is achieved by recursively dividing the bounding boxes with axis-parallel cuts, as long as each resulting region still contains at least k data points. This approach is not very different from the methodology used by many traditional index structures, such as kd-trees. An example of the partitioning induced by the Mondrian algorithm is illustrated in Fig. 20.5. In this case, a 5-anonymous partitioning is illustrated; thus, each group contains at least five data points. It is easy to see that the same attribute value is represented by different ranges in different portions of the data, in order to account for the varying density of different regions. It is this flexibility that gives Mondrian a more compact representation of the anonymized groups than the other methods.
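A minimal sketch of this recursive partitioning is shown below, assuming numeric quasi-identifier attributes. It implements only the strict variant with median cuts on the widest attribute; the published Mondrian algorithm also handles categorical attributes, the selection of allowable cuts, and a relaxed partitioning mode.

```python
import numpy as np

def mondrian_partition(points, k):
    """Recursive axis-parallel partitioning sketch (strict, numeric variant).

    points: array of shape (n, d) of quasi-identifier values, with n >= k.
    Returns a list of index arrays, one per anonymized group; every group
    contains at least k points. A region is split at the median of its
    widest attribute, but only if both halves still contain >= k points.
    """
    points = np.asarray(points, dtype=float)

    def recurse(idx):
        if len(idx) < 2 * k:  # cannot split without leaving a side below k
            return [idx]
        subset = points[idx]
        dim = np.argmax(subset.max(axis=0) - subset.min(axis=0))  # widest attribute
        median = np.median(subset[:, dim])
        left = idx[subset[:, dim] <= median]
        right = idx[subset[:, dim] > median]
        if len(left) < k or len(right) < k:  # degenerate cut (e.g., ties at the median)
            return [idx]
        return recurse(left) + recurse(right)

    return recurse(np.arange(len(points)))

# Example: a 5-anonymous partitioning of 2-D points, in the spirit of Fig. 20.5.
rng = np.random.default_rng(0)
groups = mondrian_partition(rng.random((40, 2)), k=5)
assert all(len(g) >= 5 for g in groups)
```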


The Mondrian algorithm dynamically maintains the set B of multidimensional generalizations that satisfy k-anonymity and cover the data set. The Mondrian algorithm starts with a rectangular box B containing all the data points. This represents the generalization


