Z3A3
|
|
Z3P3A3
|
|
|
|
|
|
|
|
Z3A2
|
Z2A3
|
|
|
|
|
|
Z3P3A2 Z3P2A3
|
Z2P3A3
|
|
|
Z2A2
|
Z1A3
|
|
|
|
|
+
|
JOIN
|
Z3P1A3
|
Z2P3A2Z2P2A3
|
|
|
Z3P3
|
Z3P2A2
|
|
|
|
|
|
|
Z3P2
|
Z2P3
|
Z3P1A2
|
Z2P2A2
|
|
Z3P1
|
Z2P2
|
|
CANDIDATE SUB LATTICE
|
|
|
ON THREE ATTRIBUTES
|
|
k ANONYMOUS SUB LATTICES
|
|
|
|
ON TWO ATTRIBUTES
|
|
|
|
Incognito join between two sub-lattices
Z3P3A3
|
|
|
|
|
Z3P3A3
|
|
|
Z3P3A2 Z3P2A3
|
Z2P3A3
|
|
P3A3
|
|
|
|
|
|
|
|
Z3P3A2
|
Z3P2A3 Z2P3A3
|
|
|
|
|
|
P3A2
|
P2A3
|
|
|
|
Z3P2A2
|
Z3P1A3
|
Z2P3A2
|
Z2P2A3
|
|
|
|
|
|
|
|
PRUNE WITH k ANONYMOUS
|
Z2P3A2
|
Z2P2A3
|
|
|
|
|
SUB LATTICE ON TWO ATTRIBUTES
|
PRUNED CANDIDATE ON
|
|
|
|
|
|
|
|
|
Z3P1A2
|
Z2P2A2
|
|
|
THREE ATTRIBUTES
|
|
|
|
|
|
|
|
|
|
CANDIDATE SUB LATTICE
ON THREE ATTRIBUTES
Incognito pruning
Figure 20.4: Incognito joins and pruning
In the previous example, the sublattice for the profession–age combination was not used for the join. However, it is still useful for pruning. This is because, if a node < Pi, Aj > is not present in this sublattice, then any node of the form < Zm , Pi, Aj > will also not be k-anonymous. Therefore, such nodes can be removed from the constructed candidate sublattice together with their specializations. An example of a pruning step on the candidate sublattice is illustrated in Fig. 20.4b. This pruning is based on the attribute-subset closure property, and it is reminiscent of Apriori pruning in frequent itemset mining. As in the case of frequent itemset mining, all k-attribute subsets of each candidate (k + 1)- sublattice in Ck +1 need to be checked. If a node violates the closure property in any of these checks, then it is pruned.
Finally, the generated nodes in Ck+1 need to checked against the original database to determine whether they satisfy k-anonymity. For example, in order to determine whether
Z1, A1 > satisfies k-anonymity based on the value generalization in Fig. 20.1, one needs to determine the number of individuals satisfying each of the pairs of condi-tions such as (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NY, 10 < Age ≤ 20), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. Therefore, for each node in the lattice,
678 CHAPTER 20. PRIVACY-PRESERVING DATA MINING
a vector of frequency values need to be computed. This vector is also referred to as a frequency vector or frequency set. The process of frequency vector computation can be expensive because the original database may need to be scanned to determine the number of tuples satisfying these conditions. However, several strategies can be used to reduce the burden of computation. For example, if the frequency vector of < Z1, A1 > has already been computed, one can use roll -up to directly compute the frequency vectors of the generalization < Z2, A1 > without actually scanning the database. This is because the frequency of the set (ZIP code ∈ Northeastern US, 0 < Age ≤ 10) is the sum of the frequencies of (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NJ, 0 < Age ≤ 10), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. The simplest approach is to use a breadth-first strategy on the lattice of each set of (k + 1) attributes, by determining the frequency vectors of specific (lower-level) nodes in the lattice before determining the frequency vectors of more general (higher-level) nodes. The frequency vectors of higher-level nodes can be computed efficiently from those of lower-level nodes by using the roll-up property.
Note that a separate breadth-first search needs to be performed for each subset of (k + 1) attributes in Ck +1 to compute its frequency vectors. Furthermore, once a node has been identified by the breadth-first search to be k-anonymous, its generalizations in the lattice are guaranteed to be k-anonymous. Therefore, they are automatically marked as k-anonymous and are not explicitly checked. The original algorithm also supports a number of other optimizations, referred to as Incognito super-roots and Bottom-up precomputation. The bibliographic notes contain pointers to these methods.
20.3.1.3 Mondrian Multidimensional k-Anonymity
One of the disadvantages of the methods discussed so far is that the domain generalization hierarchies for various attributes are constructed independently as a preprocessing step. Thus, after the hierarchical discretization (domain generalization) for a numeric attribute has been fixed by the preprocessing step, it is utilized by the anonymization algorithm. This rigidity in the anonymization process creates inefficiencies in data representation, when the various data attributes are correlated in multidimensional space. For example, the salary distribution for older individuals may be different from that of younger individuals. A pre-processed domain generalization hierarchy is unable to adjust to such attribute correlations in the data set. In general, the best trade-offs between privacy and utility are achieved when the multidimensional relationships among data points are leveraged in the anonymization process. In other words, the attribute ranges for each attribute in a data point X should be generated in a dynamic way depending on the specific multidimensional locality of X.
The Mondrian method generates multidimensional rectangular regions, containing at least k data points. This is achieved by recursively dividing the bounding boxes with axis-parallel cuts, until each region contains no more than k data points. This approach is not very different from the methodology used by many traditional index structures, such as k d-trees. An example of the partitioning induced by the Mondrian algorithm is illustrated in Fig. 20.5. In this case, a 5-anonymous partitioning is illustrated. Thus, each group contains at least five data points. It is easy to see that the same attribute value is represented by different ranges in different portions of the data, in order to account for the varying density of different regions. It is this flexibility that gives Mondrian a more compact representation of the anonymized groups than the other methods.
The Mondrian algorithm dynamically maintains the set B of multidimensional gen-eralizations that satisfy k-anonymity and cover the data set. The Mondrian algorithm starts with a rectangular box B of all the data points. This represents the generalization
20.3. PRIVACY-PRESERVING DATA PUBLISHING
|
679
|
Dostları ilə paylaş: |