Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	394/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 390 391 392 393 394 395 396 397 ... 423

1-Data Mining tarjima

ON THREE ATTRIBUTES k ANONYMOUS SUB LATTICES
PRUNE WITH k ANONYMOUS Z 2 P 3 A 2
SUB LATTICE ON TWO ATTRIBUTES PRUNED CANDIDATE ON
CANDIDATE SUB LATTICE ON THREE ATTRIBUTES

Z₃A₃			Z₃P₃A₃


Z₃A₂	Z₂A₃
			Z₃P₃A₂ Z₃P₂A₃			Z₂P₃A₃
	Z₂A₂		Z₁A₃
	+		JOIN	Z₃P₁A₃		Z₂P₃A₂Z₂P₂A₃
	Z₃P₃		Z₃P₂A₂

^Z3^P2			^Z2^P3	Z₃P₁A₂		Z₂P₂A₂
Z₃P₁		Z₂P₂			CANDIDATE SUB LATTICE
					ON THREE ATTRIBUTES
k ANONYMOUS SUB LATTICES
ON TWO ATTRIBUTES

Incognito join between two sub-lattices

^Z3^P3^A3

Z₃P₃A₂ Z₃P₂A₃

Z₂P₃A₃

^P3^A3

^Z3^P3^A2

Z₃P₂A₃ Z₂P₃A₃

^P3^A2

^P2^A3

Z₃P₂A₂

Z₃P₁A₃

Z₂P₃A₂

Z₂P₂A₃

PRUNE WITH k ANONYMOUS

Z₂P₃A₂

Z₂P₂A₃

SUB LATTICE ON TWO ATTRIBUTES

PRUNED CANDIDATE ON

Z₃P₁A₂

Z₂P₂A₂

THREE ATTRIBUTES

CANDIDATE SUB LATTICE

ON THREE ATTRIBUTES

Incognito pruning

Figure 20.4: Incognito joins and pruning

In the previous example, the sublattice for the profession–age combination was not used for the join. However, it is still useful for pruning. This is because, if a node < P_i, A_j > is not present in this sublattice, then any node of the form < Z_m , P_i, A_j > will also not be k-anonymous. Therefore, such nodes can be removed from the constructed candidate sublattice together with their specializations. An example of a pruning step on the candidate sublattice is illustrated in Fig. 20.4b. This pruning is based on the attribute-subset closure property, and it is reminiscent of Apriori pruning in frequent itemset mining. As in the case of frequent itemset mining, all k-attribute subsets of each candidate (k + 1)- sublattice in C_k ₊₁ need to be checked. If a node violates the closure property in any of these checks, then it is pruned.

Finally, the generated nodes in C_k₊₁ need to checked against the original database to determine whether they satisfy k-anonymity. For example, in order to determine whether

Z₁, A₁ > satisfies k-anonymity based on the value generalization in Fig. 20.1, one needs to determine the number of individuals satisfying each of the pairs of condi-tions such as (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NY, 10 < Age ≤ 20), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. Therefore, for each node in the lattice,

678 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

a vector of frequency values need to be computed. This vector is also referred to as a frequency vector or frequency set. The process of frequency vector computation can be expensive because the original database may need to be scanned to determine the number of tuples satisfying these conditions. However, several strategies can be used to reduce the burden of computation. For example, if the frequency vector of < Z₁, A₁ > has already been computed, one can use roll -up to directly compute the frequency vectors of the generalization < Z₂, A₁ > without actually scanning the database. This is because the frequency of the set (ZIP code ∈ Northeastern US, 0 < Age ≤ 10) is the sum of the frequencies of (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NJ, 0 < Age ≤ 10), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. The simplest approach is to use a breadth-first strategy on the lattice of each set of (k + 1) attributes, by determining the frequency vectors of specific (lower-level) nodes in the lattice before determining the frequency vectors of more general (higher-level) nodes. The frequency vectors of higher-level nodes can be computed eﬃciently from those of lower-level nodes by using the roll-up property.

Note that a separate breadth-first search needs to be performed for each subset of (k + 1) attributes in C_k ₊₁ to compute its frequency vectors. Furthermore, once a node has been identified by the breadth-first search to be k-anonymous, its generalizations in the lattice are guaranteed to be k-anonymous. Therefore, they are automatically marked as k-anonymous and are not explicitly checked. The original algorithm also supports a number of other optimizations, referred to as Incognito super-roots and Bottom-up precomputation. The bibliographic notes contain pointers to these methods.

20.3.1.3 Mondrian Multidimensional k-Anonymity

One of the disadvantages of the methods discussed so far is that the domain generalization hierarchies for various attributes are constructed independently as a preprocessing step. Thus, after the hierarchical discretization (domain generalization) for a numeric attribute has been fixed by the preprocessing step, it is utilized by the anonymization algorithm. This rigidity in the anonymization process creates ineﬃciencies in data representation, when the various data attributes are correlated in multidimensional space. For example, the salary distribution for older individuals may be diﬀerent from that of younger individuals. A pre-processed domain generalization hierarchy is unable to adjust to such attribute correlations in the data set. In general, the best trade-oﬀs between privacy and utility are achieved when the multidimensional relationships among data points are leveraged in the anonymization process. In other words, the attribute ranges for each attribute in a data point X should be generated in a dynamic way depending on the specific multidimensional locality of X.

The Mondrian method generates multidimensional rectangular regions, containing at least k data points. This is achieved by recursively dividing the bounding boxes with axis-parallel cuts, until each region contains no more than k data points. This approach is not very diﬀerent from the methodology used by many traditional index structures, such as k d-trees. An example of the partitioning induced by the Mondrian algorithm is illustrated in Fig. 20.5. In this case, a 5-anonymous partitioning is illustrated. Thus, each group contains at least five data points. It is easy to see that the same attribute value is represented by diﬀerent ranges in diﬀerent portions of the data, in order to account for the varying density of diﬀerent regions. It is this flexibility that gives Mondrian a more compact representation of the anonymized groups than the other methods.

The Mondrian algorithm dynamically maintains the set B of multidimensional gen-eralizations that satisfy k-anonymity and cover the data set. The Mondrian algorithm starts with a rectangular box B of all the data points. This represents the generalization

20.3. PRIVACY-PRESERVING DATA PUBLISHING

679

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 390 391 392 393 394 395 396 397 ... 423