Data Mining: The Textbook



[Figure: two panels, (a) horizontally partitioned data and (b) vertically partitioned data, showing Databases 1-4 owned by Chains 1-4 with attribute labels such as grocery, jewelry, women's apparel, shoes, and cosmetics]

Figure 20.6: Examples of horizontally and vertically partitioned data


20.5 Distributed Privacy


In distributed privacy-preserving data mining, the goal is to mine shared insights across multiple participants who own different portions of the data, without compromising the privacy of local statistics or data records. The key point is that the different participants may be partial or full adversaries or competitors, and may not wish to provide full access to their local data and statistics to one another. However, they might find it mutually beneficial to extract global insights over all the data owned by them.


The data may be partitioned either horizontally or vertically across the different participants. In horizontal partitioning, the data records owned by the different adversaries have the same attributes, but different adversaries own different portions of the database. For example, a set of supermarket chains may own similar data on customer buying behavior, but the different stores may show somewhat different transaction patterns because of factors specific to their particular business. In vertical partitioning, the different sites may contain different attributes for the same individuals. For example, consider a scenario in which a database contains transactions by various customers. A particular customer may buy different kinds of items at stores carrying complementary products such as jewelry, apparel, and cosmetics. In such cases, aggregate association analysis across the different participants can provide insights that cannot be inferred from any particular database. Examples of horizontally and vertically partitioned data are provided in Figs. 20.6a and b, respectively.
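As a rough illustration of the two layouts, the following Python sketch builds toy horizontally and vertically partitioned views of the same customer data; all attribute names and values here are invented for illustration.

```python
# Toy illustration of the two partitioning schemes (all values are invented).

# Horizontal partitioning: every chain holds records with the SAME attributes,
# but each chain owns a different subset of the customers.
chain1_db = [
    {"customer": "A", "grocery": 1, "jewelry": 0},
    {"customer": "B", "grocery": 1, "jewelry": 1},
]
chain2_db = [
    {"customer": "C", "grocery": 0, "jewelry": 1},
]

# Vertical partitioning: each site holds DIFFERENT attributes for the same customers.
grocery_site = {"A": {"grocery": 1}, "B": {"grocery": 1}, "C": {"grocery": 0}}
jewelry_site = {"A": {"jewelry": 0}, "B": {"jewelry": 1}, "C": {"jewelry": 1}}

# The (virtual) global table is the union of record sets in the horizontal case,
# and the join of the attribute sets on the customer key in the vertical case.
horizontal_global = chain1_db + chain2_db
vertical_global = {c: {**grocery_site[c], **jewelry_site[c]} for c in grocery_site}
print(vertical_global["B"])   # {'grocery': 1, 'jewelry': 1}
```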


At the most primitive level, the problem of distributed privacy-preserving data mining overlaps closely with a field of cryptography concerned with secure multi-party computation. In this field, functions are computed over inputs provided by multiple parties without actually sharing the inputs with one another. For example, in a two-party setting, Alice and Bob may have two inputs x and y, respectively, and may wish to compute the function f(x, y) without revealing x or y to each other. This problem can also be generalized to k parties computing a k-argument function h(x1, ..., xk). Many data mining algorithms may be viewed in the context of repetitive computations of primitive functions such as the scalar dot product, secure sum, and secure set union. For example, the scalar dot product of the binary representation of an itemset and a transaction can be used to determine whether or not that itemset is supported by that transaction. Similarly, scalar dot products can be used for similarity computations in clustering. To compute the function f(x, y) or h(x1, ..., xk), a protocol needs to be designed for exchanging information in such a way that the function is computed without compromising privacy.
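As a concrete (non-secure) illustration of the dot-product primitive, the sketch below checks whether a transaction supports an itemset by taking the dot product of their binary representations. The item universe and the example sets are hypothetical; a secure protocol would compute the same quantity without either party revealing its vector.

```python
# Support test via a scalar dot product over binary vectors (item universe is invented).
items = ["bread", "milk", "eggs", "butter"]          # assumed global item ordering

def to_binary(item_subset):
    """Binary (0/1) vector representation over the fixed item ordering."""
    return [1 if item in item_subset else 0 for item in items]

def supports(itemset, transaction):
    """The transaction supports the itemset iff the dot product of their binary
    representations equals the number of items in the itemset."""
    a, b = to_binary(itemset), to_binary(transaction)
    dot = sum(x * y for x, y in zip(a, b))
    return dot == len(itemset)

print(supports({"bread", "milk"}, {"bread", "milk", "eggs"}))   # True
print(supports({"bread", "butter"}, {"bread", "milk"}))         # False
```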


A key building block for many kinds of secure function evaluation is the 1 out of 2 oblivious transfer protocol. This protocol involves two parties: a sender and a receiver. The sender's input is a pair (x0, x1), and the receiver's input is a bit value σ ∈ {0, 1}. At the end of the process, the receiver learns only xσ, and the sender learns nothing. In other words, the sender does not learn the value of σ.


In the oblivious transfer protocol, the sender generates two encryption keys, K0 and K1, but the protocol ensures that the receiver knows only the decryption key for Kσ. The sender is able to generate these keys by using an encrypted input from the receiver, which encodes σ. This encoded input does not reveal the value of σ to the sender, but is sufficient to generate K0 and K1. The sender encrypts x0 with K0 and x1 with K1, and sends the encrypted data back to the receiver. At this point, the receiver can decrypt only xσ, since this is the only input for which he or she has the decryption key. The 1 out of 2 oblivious transfer protocol has been generalized to the k out of N case.
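The following toy Python sketch conveys the flavor of such a construction using a Diffie-Hellman-style key derivation, in which the receiver's choice of σ determines which of the two derived keys it can reproduce. The modulus, base, messages, and padding are chosen purely for readability; this is an illustrative sketch, not a secure implementation of any standard oblivious transfer protocol.

```python
import hashlib
import secrets

# Toy 1 out of 2 oblivious transfer sketch (NOT secure: illustrative parameters only).
P = 2**127 - 1        # a Mersenne prime, used here only as a readable toy modulus
G = 3                 # toy base; a real protocol would use a vetted group and generator

def h(x: int) -> bytes:
    """Derive a 32-byte key from a group element."""
    return hashlib.sha256(str(x).encode()).digest()

def xor(key: bytes, data: bytes) -> bytes:
    """One-time-pad style XOR; short messages are padded to the 32-byte key length."""
    return bytes(k ^ d for k, d in zip(key, data.ljust(32, b"\0")))

# Sender: publish A = G^a.
a = secrets.randbelow(P)
A = pow(G, a, P)

# Receiver: choice bit sigma; publish B, which hides sigma from the sender.
sigma = 1
b = secrets.randbelow(P)
B = pow(G, b, P) if sigma == 0 else (A * pow(G, b, P)) % P

# Sender: derive one key per message; the receiver can reproduce exactly one of
# them. Encrypt both messages and send back both ciphertexts.
x0, x1 = b"answer if sigma=0", b"answer if sigma=1"
k0 = h(pow(B, a, P))                          # matches the receiver's key iff sigma == 0
A_inv = pow(A, P - 2, P)                      # modular inverse of A (P is prime)
k1 = h(pow((B * A_inv) % P, a, P))            # matches the receiver's key iff sigma == 1
c0, c1 = xor(k0, x0), xor(k1, x1)

# Receiver: can reproduce only the key for the chosen message.
k_sigma = h(pow(A, b, P))
recovered = xor(k_sigma, c1 if sigma == 1 else c0)
print(recovered.rstrip(b"\0"))                # b'answer if sigma=1'
```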


The oblivious transfer protocol is a basic building block, and it can be used to compute several data mining primitives related to vector distances. Another important protocol used by frequent pattern mining algorithms is the secure set union protocol. This protocol allows the union of sets to be computed in a distributed way, without revealing the actual sources of the constituent elements. This is particularly useful in frequent pattern mining algorithms, because the locally large itemsets at the different sites need to be aggregated. The key idea in these methods is to pad the locally large itemsets at each site with a sufficient number of fake itemsets, so that the true locally large itemsets at each site are disguised. Furthermore, it can be shown that this protocol can be generalized to compute different kinds of functions for various data mining problems on both horizontally and vertically partitioned data. The bibliographic notes contain pointers to surveys on these techniques.
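A minimal sketch of the padding idea appears below. It deliberately omits the cryptographic machinery of the real protocols (such as commutative encryption and passing the sets around a ring of sites) and only illustrates how fake itemsets hide the true locally large itemsets before the union is formed. The item names, sites, and counts are invented.

```python
import random

# Sketch of the "disguise with fake itemsets" idea used in secure set union for
# distributed frequent pattern mining (illustrative only; no encryption shown).
ITEM_UNIVERSE = ["bread", "milk", "eggs", "butter", "jam", "tea"]

def pad_with_fakes(locally_large, num_fakes, rng):
    """A site hides its true locally large itemsets among random fake itemsets."""
    disguised = set(locally_large)
    while len(disguised) < len(locally_large) + num_fakes:
        size = rng.randint(1, 3)
        disguised.add(frozenset(rng.sample(ITEM_UNIVERSE, size)))
    return disguised

rng = random.Random(7)
site1 = {frozenset({"bread", "milk"}), frozenset({"tea"})}
site2 = {frozenset({"bread", "milk"}), frozenset({"jam"})}

# The pooled candidate set no longer reveals which itemsets are genuine or which
# site contributed them. In the full algorithm, the candidates are then checked
# against (securely summed) global support counts, which eliminates the fakes
# and the globally infrequent itemsets.
candidate_union = pad_with_fakes(site1, 4, rng) | pad_with_fakes(site2, 4, rng)
print(sorted(tuple(sorted(s)) for s in candidate_union))
```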


20.6 Summary

Privacy-preserving data mining can be executed at different stages of the information processing pipeline, such as data collection, data publication, output publication, or distributed data sharing. The only known method for privacy protection at data collection time is the randomization method. In this method, additive noise is incorporated into the data at collection time. Aggregate reconstructions of the data are then used for mining.


Privacy-preserving data publishing is typically performed using a group-based approach. In this approach, the sensitive attributes are treated differently from the attributes that are combined to construct quasi-identifiers. Only the latter types of attributes are perturbed, in order to prevent identification of the subjects of the data records. Numerous models, such as k-anonymity, ℓ-diversity, and t-closeness, are used for anonymization. The eventual goal of all these methods is to prevent the release of sensitive information about individuals. When the dimensionality of the data increases, privacy preservation becomes very difficult without a complete loss of utility.


In some cases, the output of data mining applications, such as association rule mining and query processing, may lead to the release of sensitive information. Therefore, in many cases, the output of these applications may need to be restricted in order to prevent the release of sensitive information.



