Data Mining: The Textbook



[Figure: two panels, (a) horizontally partitioned data and (b) vertically partitioned data, showing Databases 1-4 owned by Chains 1-4 with attribute labels such as grocery, jewelry, women's apparel, shoes, and cosmetics]

Figure 20.6: Examples of horizontally and vertically partitioned data


20.5 Distributed Privacy


In distributed privacy-preserving data mining, the goal is to mine shared insights across multiple participants who own different portions of the data, without compromising the privacy of local statistics or data records. The key point is that the different participants may be partial or full adversaries or competitors, and may not wish to provide full access to their local data and statistics to one another. However, they might find it mutually beneficial to extract global insights over all the data owned by them.


The data may be partitioned either horizontally or vertically across the different participants. In horizontal partitioning, the data records owned by the different adversaries have the same attributes, but different adversaries own different portions of the database. For example, a set of supermarket chains may own similar data on customer buying behavior, but the different stores may show somewhat different transaction patterns because of factors specific to their particular business. In vertical partitioning, the different sites may contain different attributes for the same individuals. For example, consider a scenario in which a database contains transactions by various customers. A particular customer may buy different kinds of items at stores carrying complementary products such as jewelry, apparel, and cosmetics. In such cases, aggregate association analysis across the different participants can provide insights that cannot be inferred from any particular database. Examples of horizontally and vertically partitioned data are provided in Figs. 20.6a and b, respectively.
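As a rough illustration of the two layouts, the following Python sketch builds toy horizontally and vertically partitioned views of the same customer data; all attribute names and values here are invented for illustration.

```python
# Toy illustration of the two partitioning schemes (all values are invented).

# Horizontal partitioning: every chain holds records with the SAME attributes,
# but each chain owns a different subset of the customers.
chain1_db = [
    {"customer": "A", "grocery": 1, "jewelry": 0},
    {"customer": "B", "grocery": 1, "jewelry": 1},
]
chain2_db = [
    {"customer": "C", "grocery": 0, "jewelry": 1},
]

# Vertical partitioning: each site holds DIFFERENT attributes for the same customers.
grocery_site = {"A": {"grocery": 1}, "B": {"grocery": 1}, "C": {"grocery": 0}}
jewelry_site = {"A": {"jewelry": 0}, "B": {"jewelry": 1}, "C": {"jewelry": 1}}

# The (virtual) global table is the union of record sets in the horizontal case,
# and the join of the attribute sets on the customer key in the vertical case.
horizontal_global = chain1_db + chain2_db
vertical_global = {c: {**grocery_site[c], **jewelry_site[c]} for c in grocery_site}
print(vertical_global["B"])   # {'grocery': 1, 'jewelry': 1}
```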


At the most primitive level, the problem of distributed privacy-preserving data mining overlaps closely with a field of cryptography concerned with secure multi-party computation. In this field, functions are computed over inputs provided by multiple parties without actually sharing the inputs with one another. For example, in a two-party setting, Alice and Bob may have two inputs x and y, respectively, and may wish to compute the function f(x, y) without revealing x or y to each other. This problem can also be generalized to k parties computing a k-argument function h(x1, ..., xk). Many data mining algorithms may be viewed in the context of repetitive computations of primitive functions such as the scalar dot product, secure sum, and secure set union. For example, the scalar dot product of the binary representation of an itemset and a transaction can be used to determine whether or not that itemset is supported by that transaction. Similarly, scalar dot products can be used for similarity computations in clustering. To compute the function f(x, y) or h(x1, ..., xk), a protocol needs to be designed for exchanging information in such a way that the function is computed without compromising privacy.
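As a concrete (non-secure) illustration of the dot-product primitive, the sketch below checks whether a transaction supports an itemset by taking the dot product of their binary representations. The item universe and the example sets are hypothetical; a secure protocol would compute the same quantity without either party revealing its vector.

```python
# Support test via a scalar dot product over binary vectors (item universe is invented).
items = ["bread", "milk", "eggs", "butter"]          # assumed global item ordering

def to_binary(item_subset):
    """Binary (0/1) vector representation over the fixed item ordering."""
    return [1 if item in item_subset else 0 for item in items]

def supports(itemset, transaction):
    """The transaction supports the itemset iff the dot product of their binary
    representations equals the number of items in the itemset."""
    a, b = to_binary(itemset), to_binary(transaction)
    dot = sum(x * y for x, y in zip(a, b))
    return dot == len(itemset)

print(supports({"bread", "milk"}, {"bread", "milk", "eggs"}))   # True
print(supports({"bread", "butter"}, {"bread", "milk"}))         # False
```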


A key building block for many kinds of secure function evaluation is the 1 out of 2 oblivious transfer protocol. This protocol involves two parties: a sender and a receiver. The sender's input is a pair (x0, x1), and the receiver's input is a bit value σ ∈ {0, 1}. At the end of the process, the receiver learns only xσ, and the sender learns nothing. In other words, the sender does not learn the value of σ.


In the oblivious transfer protocol, the sender generates two encryption keys, K0 and K1, but the protocol ensures that the receiver knows only the decryption key for Kσ. The sender is able to generate these keys by using an encrypted input from the receiver, which encodes σ. This encoded input does not reveal the value of σ to the sender, but is sufficient to generate K0 and K1. The sender encrypts x0 with K0 and x1 with K1, and sends the encrypted data back to the receiver. At this point, the receiver can decrypt only xσ, since this is the only input for which he or she has the decryption key. The 1 out of 2 oblivious transfer protocol has been generalized to the k out of N case.
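The following toy Python sketch conveys the flavor of such a construction using a Diffie-Hellman-style key derivation, in which the receiver's choice of σ determines which of the two derived keys it can reproduce. The modulus, base, messages, and padding are chosen purely for readability; this is an illustrative sketch, not a secure implementation of any standard oblivious transfer protocol.

```python
import hashlib
import secrets

# Toy 1 out of 2 oblivious transfer sketch (NOT secure: illustrative parameters only).
P = 2**127 - 1        # a Mersenne prime, used here only as a readable toy modulus
G = 3                 # toy base; a real protocol would use a vetted group and generator

def h(x: int) -> bytes:
    """Derive a 32-byte key from a group element."""
    return hashlib.sha256(str(x).encode()).digest()

def xor(key: bytes, data: bytes) -> bytes:
    """One-time-pad style XOR; short messages are padded to the 32-byte key length."""
    return bytes(k ^ d for k, d in zip(key, data.ljust(32, b"\0")))

# Sender: publish A = G^a.
a = secrets.randbelow(P)
A = pow(G, a, P)

# Receiver: choice bit sigma; publish B, which hides sigma from the sender.
sigma = 1
b = secrets.randbelow(P)
B = pow(G, b, P) if sigma == 0 else (A * pow(G, b, P)) % P

# Sender: derive one key per message; the receiver can reproduce exactly one of
# them. Encrypt both messages and send back both ciphertexts.
x0, x1 = b"answer if sigma=0", b"answer if sigma=1"
k0 = h(pow(B, a, P))                          # matches the receiver's key iff sigma == 0
A_inv = pow(A, P - 2, P)                      # modular inverse of A (P is prime)
k1 = h(pow((B * A_inv) % P, a, P))            # matches the receiver's key iff sigma == 1
c0, c1 = xor(k0, x0), xor(k1, x1)

# Receiver: can reproduce only the key for the chosen message.
k_sigma = h(pow(A, b, P))
recovered = xor(k_sigma, c1 if sigma == 1 else c0)
print(recovered.rstrip(b"\0"))                # b'answer if sigma=1'
```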


The oblivious transfer protocol is a basic building block, and it can be used to compute several data mining primitives related to vector distances. Another important protocol used by frequent pattern mining algorithms is the secure set union protocol. This protocol allows the union of sets to be computed in a distributed way, without revealing the actual sources of the constituent elements. This is particularly useful in frequent pattern mining algorithms, because the locally large itemsets at the different sites need to be aggregated. The key idea in these methods is to pad the locally large itemsets at each site with a sufficient number of fake itemsets, so that the true locally large itemsets at each site are disguised. Furthermore, it can be shown that this protocol can be generalized to compute different kinds of functions for various data mining problems on both horizontally and vertically partitioned data. The bibliographic notes contain pointers to surveys on these techniques.
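A minimal sketch of the padding idea appears below. It deliberately omits the cryptographic machinery of the real protocols (such as commutative encryption and passing the sets around a ring of sites) and only illustrates how fake itemsets hide the true locally large itemsets before the union is formed. The item names, sites, and counts are invented.

```python
import random

# Sketch of the "disguise with fake itemsets" idea used in secure set union for
# distributed frequent pattern mining (illustrative only; no encryption shown).
ITEM_UNIVERSE = ["bread", "milk", "eggs", "butter", "jam", "tea"]

def pad_with_fakes(locally_large, num_fakes, rng):
    """A site hides its true locally large itemsets among random fake itemsets."""
    disguised = set(locally_large)
    while len(disguised) < len(locally_large) + num_fakes:
        size = rng.randint(1, 3)
        disguised.add(frozenset(rng.sample(ITEM_UNIVERSE, size)))
    return disguised

rng = random.Random(7)
site1 = {frozenset({"bread", "milk"}), frozenset({"tea"})}
site2 = {frozenset({"bread", "milk"}), frozenset({"jam"})}

# The pooled candidate set no longer reveals which itemsets are genuine or which
# site contributed them. In the full algorithm, the candidates are then checked
# against (securely summed) global support counts, which eliminates the fakes
# and the globally infrequent itemsets.
candidate_union = pad_with_fakes(site1, 4, rng) | pad_with_fakes(site2, 4, rng)
print(sorted(tuple(sorted(s)) for s in candidate_union))
```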


20.6 Summary

Privacy-preserving data mining can be executed at different stages of the information processing pipeline, such as data collection, data publication, output publication, or distributed data sharing. The only known method for privacy protection at data collection time is the randomization method. In this method, additive noise is incorporated into the data at collection time. Aggregate reconstructions of the data are then used for mining.


Privacy-preserving data publishing is typically performed using a group-based approach. In this approach, the sensitive attributes are treated differently from the attributes that are combined to construct quasi-identifiers. Only the latter types of attributes are perturbed, in order to prevent identification of the subjects of the data records. Numerous models, such as k-anonymity, ℓ-diversity, and t-closeness, are used for anonymization. The eventual goal of all these methods is to prevent the release of sensitive information about individuals. When the dimensionality of the data increases, privacy preservation becomes very difficult without a complete loss of utility.


In some cases, the output of data mining applications, such as association rule mining and query processing, may lead to the release of sensitive information. Therefore, in many cases, the output of these applications may need to be restricted in order to prevent the release of sensitive information.



