Data Mining: The Textbook




$$\hat{F}_X(a) = \frac{\int_{w=-\infty}^{a} f_X(w \mid X+Y=z_1)\, dw}{\int_{w=-\infty}^{\infty} f_X(w \mid X+Y=z_1)\, dw} \qquad (20.1)$$

The conditional in the aforementioned equation corresponds to the fact that the sum of the original data value and the perturbation is equal to z1. Note that the expression fX(w|X+Y = z1) can be expressed in terms of the unconditional densities of X and Y, as follows:



$$f_X(w \mid X+Y=z_1) = \frac{f_Y(z_1 - w) \cdot f_X(w)}{\int_{w=-\infty}^{\infty} f_Y(z_1 - w) \cdot f_X(w)\, dw} \qquad (20.2)$$

This expression uses the fact that the perturbation Y is independent of X. By substituting the aforementioned expression for fX(w|X + Y = z1) in the right-hand side of Eq. 20.1, the following expression is obtained for the cumulative density of X:





$$\hat{F}_X(a) = \frac{\int_{w=-\infty}^{a} f_Y(z_1 - w) \cdot f_X(w)\, dw}{\int_{w=-\infty}^{\infty} f_Y(z_1 - w) \cdot f_X(w)\, dw} \qquad (20.3)$$

The expression for F̂X(a) was derived using a single observation z1, and needs to be generalized to the case of n different observations z1 . . . zn. This can be achieved by averaging the previous expression over the n different values:



$$\hat{F}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{\int_{w=-\infty}^{a} f_Y(z_i - w) \cdot f_X(w)\, dw}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw} \qquad (20.4)$$

The corresponding density distribution can be obtained by differentiating F̂X(a). This differentiation results in the removal of the integral sign from the numerator and the corresponding instantiation of w to a. Since the denominator is a constant, it remains unaffected by the differentiation. Therefore, the following is true:



$$\hat{f}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{f_Y(z_i - a) \cdot f_X(a)}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw} \qquad (20.5)$$
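As a concrete special case (not worked out in the chapter, and added here only for illustration), suppose the perturbation Y is uniformly distributed on [−α, α], so that fY(u) = 1/(2α) for |u| ≤ α and 0 otherwise. The factor 1/(2α) then cancels between the numerator and denominator, and only the observations within α of a contribute to the estimate:

$$\hat{f}_X(a) = \frac{1}{n} \sum_{i:\ |z_i - a| \leq \alpha} \frac{f_X(a)}{\int_{z_i - \alpha}^{z_i + \alpha} f_X(w)\, dw}$$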







The aforementioned equation contains the density function fX(·) on both sides. This circularity can be resolved naturally with the use of an iterative approach. The iterative approach initializes the estimate of the distribution fX(·) to the uniform distribution. Subsequently, the estimate of this distribution is continuously updated as follows:


Set f̂X(·) to be the uniform distribution;
repeat
    Update $\hat{f}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{f_Y(z_i - a) \cdot \hat{f}_X(a)}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot \hat{f}_X(w)\, dw}$;
until convergence


So far, it has been described how to compute f̂X(a) for a particular value of a. In order to generalize this approach, the idea is to discretize the range of the random variable X into k intervals, denoted by [l1, u1] . . . [lk, uk]. It is assumed that the density distribution is uniform over the discretized intervals. For each such interval [li, ui], the density distribution is evaluated at the midpoint a = (li + ui)/2 of the interval. Thus, in each iteration, k different values of a are used. The algorithm is terminated when the distribution does not
change significantly over successive steps of the algorithm. A variety of methods can be used to compare the two distributions, such as the χ2 test. The simplest approach is to examine the average change in the density values at the midpoints of the discretized intervals over successive iterations. While this algorithm is known to perform effectively in practice, it has not been proven to converge to an optimal solution. An expectation maximization (EM) method was proposed in a later work [28], which provably converges to the optimal solution.
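The discretized iterative procedure is straightforward to sketch in code. The following fragment is a minimal illustration and not part of the original text; the function name, the choice of NumPy, the Riemann-sum approximation of the integrals, the renormalization step, and the convergence test based on the average change at the midpoints are all assumptions made for the example.

```python
import numpy as np

def reconstruct_distribution(z, noise_pdf, lo, hi, k=50, tol=1e-6, max_iter=1000):
    """Iterative Bayes reconstruction of the density of X from the perturbed
    observations z_i = x_i + y_i, following the update rule of Eq. 20.5.

    z         : 1-D array of perturbed values z_1 ... z_n
    noise_pdf : vectorized density f_Y of the (publicly known) perturbation
    [lo, hi]  : assumed support of X, discretized into k equal-width intervals
    Returns the k interval midpoints and the estimated density at those points.
    """
    edges = np.linspace(lo, hi, k + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])        # a = (l_i + u_i) / 2
    width = edges[1] - edges[0]

    f_x = np.full(k, 1.0 / (hi - lo))            # initialize to the uniform density
    f_y = noise_pdf(z[:, None] - mids[None, :])  # f_Y(z_i - a), shape (n, k)

    for _ in range(max_iter):
        # Denominator of Eq. 20.5 for each observation i:
        # integral of f_Y(z_i - w) * f_X(w) dw, approximated by a Riemann sum.
        denom = (f_y * f_x).sum(axis=1) * width
        denom = np.maximum(denom, np.finfo(float).tiny)
        f_new = (f_y * f_x / denom[:, None]).mean(axis=0)
        f_new /= f_new.sum() * width             # keep the estimate a proper density
        if np.abs(f_new - f_x).mean() < tol:     # average change at the midpoints
            return mids, f_new
        f_x = f_new
    return mids, f_x

# Hypothetical usage: X ~ Normal(50, 10), perturbed by uniform noise on [-20, 20].
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=5000)
z = x + rng.uniform(-20, 20, size=5000)
mids, f_hat = reconstruct_distribution(z, lambda u: (np.abs(u) <= 20) / 40.0, lo=0, hi=100)
```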


20.2.2 Leveraging Aggregate Distributions for Data Mining


The aggregate distributions determined by the algorithm can be leveraged for a variety of data mining problems, such as clustering, classification, and collaborative filtering. This is because each of these data mining problems can be implemented with aggregate statistics of the data, rather than the original data records. In the case of the classification problem, the probability distributions of each of the classes can be reconstructed from the data. These distributions can then be used directly in the context of a naive Bayes classifier, as discussed in Chap. 10. Other classifiers, such as decision trees, can also be modified to work with aggregate distributions. The key is to use the aggregate distributions in order to design the split criterion of the decision tree. The bibliographic notes contain pointers to data mining algorithms that use the randomization method. The approach cannot be used effectively for data mining problems such as outlier detection that are dependent on individual data record values, rather than aggregate values. In general, outlier analysis is a difficult problem for most private data sets because of the tendency of outliers to reveal private information.
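As one concrete way to use such reconstructed aggregates for classification (a sketch under assumptions, not the book's prescribed implementation), the per-class, per-attribute densities produced by a reconstruction routine such as the one above can directly drive a naive Bayes scoring rule:

```python
import numpy as np

def classify_from_aggregates(class_priors, class_densities, record):
    """Naive Bayes scoring that uses only aggregate (reconstructed) statistics.

    class_priors    : dict mapping class label -> estimated prior P(c)
    class_densities : dict mapping class label -> list of (mids, f_hat) pairs,
                      one pair per attribute, e.g. from reconstruct_distribution()
    record          : sequence of attribute values for the record to classify
    """
    scores = {}
    for c, prior in class_priors.items():
        log_score = np.log(prior)
        for (mids, f_hat), value in zip(class_densities[c], record):
            idx = int(np.argmin(np.abs(mids - value)))   # nearest discretized midpoint
            log_score += np.log(max(f_hat[idx], 1e-12))  # guard against zero densities
        scores[c] = log_score
    return max(scores, key=scores.get)
```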


20.3 Privacy-Preserving Data Publishing

Privacy-preserving data publishing is distinct from privacy-preserving data collection, because it is assumed that all the records are already available to a trusted party, who might be the current owner of the data. This party then wants to release (or publish) this data for analysis. For example, a hospital might wish to release anonymized records about patients to study the effectiveness of various treatment alternatives.


This form of data release is quite useful, because virtually any data mining algorithm can be used on the released data. To determine sensitive information about an individual, there are two main pieces of information that an attacker (or adversary) must possess.





  1. Who does this data record pertain to? While a straightforward way to determine the identity is to use the identifying attribute (e.g., Social Security Number), such attributes are usually stripped from the data before release. As will be discussed later, these straightforward methods of sanitization are usually not sufficient, because attackers may use other attributes, such as the age and ZIP code, to make linkage attacks.




  2. In addition to identifying attributes, data records also contain sensitive attributes that most individuals do not wish to share with others. For example, when a hospital releases medical data, the records might contain sensitive disease-related attributes.

Different attributes in a data set may play different roles in either facilitating identification or facilitating sensitive information release. There are three main types of attributes:





Table 20.1: Example of a data table

SSN            Age   ZIP Code   Disease
012-345-6789   24    10598      HIV
823-627-9231   37    90210      Hepatitis C
987-654-3210   26    10547      HIV
382-827-8264   38    90345      Hepatitis C
847-872-7276   36    89119      Diabetes
422-061-0089   25    02139      HIV


  1. Explicit identifiers: These are attributes that explicitly identify an individual. For example, the Social Security Number (SSN) of an individual can be considered an explicit identifier. Because this attribute is almost always removed in the data sanitization process, it is not relevant to the study of privacy algorithms.




  2. Pseudo-identifier or quasi-identifier (QID): These are attributes that do not explicitly identify an individual in isolation, but can nevertheless be used in combination to identify an individual by joining them with publicly available information, such as voter registration rolls. This kind of attack is referred to as a linkage attack. Examples of such attributes include the Age and ZIP code. Strictly speaking, quasi-identifiers refer to the specific combination of attributes used to make a linkage attack, rather than the individual attributes.




  3. Sensitive attribute: These are attributes that are considered private by most individuals. For example, in a medical data set, individuals would not like information about their diseases to be known publicly. In fact, many laws in the USA, such as the Health Insurance Portability and Accountability Act (HIPAA), explicitly forbid the release of such information, especially when the sensitive attributes can be linked back to specific individuals.

Most of the discussion in this chapter will be restricted to quasi-identifiers and sensitive attributes. To illustrate the significance of these attribute types, an example will be used. In Table 20.1, the medical records of a set of individuals are illustrated. The SSN attribute is an explicit identifier that can be utilized to identify an individual directly. Such directly identifying information will almost always be removed from a data set before release. However, the impact of attributes such as the age and the ZIP code on identification is quite significant. While these attributes do not directly identify an individual, they provide very useful hints, when combined with other publicly available information. For example, it is possible for a small geographic region, such as a ZIP code, to contain only one individual of a specific gender, race, and date of birth. When combined with publicly available voter registration rolls, one might be able to identify an individual from these attributes. Such a combination of publicly available attributes is referred to as a quasi-identifier.


To understand the power of quasi-identifiers, consider a snapshot of the voter registration rolls illustrated in Table 20.2. Even in cases where the SSN is removed from Table 20.1 before release, it is possible to join the two tables with the use of the age and ZIP code attributes. This will provide a list of the possible matches for each data record. For example, Joy and Sue are the only two individuals in the voter registration rolls matching an individual with HIV in the medical release of Table 20.1. Therefore, one can tell with 50 % certainty that Joy and Sue have HIV. This is not desirable, especially when an adversary has other background medical information about Joy or Sue to further narrow down the possibilities. Similarly, William is the only individual in the voter registration rolls who matches an individual with hepatitis C in the medical release. In cases where only one data record in the voter registration rolls matches the particular combination of age and ZIP code, sensitive medical conditions about that individual may be fully compromised. This approach is referred to as a linkage attack. Most anonymization algorithms focus on preventing identity disclosure, rather than explicitly hiding the sensitive attributes. Thus, only the attributes that can be combined to construct quasi-identifiers are changed or specified approximately in the data release, whereas sensitive attributes are released in their exact form.

Table 20.2: Example of a snapshot of fictitious voter registration rolls

Name         Age   ZIP Code
Mary A.      38    90345
John S.      36    89119
Ann L.       31    02139
Jack M.      57    10562
Joy M.       26    10547
Victor B.    46    90345
Peter P.     25    02139
Diana X.     24    10598
William W.   37    90210
Sue G.       26    10547
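The linkage attack described above is essentially a join on the quasi-identifier. The following fragment, which simply transcribes Table 20.1 (with the SSN column removed) and Table 20.2, is an illustrative sketch of its mechanics rather than part of the original text:

```python
# Join the released medical table to the public voter registration rolls
# on the (Age, ZIP code) quasi-identifier.
medical = [  # (Age, ZIP, Disease), from Table 20.1 without the SSN column
    (24, "10598", "HIV"), (37, "90210", "Hepatitis C"), (26, "10547", "HIV"),
    (38, "90345", "Hepatitis C"), (36, "89119", "Diabetes"), (25, "02139", "HIV"),
]
voters = [  # (Name, Age, ZIP), from Table 20.2
    ("Mary A.", 38, "90345"), ("John S.", 36, "89119"), ("Ann L.", 31, "02139"),
    ("Jack M.", 57, "10562"), ("Joy M.", 26, "10547"), ("Victor B.", 46, "90345"),
    ("Peter P.", 25, "02139"), ("Diana X.", 24, "10598"), ("William W.", 37, "90210"),
    ("Sue G.", 26, "10547"),
]
for age, zipc, disease in medical:
    matches = [name for name, v_age, v_zip in voters if (v_age, v_zip) == (age, zipc)]
    if matches:  # each matched voter is implicated with probability 1/len(matches)
        print(f"{disease}: {matches} (confidence {1.0 / len(matches):.0%} each)")
```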

Many privacy-preserving data publishing algorithms assume that the quasi-identifiers are drawn out of a set of attributes that are not sensitive, because they can only be used by an adversary by performing joins with (nonsensitive) publicly available information. This assumption may, however, not always be reasonable when an adversary has (sensitive) background information about a target at hand. Adversaries are often familiar with their targets, and they can be assumed to have background knowledge about at least a subset of the sensitive attributes. In a medical application with multiple disease attributes, knowledge about a subset of these attributes may reveal the identity of the subject of the record. Similarly, in a movie collaborative filtering application, where anonymized ratings are released, it may be possible to obtain information about a particular user's ratings on a subset of movies, through personal interaction or other rating sources. If this combination is unique to the individual, then the other ratings of the individual are compromised as well. Thus, sensitive attributes also need to be perturbed, when background knowledge is available. Much of the work in the privacy literature assumes a rigid distinction between the role of publicly available attributes (from which the quasi-identifiers are constructed) and that of the sensitive attributes. In other words, sensitive attributes are not perturbed because it is assumed that revealing them does not incur the risk of a linkage attack with publicly available information. There are, however, a few algorithms that do not make this distinction. Such algorithms generally provide better privacy protection in the presence of background information.


In this section, several models for group-based anonymization, such as k-anonymity, ℓ-diversity, and t-closeness, will be introduced. While the recent models, such as ℓ-diversity, have certain advantages over the k-anonymity model, a good understanding of k-anonymity



is crucial in any study of privacy-preserving data publishing. This is because the basic framework for most of the group-based anonymization models was first proposed in the context of the k-anonymity model. Furthermore, many algorithms for other models, such as ℓ-diversity, build upon algorithms for k-anonymization.


20.3.1 The k-Anonymity Model


The k-anonymity model is one of the oldest models for data anonymization, and it is credited with introducing the concept of quasi-identifiers and an understanding of their impact on data privacy. The basic idea in k-anonymization methods is to allow release of the sensitive attributes, while distorting only the attributes that are available through public sources of information. Thus, even though the sensitive attributes have been released, they cannot be linked to an individual through publicly available records. Before discussing the anonymization algorithms, some of the most common techniques for data distortion will be discussed.





  1. Suppression: In this approach, some of the attribute values are suppressed. Depending on the algorithm used, the suppression can be done in a variety of ways. For example, one might omit some of the age or ZIP code attribute values from a few selected data records in Table 20.1. Alternatively, one might completely omit the entire record for a specific individual (row suppression) or the age attribute from all individuals (column suppression). Row suppression is often utilized to remove outlier records because such records are difficult to anonymize. Column suppression is commonly used to remove highly identifying attributes, or explicit identifiers, such as the SSN.




  2. Generalization: In the case of generalization, the attributes are specified approximately in terms of a particular range. For example, instead of specifying Age = 26 and Location (ZIP Code) = 10547 for one of the entries of Table 20.1, one might generalize it to Age ∈ [25, 30] and Location (State) = New York. By specifying the attributes approximately, it becomes more difficult for an adversary to perform linkage attacks. While numeric data can be generalized to specific ranges, the generalization of categorical data is somewhat more complicated. Typically, a generalization hierarchy of the categorical attribute values needs to be provided, for use in the anonymization process. For example, a ZIP code may be generalized to a city, which in turn may be generalized to a state, and so on. There is no unique way of specifying a domain hierarchy. Typically, it needs to be semantically meaningful, and it is specified by a domain expert as a part of the input to the anonymization process. An example of a generalization taxonomy of categorical attributes for the location attribute of Table 20.1 is provided in Fig. 20.1. This hierarchy of attribute values has a tree structure, and is referred to as a value generalization hierarchy. The notations A0 . . . A3 and Z0 . . . Z4 in Fig. 20.1 denote the domain generalizations at different levels of granularity. The corresponding domain generalization hierarchies are also illustrated in Fig. 20.1 by the single path between Z0 . . . Z4 and A0 . . . A3. A brief illustrative sketch of such coarsening appears after this list.




  3. Synthetic data generation: In this case, a synthetic data set is generated that mimics the statistical properties of the original data, at the group level. Such an approach can provide better privacy, because it is more difficult to map synthetic data records to particular groups of records. On the other hand, the data records are no longer truthful because they are synthetically generated.




  4. Specification as probabilistic and uncertain databases: In this case, one might specify an individual data record as a probability distribution function. This is different from the
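As promised in item 2 above, the following is a minimal sketch of generalization-based coarsening of a quasi-identifier. It is illustrative only: the age interval width and the ZIP-to-state mapping are assumptions standing in for the domain-expert hierarchy of Fig. 20.1, not part of the original text.

```python
# Hypothetical value/domain generalization: age is coarsened to a fixed-width
# interval, and the ZIP code is climbed up the location hierarchy to a state.
ZIP_TO_STATE = {"10598": "New York", "10547": "New York", "90210": "California",
                "90345": "California", "89119": "Nevada", "02139": "Massachusetts"}

def generalize(age, zip_code, age_width=5):
    lower = (age // age_width) * age_width         # e.g. 26 -> [25, 30)
    age_range = (lower, lower + age_width)
    state = ZIP_TO_STATE.get(zip_code, "Unknown")  # one level up the hierarchy
    return age_range, state

print(generalize(26, "10547"))   # ((25, 30), 'New York')
```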

[Figure 20.1: Value and domain generalization hierarchies (AGE and ZIP code attributes); figure not reproduced in this extract.]