Data Mining: The Textbook




Figure 19.9: Label sparsity issues in collective classification


also uses this principle. Normalization will yield more balanced clusters in networks with widely varying density across the network.

19.4 Collective Classification


In many social networking applications, labels may be associated with nodes. For example, consider a social networking application in which it is desirable to determine all individuals interested in golf. The labels of a small number of actors may already be available. It is desirable to use these available labels to classify the nodes whose labels are unknown.


The solution to this problem is crucially dependent on the notion of homophily. Because nodes with similar properties are usually connected, it is reasonable to assume that this is also true of node labels. A simple solution is therefore to examine the k labeled nodes in the proximity of a given node and report the majority label. This approach is, in fact, the network analog of a nearest-neighbor classifier. However, such an approach is generally not possible in collective classification because of the sparsity of node labels. An example of a network is illustrated in Fig. 19.9, in which the two classes are labeled A and B. The remaining nodes are unlabeled. The test node in Fig. 19.9 is clearly closer to instances of A in the network structure, but no labeled node is directly connected to the test instance.
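The following Python sketch makes this neighbor-voting baseline concrete; the adjacency-dictionary representation and the function name are illustrative assumptions rather than constructs from the text. Note that it fails in exactly the situation of Fig. 19.9, where no labeled node is directly connected to the test node.

from collections import Counter

def majority_label(graph, labels, node):
    # graph: dict mapping each node to the set of its neighbors
    # labels: dict mapping the (few) labeled nodes to their class labels
    neighbor_labels = [labels[v] for v in graph[node] if v in labels]
    if not neighbor_labels:
        return None  # label sparsity: no labeled node is directly connected
    return Counter(neighbor_labels).most_common(1)[0][0]  # majority label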


Thus, it is evident that one must not only use the direct connections to labeled nodes, but also the indirect connections through unlabeled nodes. Therefore, collective classification in networks is always performed in a transductive semisupervised setting, in which the test instances and training instances are classified jointly. In fact, as discussed in Sect. 11.6.3 of Chap. 11, collective classification methods can be used for semisupervised classification of any data type by transforming the data into a similarity graph. Thus, the collective




classification problem is important not only from the perspective of social network analysis, but also for semisupervised classification of any data type.
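As a concrete illustration of such a transformation, the sketch below builds a weighted k-nearest-neighbor similarity graph from a multidimensional data set using scikit-learn's kneighbors_graph; the Gaussian-kernel weighting and parameter values are illustrative choices, not requirements of the method.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def similarity_graph(X, k=10, sigma=1.0):
    # X: data matrix with one row per record
    D = kneighbors_graph(X, n_neighbors=k, mode='distance')
    D.data = np.exp(-(D.data ** 2) / (2 * sigma ** 2))  # distances -> similarities
    return D.maximum(D.T)  # symmetrize to obtain an undirected weighted graph

The resulting weighted adjacency structure can then be handed to any collective classification method, which labels the unlabeled records transductively.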

19.4.1 Iterative Classification Algorithm


The Iterative Classification Algorithm (ICA) is one of the earliest collective classification algorithms in the literature and has been applied to a wide variety of data domains. The algorithm has the capability to use content associated with the nodes for classification. This is important because many social networks have text content associated with the nodes in the form of user posts. Furthermore, in cases where this framework is used for semisupervised classification of relational data with similarity graphs (cf. Sect. 11.6.3 of Chap. 11), the relational features continue to be available at the nodes for more effective classification.









Algorithm ICA(Graph G = (N, A), Weights: [w_ij], Node Class Labels: C,
              Base Classifier: A, Number of Iterations: T)
begin
  repeat
    Extract link features at each node with current training data;
    Train classifier A using both link and content features of
      current training data and predict labels of test nodes;
    Make (predicted) labels of most "certain" n_t/T test nodes final,
      and add these nodes to training data, while removing them
      from test data;
  until T iterations;
end

Figure 19.10: The iterative classification algorithm (ICA)

Consider the (undirected) network G = (N, A) whose class labels are drawn from {1, . . . , k}. Each edge (i, j) ∈ A is associated with the weight w_ij. Furthermore, the content X_i is available at node i in the form of a multidimensional feature vector. The total number of nodes is denoted by n, of which n_t are unlabeled test nodes.


An important step of the ICA algorithm is to derive a set of link features in addition to the content features available in X_i. The most important link features correspond to the distribution of the classes in the immediate neighborhood of the node. Therefore, a feature is generated for each class, containing the fraction of the node's incident nodes belonging to that class. For each node i, the adjacent node j is weighted by w_ij when computing its credit to the relevant class. It is also possible, in principle, to derive other link features based on structural properties of the graph, such as the degree of the node, its PageRank value, the number of closed triangles involving the node, or connectivity features. Such link features can be derived on the basis of an application-specific understanding of the network data set.
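A minimal sketch of this feature construction is shown below, under assumed representations (adjacency dictionaries carrying edge weights, and integer class labels 0, . . . , k − 1); it is not the book's own code.

import numpy as np

def link_features(adj, labels, node, num_classes):
    # adj[node]: dict mapping each neighbor j to the edge weight w_ij
    # labels: dict of currently known labels (training and finalized test nodes)
    counts = np.zeros(num_classes)
    for j, w_ij in adj[node].items():
        if j in labels:
            counts[labels[j]] += w_ij  # credit class of neighbor j, weighted by w_ij
    total = counts.sum()
    return counts / total if total > 0 else counts  # weighted class fractions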


The basic ICA is structured as a meta-algorithm, in which a base classifier A is leveraged within an iterative framework. Many different base classifiers have been used in different implementations, such as the naive Bayes classifier, the logistic regression classifier, and a neighborhood voting classifier. The main requirement is that these classifiers should be able to output a numeric score that quantifies the likelihood of a node belonging to a particular class. While the framework is independent of the specific choice of classifier, the use of the naive Bayes classifier is particularly common because its numeric score can be interpreted as a probability. Therefore, the following discussion assumes that the algorithm A is instantiated to the naive Bayes classifier.

The link and content features are used to train the naive Bayes classifier. For many nodes, it is difficult to robustly estimate important class-specific features, such as the fractional presence of the different classes in their neighborhood. This is a direct result of label sparsity, and it makes the class predictions of such nodes unreliable. Therefore, an iterative approach is used to augment the training data set. In each iteration, n_t/T (test) node labels are made "certain" by the approach, where T is a user-defined parameter controlling the maximum number of iterations. The test nodes for which the Bayes classifier exhibits the highest class membership probabilities are selected to be made final. These labeled test nodes are then added to the training data and removed from the test data in subsequent iterations.
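Putting these pieces together, the loop of Fig. 19.10 can be sketched as follows, with scikit-learn's GaussianNB standing in for the base classifier A and link_features as sketched earlier; the representations and helper names are illustrative assumptions.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def ica(adj, content, labels, test_nodes, num_classes, T):
    # labels: dict of known labels, augmented as test labels are finalized
    # content[i]: multidimensional content feature vector of node i
    test = set(test_nodes)
    batch = max(1, len(test) // T)  # finalize about n_t/T labels per iteration
    for _ in range(T):
        if not test:
            break
        # Link features are recomputed because the set of training labels has grown.
        feats = {i: np.concatenate([content[i],
                                    link_features(adj, labels, i, num_classes)])
                 for i in set(labels) | test}
        train = list(labels)
        clf = GaussianNB()
        clf.fit([feats[i] for i in train], [labels[i] for i in train])
        test_list = list(test)
        probs = clf.predict_proba([feats[i] for i in test_list])
        # Finalize the most "certain" predictions (highest class probability).
        for idx in np.argsort(-probs.max(axis=1))[:batch]:
            node = test_list[idx]
            labels[node] = clf.classes_[np.argmax(probs[idx])]
            test.remove(node)
    return labels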




