Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	386/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 382 383 384 385 386 387 388 389 ... 423

1-Data Mining tarjima

j:(i,j)∈A,j is active

The node i becomes active in a step when I(i) ≥ θ_i. This process is repeated until no further nodes can be activated. The total influence f (S) may be measured as the number of nodes activated by a given seed set S. The influence f (S) of a given seed set S is typically computed with simulation methods.

19.6.2 Independent Cascade Model

In the aforementioned linear threshold model, once a node becomes active, it has multiple chances to influence its neighbors. The random variable θ_i was associated with a node, in the form of a threshold. On the other hand, in the independent cascade model, after a node becomes active, it obtains only a single chance to activate its neighbors, with propagation probabilities associated with the edges. The propagation probability associated with an edge is denoted by p_ij . In each iteration, only the newly active nodes are allowed to influence their neighbors, that have not already been activated. For a given node j, each of the edges (i, j) joining it to its newly active neighbors i flips a coin independently with success probability p_ij . If the coin toss for edge (i, j) results in a success, then the node j is activated. If node j is activated, it will get a single chance in the next iteration to influence its neighbors. In the event that no nodes are newly activated in an iteration, the algorithm terminates. The influence function value is equal to the number of active nodes at termination. Because nodes are allowed to influence their neighbors only once over the course of the algorithm, a coin is tossed for each edge at most once over the course of the algorithm.

19.6.3 Influence Function Evaluation

Both the linear threshold model and the independent cascade model are designed to compute the influence function f (S) with the use of a model. The estimation of f (S) is typically accomplished with simulation.

For example, consider the case of the linear threshold model. For a given seed node set S, one can use a random number generator to set the thresholds at the nodes. After the thresholds have been set, the active nodes can be labeled using any deterministic graph-search algorithm starting from the seed nodes in S and progressively activating nodes when the threshold condition is satisfied. The computation can be repeated over diﬀerent sets of randomly generated thresholds, and the results may be averaged to obtain more robust estimates.

In the independent cascade model, a diﬀerent simulation may be used. A coin with probability p_ij may be flipped for each edge. The edge is designated as live if the coin toss was a success. It can be shown that a node will eventually be activated by the independent

658 CHAPTER 19. SOCIAL NETWORK ANALYSIS

cascade model, when a path of live edges exists from at least one node in S to it. This can be used to estimate the size of the (final) active set by simulation. The computation is repeated over diﬀerent runs and the results are averaged.

The proof that the linear threshold model and the independent cascade model are sub-modular optimization problems can be found in pointers included in the bibliographic notes. However, this property is not specific to these models. Submodularity is a very natural conse-quence of the laws of diminishing returns, as applied to the incremental impact of individual influence in larger groups. As a result, most reasonable models for influence analysis will satisfy submodularity.

19.7 Summary

Social networks have become increasingly popular in recent years, because of their ability to connect geographically and culturally diverse participants. A significant amount of data is created because of the actions of social network participants. Much of this data are structural, in the form of relationships between diﬀerent individuals.

Social network structures exhibit a number of typical properties, because of the natural dynamics of their formation. The most important similarity-based properties include triadic closure, and homophily. Typically, social networks are formed by preferential attachment, and they exhibit power-law degree distributions.

The problem of clustering social networks is challenging because of the presence of hub nodes, and the natural tendency of social networks to cluster into a single large group. Therefore, most community detection algorithms have built-in mechanisms to ensure that the underlying clusters are balanced. Clustering methods are also sometimes referred to as graph- partitioning. One of the earliest clustering methods was the Kernighan–Lin method, which uses an iterative approach for clustering. Nodes are repeatedly exchanged between partitions to iteratively improve the value of the objective function. The Girvan–Newman algorithm uses notions of betweenness centrality to generate clusters. The METIS algorithm generates an eﬃcient partition by using coarsening and then creating the partitions on the coarsened representation. The spectral method uses multidimensional embeddings to generate the clusters.

In collective classification, the goal is to infer labels at the remaining vertices from the pre-existing labels at a subset of the vertices. This is a problem that has dual applicability to social network analysis and semisupervised learning. Multidimensional data sets can be transformed into similarity graphs to apply collective classification methods. The most common methods used for collective classification include iterative methods, random walk-based label propagation methods, and spectral methods.

In the link-prediction problem, the goal is to predict the links from the currently avail-able structure and content in the network. Structural measures are generally much more eﬀective for link-prediction than content-based measures. The structural methods use local clustering measures such as the Jaccard measure or personalized PageRank values for mak-ing predictions. Supervised methods are able to discriminatively determine the most relevant features for link prediction.

Social networks are often used for influencing individuals using “word-of-mouth” tech-niques. Typically, centrally located actors are more influential in the network. Diﬀusion models are used to characterize the flow of information in social networks. Two examples of such models include the linear threshold model and the independent cascade model.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 382 383 384 385 386 387 388 389 ... 423