Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə386/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   382   383   384   385   386   387   388   389   ...   423
1-Data Mining tarjima

j:(i,j)∈A,j is active

The node i becomes active in a step when I(i) ≥ θi. This process is repeated until no further nodes can be activated. The total influence f (S) may be measured as the number of nodes activated by a given seed set S. The influence f (S) of a given seed set S is typically computed with simulation methods.


19.6.2 Independent Cascade Model


In the aforementioned linear threshold model, once a node becomes active, it has multiple chances to influence its neighbors. The random variable θi was associated with a node, in the form of a threshold. On the other hand, in the independent cascade model, after a node becomes active, it obtains only a single chance to activate its neighbors, with propagation probabilities associated with the edges. The propagation probability associated with an edge is denoted by pij . In each iteration, only the newly active nodes are allowed to influence their neighbors, that have not already been activated. For a given node j, each of the edges (i, j) joining it to its newly active neighbors i flips a coin independently with success probability pij . If the coin toss for edge (i, j) results in a success, then the node j is activated. If node j is activated, it will get a single chance in the next iteration to influence its neighbors. In the event that no nodes are newly activated in an iteration, the algorithm terminates. The influence function value is equal to the number of active nodes at termination. Because nodes are allowed to influence their neighbors only once over the course of the algorithm, a coin is tossed for each edge at most once over the course of the algorithm.


19.6.3 Influence Function Evaluation


Both the linear threshold model and the independent cascade model are designed to compute the influence function f (S) with the use of a model. The estimation of f (S) is typically accomplished with simulation.


For example, consider the case of the linear threshold model. For a given seed node set S, one can use a random number generator to set the thresholds at the nodes. After the thresholds have been set, the active nodes can be labeled using any deterministic graph-search algorithm starting from the seed nodes in S and progressively activating nodes when the threshold condition is satisfied. The computation can be repeated over different sets of randomly generated thresholds, and the results may be averaged to obtain more robust estimates.


In the independent cascade model, a different simulation may be used. A coin with probability pij may be flipped for each edge. The edge is designated as live if the coin toss was a success. It can be shown that a node will eventually be activated by the independent


658 CHAPTER 19. SOCIAL NETWORK ANALYSIS


cascade model, when a path of live edges exists from at least one node in S to it. This can be used to estimate the size of the (final) active set by simulation. The computation is repeated over different runs and the results are averaged.


The proof that the linear threshold model and the independent cascade model are sub-modular optimization problems can be found in pointers included in the bibliographic notes. However, this property is not specific to these models. Submodularity is a very natural conse-quence of the laws of diminishing returns, as applied to the incremental impact of individual influence in larger groups. As a result, most reasonable models for influence analysis will satisfy submodularity.


19.7 Summary

Social networks have become increasingly popular in recent years, because of their ability to connect geographically and culturally diverse participants. A significant amount of data is created because of the actions of social network participants. Much of this data are structural, in the form of relationships between different individuals.


Social network structures exhibit a number of typical properties, because of the natural dynamics of their formation. The most important similarity-based properties include triadic closure, and homophily. Typically, social networks are formed by preferential attachment, and they exhibit power-law degree distributions.


The problem of clustering social networks is challenging because of the presence of hub nodes, and the natural tendency of social networks to cluster into a single large group. Therefore, most community detection algorithms have built-in mechanisms to ensure that the underlying clusters are balanced. Clustering methods are also sometimes referred to as graph- partitioning. One of the earliest clustering methods was the Kernighan–Lin method, which uses an iterative approach for clustering. Nodes are repeatedly exchanged between partitions to iteratively improve the value of the objective function. The Girvan–Newman algorithm uses notions of betweenness centrality to generate clusters. The METIS algorithm generates an efficient partition by using coarsening and then creating the partitions on the coarsened representation. The spectral method uses multidimensional embeddings to generate the clusters.


In collective classification, the goal is to infer labels at the remaining vertices from the pre-existing labels at a subset of the vertices. This is a problem that has dual applicability to social network analysis and semisupervised learning. Multidimensional data sets can be transformed into similarity graphs to apply collective classification methods. The most common methods used for collective classification include iterative methods, random walk-based label propagation methods, and spectral methods.


In the link-prediction problem, the goal is to predict the links from the currently avail-able structure and content in the network. Structural measures are generally much more effective for link-prediction than content-based measures. The structural methods use local clustering measures such as the Jaccard measure or personalized PageRank values for mak-ing predictions. Supervised methods are able to discriminatively determine the most relevant features for link prediction.


Social networks are often used for influencing individuals using “word-of-mouth” tech-niques. Typically, centrally located actors are more influential in the network. Diffusion models are used to characterize the flow of information in social networks. Two examples of such models include the linear threshold model and the independent cascade model.




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   382   383   384   385   386   387   388   389   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin