Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə363/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   359   360   361   362   363   364   365   366   ...   423
1-Data Mining tarjima

18.9. EXERCISES

617




  1. Perform a Google search on “mining text data” and “text data mining.” Do you get the same top-10 search results? What does this tell you about the content component of the ranking heuristic used by search engines?




  1. Show that the PageRank computation with teleportation is an eigenvector computa-tion on an appropriately constructed probability transition matrix.




  1. Show that the hub and authority scores in HITS can be computed by dominant eigenvector computations on AAT and AT A respectively. Here, A is the adjacency matrix of the graph G = (S, A), as defined in the chapter.




  1. Show that the largest eigenvalue of a stochastic transition matrix is always 1.




  1. Suppose that you are told that a particular transition matrix P can be diagonalized as P = V ΛV −1, where Λ is diagonal. How can you use this result to efficiently determine the k-hop transition matrix which defines the probability of a transition between each pair of nodes in k hops? What would you do for the special case when k = ? Does the result hold if we allow the entries of P and V to be complex numbers?




  1. Apply the PageRank algorithm to the graph of Fig. 18.2b, using teleportation probabil-ities of 0.1, 0.2, and 0.4, respectively. What is the impact on the dead-end component (probabilities) of increasing the teleportation probabilities?




  1. Repeat the previous exercise, except that the restart is performed from node 1. How are steady-state probabilities affected by increasing the teleportation probability?




  1. Show that the transition matrix of the graph of Fig. 18.4.1b will have more than one eigenvector with an eigenvalue of 1. Why is the eigenvector with unit eigenvalue not unique in this case?




  1. Implement the neighborhood-based approach for collaborative filtering on a ratings matrix.




  1. Implement the personalized PageRank approach for collaborative filtering on a positive-preference utility matrix.




  1. Apply the PageRank algorithm to the example of Fig. 18.5 by setting restart proba-bilities to 0.1, 0.2, and 0.4, respectively.




  1. Apply the personalized PageRank algorithm to the example of Fig. 18.5 by restarting at node Gladiator, and with restart probabilities of 0.1, 0.2, and 0.4, respectively. What does this tell you about the most relevant users for the movie Gladiator What does this tell you about the most relevant user for the movie “Gladiator,” who has not already watched this movie? Is it possible for the most relevant user to change with teleportation probability? What is the intuitive significance of the teleportation probability from an application-specific perspective?




  1. Construct the optimization formulation for the matrix factorization problem for incomplete matrices.




  1. In the bipartite graph of Fig. 18.5, what is the SimRank value between a user node and an item node? In this light, explain the weakness of the SimRank model.



Chapter 19


Social Network Analysis

I hope we will use the Net to cross barriers and connect cultures.”—Tim Berners-Lee


19.1 Introduction


The tendency of humans to connect with one another is a deep-rooted social need that precedes the advent of the Web and Internet technologies. In the past, social interactions were achieved through face-to-face contact, postal mail, and telecommunication technolo-gies. The last of these is also relatively recent when compared with the history of mankind. However, the popularization of the Web and Internet technologies has opened up entirely new avenues for enabling the seamless interaction of geographically distributed participants. This extraordinary potential of the Web was observed during its infancy by its visionary founders. However, it required a decade before the true social potential of the Web could be realized. Even today, Web-based social applications continue to evolve and create an ever-increasing amount of data. This data is a treasure trove of information about user preferences, their connections, and their influences on others. Therefore, it is natural to leverage this data for analytical insights.


Although social networks are popularly understood in the context of large online net-works such as Twitter, LinkedIn, and Facebook, such networks represent only a small minor-ity of the interaction mechanisms enabled by the Web. In fact, the traditional study of social network analysis in the field of sociology precedes the popularization of technologi-cally enabled mechanisms. Much of the discussion in this chapter applies to social networks that extend beyond the popular notions of online social networks. Some examples are as follows:





  • Social networks have been studied extensively in the field of sociology for more than a century but not from an online perspective. Data collection was rather difficult in these scenarios because of the lack of adequate technological mechanisms. Therefore, these studies were often conducted with painstaking and laborious methods for manual data collection. An example of such an effort is Stanley Milgram’s famous six degrees of separation experiment in the sixties, which used postal mail between participants




C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8 19

619

c Springer International Publishing Switzerland 2015



620 CHAPTER 19. SOCIAL NETWORK ANALYSIS

to test whether two arbitrary humans on the planet could be connected by a chain of six relationships. Because of the difficulty in verifying local forwards of mail, such experiments were often hard to conduct in a trustworthy way. Nevertheless, in spite of the obvious flaws in the experimental setting, these results have recently been shown to be applicable to online social networks, where the relationships between individuals are more easily quantifiable.





  • A number of technological enablers, such as telecommunications, email, and electronic chat messengers, can be considered indirect forms of social networks. Such enablers result in communications between different individuals, and therefore they have a natural social aspect.




  • Sites that are used for sharing online media content, such as Flickr, YouTube, or Deli-cious, can also be considered indirect forms of social networks, because they allow an extensive level of user interaction. In addition, social media outlets provide a num-ber of unique ways for users to interact with one another. Examples include posting blogs or tagging each other’s images. In these cases, the interaction is centered around a specific service such as content-sharing; yet many fundamental principles of social networking apply. Such social networks are extremely rich from the perspective of min-ing applications. They contain a tremendous amount of content such as text, images, audio, or video.




  • A number of social networks can be constructed from specific kinds of interactions in professional communities. Scientific communities are organized into bibliographic and citation networks. These networks are also content rich because they are organized around publications.

It is evident that these different kinds of networks illustrate different facets of social network analysis. Many of the fundamental problems discussed in this chapter apply to these different scenarios but in different settings. Most of the traditional problems in data mining, such as clustering and classification, can also be extended to social network analysis. Furthermore, a number of more complex problem definitions are possible, such as link prediction and social influence analysis, because of the greater complexity of networks as compared to other kinds of data.


This chapter is organized as follows. Section 19.2 discusses a number of fundamental properties of social network analysis. The problem of community detection is explained in Sect. 19.3. The collective classification problem is discussed in Sect. 19.4. Section 19.5 discusses the link prediction problem. The social influence analysis problem is addressed in Sect. 19.6. The chapter summary is presented in Sect. 19.7.


19.2 Social Networks: Preliminaries and Properties


It is assumed that the social network can be structured as a graph G = (N, A), where N is the set of nodes and A is the set of edges. Each individual in the social network is represented by a node in N , and is also referred to as an actor. The edges represent the connections between the different actors. In a social network such as Facebook, these edges correspond to friendship links. Typically, these links are undirected, although it is also possible for some “follower-based” social networks, such as Twitter, to have directed links. By default, it will be assumed that the network G = (N, A) is undirected, unless otherwise specified. In some cases, the nodes in N may have content associated with them. This content may




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   359   360   361   362   363   364   365   366   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin